// text_32 returns the suffix array for the input text. // It requires that len(text) fit in an int32 // and that the caller zero sa. func text_32(text []byte, sa []int32) { … } // sais_8_32 computes the suffix array of text. // The text must contain only values in [0, textMax). // The suffix array is stored in sa, which the caller // must ensure is already zeroed. // The caller must also provide temporary space tmp // with len(tmp) ≥ textMax. If len(tmp) ≥ 2*textMax // then the algorithm runs a little faster. // If sais_8_32 modifies tmp, it sets tmp[0] = -1 on return. func sais_8_32(text []byte, textMax int, sa, tmp []int32) { … } // freq_8_32 returns the character frequencies // for text, as a slice indexed by character value. // If freq is nil, freq_8_32 uses and returns bucket. // If freq is non-nil, freq_8_32 assumes that freq[0] >= 0 // means the frequencies are already computed. // If the frequency data is overwritten or uninitialized, // the caller must set freq[0] = -1 to force recomputation // the next time it is needed. func freq_8_32(text []byte, freq, bucket []int32) []int32 { … } // bucketMin_8_32 stores into bucket[c] the minimum index // in the bucket for character c in a bucket-sort of text. func bucketMin_8_32(text []byte, freq, bucket []int32) { … } // bucketMax_8_32 stores into bucket[c] the maximum index // in the bucket for character c in a bucket-sort of text. // The bucket indexes for c are [min, max). // That is, max is one past the final index in that bucket. func bucketMax_8_32(text []byte, freq, bucket []int32) { … } // placeLMS_8_32 places into sa the indexes of the // final characters of the LMS substrings of text, // sorted into the rightmost ends of their correct buckets // in the suffix array. // // The imaginary sentinel character at the end of the text // is the final character of the final LMS substring, but there // is no bucket for the imaginary sentinel character, // which has a smaller value than any real character. // The caller must therefore pretend that sa[-1] == len(text). // // The text indexes of LMS-substring characters are always ≥ 1 // (the first LMS-substring must be preceded by one or more L-type // characters that are not part of any LMS-substring), // so using 0 as a “not present” suffix array entry is safe, // both in this function and in most later functions // (until induceL_8_32 below). func placeLMS_8_32(text []byte, sa, freq, bucket []int32) int { … } // induceSubL_8_32 inserts the L-type text indexes of LMS-substrings // into sa, assuming that the final characters of the LMS-substrings // are already inserted into sa, sorted by final character, and at the // right (not left) end of the corresponding character bucket. // Each LMS-substring has the form (as a regexp) /S+L+S/: // one or more S-type, one or more L-type, final S-type. // induceSubL_8_32 leaves behind only the leftmost L-type text // index for each LMS-substring. That is, it removes the final S-type // indexes that are present on entry, and it inserts but then removes // the interior L-type indexes too. // (Only the leftmost L-type index is needed by induceSubS_8_32.) func induceSubL_8_32(text []byte, sa, freq, bucket []int32) { … } // induceSubS_8_32 inserts the S-type text indexes of LMS-substrings // into sa, assuming that the leftmost L-type text indexes are already // inserted into sa, sorted by LMS-substring suffix, and at the // left end of the corresponding character bucket. // Each LMS-substring has the form (as a regexp) /S+L+S/: // one or more S-type, one or more L-type, final S-type. // induceSubS_8_32 leaves behind only the leftmost S-type text // index for each LMS-substring, in sorted order, at the right end of sa. // That is, it removes the L-type indexes that are present on entry, // and it inserts but then removes the interior S-type indexes too, // leaving the LMS-substring start indexes packed into sa[len(sa)-numLMS:]. // (Only the LMS-substring start indexes are processed by the recursion.) func induceSubS_8_32(text []byte, sa, freq, bucket []int32) { … } // length_8_32 computes and records the length of each LMS-substring in text. // The length of the LMS-substring at index j is stored at sa[j/2], // avoiding the LMS-substring indexes already stored in the top half of sa. // (If index j is an LMS-substring start, then index j-1 is type L and cannot be.) // There are two exceptions, made for optimizations in name_8_32 below. // // First, the final LMS-substring is recorded as having length 0, which is otherwise // impossible, instead of giving it a length that includes the implicit sentinel. // This ensures the final LMS-substring has length unequal to all others // and therefore can be detected as different without text comparison // (it is unequal because it is the only one that ends in the implicit sentinel, // and the text comparison would be problematic since the implicit sentinel // is not actually present at text[len(text)]). // // Second, to avoid text comparison entirely, if an LMS-substring is very short, // sa[j/2] records its actual text instead of its length, so that if two such // substrings have matching “length,” the text need not be read at all. // The definition of “very short” is that the text bytes must pack into a uint32, // and the unsigned encoding e must be ≥ len(text), so that it can be // distinguished from a valid length. func length_8_32(text []byte, sa []int32, numLMS int) { … } // assignID_8_32 assigns a dense ID numbering to the // set of LMS-substrings respecting string ordering and equality, // returning the maximum assigned ID. // For example given the input "ababab", the LMS-substrings // are "aba", "aba", and "ab", renumbered as 2 2 1. // sa[len(sa)-numLMS:] holds the LMS-substring indexes // sorted in string order, so to assign numbers we can // consider each in turn, removing adjacent duplicates. // The new ID for the LMS-substring at index j is written to sa[j/2], // overwriting the length previously stored there (by length_8_32 above). func assignID_8_32(text []byte, sa []int32, numLMS int) int { … } // map_32 maps the LMS-substrings in text to their new IDs, // producing the subproblem for the recursion. // The mapping itself was mostly applied by assignID_8_32: // sa[i] is either 0, the ID for the LMS-substring at index 2*i, // or the ID for the LMS-substring at index 2*i+1. // To produce the subproblem we need only remove the zeros // and change ID into ID-1 (our IDs start at 1, but text chars start at 0). // // map_32 packs the result, which is the input to the recursion, // into the top of sa, so that the recursion result can be stored // in the bottom of sa, which sets up for expand_8_32 well. func map_32(sa []int32, numLMS int) { … } // recurse_32 calls sais_32 recursively to solve the subproblem we've built. // The subproblem is at the right end of sa, the suffix array result will be // written at the left end of sa, and the middle of sa is available for use as // temporary frequency and bucket storage. func recurse_32(sa, oldTmp []int32, numLMS, maxID int) { … } // unmap_8_32 unmaps the subproblem back to the original. // sa[:numLMS] is the LMS-substring numbers, which don't matter much anymore. // sa[len(sa)-numLMS:] is the sorted list of those LMS-substring numbers. // The key part is that if the list says K that means the K'th substring. // We can replace sa[:numLMS] with the indexes of the LMS-substrings. // Then if the list says K it really means sa[K]. // Having mapped the list back to LMS-substring indexes, // we can place those into the right buckets. func unmap_8_32(text []byte, sa []int32, numLMS int) { … } // expand_8_32 distributes the compacted, sorted LMS-suffix indexes // from sa[:numLMS] into the tops of the appropriate buckets in sa, // preserving the sorted order and making room for the L-type indexes // to be slotted into the sorted sequence by induceL_8_32. func expand_8_32(text []byte, freq, bucket, sa []int32, numLMS int) { … } // induceL_8_32 inserts L-type text indexes into sa, // assuming that the leftmost S-type indexes are inserted // into sa, in sorted order, in the right bucket halves. // It leaves all the L-type indexes in sa, but the // leftmost L-type indexes are negated, to mark them // for processing by induceS_8_32. func induceL_8_32(text []byte, sa, freq, bucket []int32) { … } func induceS_8_32(text []byte, sa, freq, bucket []int32) { … }