sais.go | Explore in Territory

// text_32 returns the suffix array for the input text.
// It requires that len(text) fit in an int32
// and that the caller zero sa.
func text_32(text []byte, sa []int32) { … }

// sais_8_32 computes the suffix array of text.
// The text must contain only values in [0, textMax).
// The suffix array is stored in sa, which the caller
// must ensure is already zeroed.
// The caller must also provide temporary space tmp
// with len(tmp) ≥ textMax. If len(tmp) ≥ 2*textMax
// then the algorithm runs a little faster.
// If sais_8_32 modifies tmp, it sets tmp[0] = -1 on return.
func sais_8_32(text []byte, textMax int, sa, tmp []int32) { … }

// freq_8_32 returns the character frequencies
// for text, as a slice indexed by character value.
// If freq is nil, freq_8_32 uses and returns bucket.
// If freq is non-nil, freq_8_32 assumes that freq[0] >= 0
// means the frequencies are already computed.
// If the frequency data is overwritten or uninitialized,
// the caller must set freq[0] = -1 to force recomputation
// the next time it is needed.
func freq_8_32(text []byte, freq, bucket []int32) []int32 { … }

// bucketMin_8_32 stores into bucket[c] the minimum index
// in the bucket for character c in a bucket-sort of text.
func bucketMin_8_32(text []byte, freq, bucket []int32) { … }

// bucketMax_8_32 stores into bucket[c] the maximum index
// in the bucket for character c in a bucket-sort of text.
// The bucket indexes for c are [min, max).
// That is, max is one past the final index in that bucket.
func bucketMax_8_32(text []byte, freq, bucket []int32) { … }

// placeLMS_8_32 places into sa the indexes of the
// final characters of the LMS substrings of text,
// sorted into the rightmost ends of their correct buckets
// in the suffix array.
//
// The imaginary sentinel character at the end of the text
// is the final character of the final LMS substring, but there
// is no bucket for the imaginary sentinel character,
// which has a smaller value than any real character.
// The caller must therefore pretend that sa[-1] == len(text).
//
// The text indexes of LMS-substring characters are always ≥ 1
// (the first LMS-substring must be preceded by one or more L-type
// characters that are not part of any LMS-substring),
// so using 0 as a “not present” suffix array entry is safe,
// both in this function and in most later functions
// (until induceL_8_32 below).
func placeLMS_8_32(text []byte, sa, freq, bucket []int32) int { … }

// induceSubL_8_32 inserts the L-type text indexes of LMS-substrings
// into sa, assuming that the final characters of the LMS-substrings
// are already inserted into sa, sorted by final character, and at the
// right (not left) end of the corresponding character bucket.
// Each LMS-substring has the form (as a regexp) /S+L+S/:
// one or more S-type, one or more L-type, final S-type.
// induceSubL_8_32 leaves behind only the leftmost L-type text
// index for each LMS-substring. That is, it removes the final S-type
// indexes that are present on entry, and it inserts but then removes
// the interior L-type indexes too.
// (Only the leftmost L-type index is needed by induceSubS_8_32.)
func induceSubL_8_32(text []byte, sa, freq, bucket []int32) { … }

// induceSubS_8_32 inserts the S-type text indexes of LMS-substrings
// into sa, assuming that the leftmost L-type text indexes are already
// inserted into sa, sorted by LMS-substring suffix, and at the
// left end of the corresponding character bucket.
// Each LMS-substring has the form (as a regexp) /S+L+S/:
// one or more S-type, one or more L-type, final S-type.
// induceSubS_8_32 leaves behind only the leftmost S-type text
// index for each LMS-substring, in sorted order, at the right end of sa.
// That is, it removes the L-type indexes that are present on entry,
// and it inserts but then removes the interior S-type indexes too,
// leaving the LMS-substring start indexes packed into sa[len(sa)-numLMS:].
// (Only the LMS-substring start indexes are processed by the recursion.)
func induceSubS_8_32(text []byte, sa, freq, bucket []int32) { … }

// length_8_32 computes and records the length of each LMS-substring in text.
// The length of the LMS-substring at index j is stored at sa[j/2],
// avoiding the LMS-substring indexes already stored in the top half of sa.
// (If index j is an LMS-substring start, then index j-1 is type L and cannot be.)
// There are two exceptions, made for optimizations in name_8_32 below.
//
// First, the final LMS-substring is recorded as having length 0, which is otherwise
// impossible, instead of giving it a length that includes the implicit sentinel.
// This ensures the final LMS-substring has length unequal to all others
// and therefore can be detected as different without text comparison
// (it is unequal because it is the only one that ends in the implicit sentinel,
// and the text comparison would be problematic since the implicit sentinel
// is not actually present at text[len(text)]).
//
// Second, to avoid text comparison entirely, if an LMS-substring is very short,
// sa[j/2] records its actual text instead of its length, so that if two such
// substrings have matching “length,” the text need not be read at all.
// The definition of “very short” is that the text bytes must pack into a uint32,
// and the unsigned encoding e must be ≥ len(text), so that it can be
// distinguished from a valid length.
func length_8_32(text []byte, sa []int32, numLMS int) { … }

// assignID_8_32 assigns a dense ID numbering to the
// set of LMS-substrings respecting string ordering and equality,
// returning the maximum assigned ID.
// For example given the input "ababab", the LMS-substrings
// are "aba", "aba", and "ab", renumbered as 2 2 1.
// sa[len(sa)-numLMS:] holds the LMS-substring indexes
// sorted in string order, so to assign numbers we can
// consider each in turn, removing adjacent duplicates.
// The new ID for the LMS-substring at index j is written to sa[j/2],
// overwriting the length previously stored there (by length_8_32 above).
func assignID_8_32(text []byte, sa []int32, numLMS int) int { … }

// map_32 maps the LMS-substrings in text to their new IDs,
// producing the subproblem for the recursion.
// The mapping itself was mostly applied by assignID_8_32:
// sa[i] is either 0, the ID for the LMS-substring at index 2*i,
// or the ID for the LMS-substring at index 2*i+1.
// To produce the subproblem we need only remove the zeros
// and change ID into ID-1 (our IDs start at 1, but text chars start at 0).
//
// map_32 packs the result, which is the input to the recursion,
// into the top of sa, so that the recursion result can be stored
// in the bottom of sa, which sets up for expand_8_32 well.
func map_32(sa []int32, numLMS int) { … }

// recurse_32 calls sais_32 recursively to solve the subproblem we've built.
// The subproblem is at the right end of sa, the suffix array result will be
// written at the left end of sa, and the middle of sa is available for use as
// temporary frequency and bucket storage.
func recurse_32(sa, oldTmp []int32, numLMS, maxID int) { … }

// unmap_8_32 unmaps the subproblem back to the original.
// sa[:numLMS] is the LMS-substring numbers, which don't matter much anymore.
// sa[len(sa)-numLMS:] is the sorted list of those LMS-substring numbers.
// The key part is that if the list says K that means the K'th substring.
// We can replace sa[:numLMS] with the indexes of the LMS-substrings.
// Then if the list says K it really means sa[K].
// Having mapped the list back to LMS-substring indexes,
// we can place those into the right buckets.
func unmap_8_32(text []byte, sa []int32, numLMS int) { … }

// expand_8_32 distributes the compacted, sorted LMS-suffix indexes
// from sa[:numLMS] into the tops of the appropriate buckets in sa,
// preserving the sorted order and making room for the L-type indexes
// to be slotted into the sorted sequence by induceL_8_32.
func expand_8_32(text []byte, freq, bucket, sa []int32, numLMS int) { … }

// induceL_8_32 inserts L-type text indexes into sa,
// assuming that the leftmost S-type indexes are inserted
// into sa, in sorted order, in the right bucket halves.
// It leaves all the L-type indexes in sa, but the
// leftmost L-type indexes are negated, to mark them
// for processing by induceS_8_32.
func induceL_8_32(text []byte, sa, freq, bucket []int32) { … }

func induceS_8_32(text []byte, sa, freq, bucket []int32) { … }
go/src/index/suffixarray/sais.go