Feature Request: Implement Suffix Array and Suffix Tree
A Suffix Array for a string S of length n is an array of integers in the range [0, n-1] that specifies the lexicographical order of all suffixes of S. It is often used in conjunction with the Longest Common Prefix (LCP) array, which stores the length of the longest common prefix between consecutive suffixes in the sorted suffix array.
A Suffix Tree for a string S is a rooted tree where each edge represents a non-empty substring of S, and each path from the root to a leaf node represents a unique suffix of S. Internal nodes (except possibly the root) have at least two children.
Use Cases:
Both Suffix Arrays and Suffix Trees have numerous applications in various fields, including:
Bioinformatics: Sequence alignment, finding repeated patterns in DNA or protein sequences. Text Processing: Full-text search, indexing, text compression. Data Compression: Algorithms like LZ77 use concepts related to finding repeated substrings. Information Retrieval: Efficiently searching large text corpora. Computational Linguistics: Analyzing text structure and patterns. Finding the Longest Common Substring of multiple strings. Finding the Longest Palindromic Substring. String Alignment.
I propose implementing the following:
Suffix Array:
A class SuffixArray that takes a string as input during initialization. A method to build the suffix array. A method to optionally compute the LCP array. Methods for searching for a pattern within the text using the suffix array.
Suffix Tree:
A class SuffixTree that takes a string as input during initialization. A method to build the suffix tree.
I am interested in contributing to the implementation of these data structures. I would appreciate feedback from the project maintainers on this proposal.