pydatastructs
pydatastructs copied to clipboard
Add Suffix Tree
Description of the problem
Example of the problem
References/Other comments
[1] https://en.wikipedia.org/wiki/Suffix_tree
can you please assign this issue to me? @prshnt19
Please read https://github.com/codezonediitj/pydatastructs/wiki/Issue-Policy
Thanks @prshnt19 for sharing the issue policy. @subhangi2731 I'd request you to go through https://github.com/codezonediitj/pydatastructs/wiki/Plan-of-Action-for-Adding-New-Data-Structures since requires adding a new adding new data structure from scratch.
Let me take up this one @czgdp1807
Sure.
suffix
I believe the API design would be like the below part
__new__(cls, array_flag = 0, key=None, root_data=None)
This would initialise a new suffix tree.
Rest of the parameters will be same as other tree implementation.
insert(self, key, data)
delete(self, key, **kwargs)
__str__(self)
height(self)
Will return the height of the tree.
Additional features
longest_common_substring
This method will return the longest common substring from the string comparison with tree
longest_repeated_substring
This method will return the longest repeated substring from the string comparison with tree
Can you pick example where suffix tree is used and show how above APIs will be used in that example? There might be an example on Wikipedia or some other website.
https://www.cise.ufl.edu/~sahni/dsaaj/enrich/c16/suffix.htm hope in this site we get the example with analyzation and implementation example is as https://www.geeksforgeeks.org/ukkonens-suffix-tree-construction-part-6/?ref=rp
As suffix tree isn't like any other binary tree i hope it has to be implemented in separate file and shall I go on with that @czgdp1807
Following are the questions and concerns,
- Is Suffix tree static or dynamic? In other words, is it possible to update it after constructing it from a string/set of strings? Or once created, it can be changed? Can you provide university lectures/papers which provide algorithms updating a suffix tree. I am unable to find anything relevant on Wikipedia or in this document.
- As of now we should go for implementing the applications in https://en.wikipedia.org/wiki/Suffix_tree#Applications as methods of the
SuffixTree
class. Later on once we suffix tree is added, more features can be added given in, https://en.wikipedia.org/wiki/Suffix_tree#Implementation
Comments on API
new(cls, array_flag = 0, key=None, root_data=None)
I feel that this is a bit complicated. Most of the literature writes about creating suffix trees from a set of strings. So, the API should be as simple as,
__new__(cls, strings)
strings
should be an iterable(list, set, tuple, arrays, etc).
insert(self, key, data)
delete(self, key, **kwargs)
This depends on whether we want to keep static or dynamic. If there are algorithms for updating a dynamic trees then we can go for it. However, what is meant by key
? Shouldn't it be simply a string
.
height(self)
Will return the height of the tree.
Why do we need it? Mostly suffix trees are used to perform queries on strings, so this method may not provide very useful information. Any examples where height
can be required by the end user?
longest_common_substring
This method will return the longest common substring from the string comparison with tree
longest_repeated_substring
This method will return the longest repeated substring from the string comparison with tree
They both should accept strings (set of strings or a single string should be decided upon). More features given in https://en.wikipedia.org/wiki/Suffix_tree#Applications should also be added.
As far my research on the above tree after the conversation !. It is static and cannot changed after implementation as given in reference with https://users.dcc.uchile.cl/~gnavarro/ps/cpm08.2.pdf 2. Cool then just the implementation of suffix tree
I will take those changes and reframe the API
suffix
__new__(cls, strings)
The implementation of several features are quite complicated I hope it should be done step by step.
Sounds good. Though, we should have generalised suffix tree which should be capable of storing information about a set of strings. The tricky part is to end each string in the set with a different terminal symbol.
The implementation of several features are quite complicated I hope it should be done step by step.
You can start with implementing the logic for building the suffix tree and then add features in each new commit. Please do it locally and push only when the tests you add for suffix tree are passing.
@czgdp1807 Cool, I'm working on that
If this issue is open can u please assign me