pydatastructs icon indicating copy to clipboard operation
pydatastructs copied to clipboard

Add Suffix Tree

Open prshnt19 opened this issue 4 years ago • 15 comments

Description of the problem

Example of the problem

References/Other comments

[1] https://en.wikipedia.org/wiki/Suffix_tree

prshnt19 avatar Jun 23 '20 09:06 prshnt19

can you please assign this issue to me? @prshnt19

subhangi2731 avatar Dec 07 '20 03:12 subhangi2731

Please read https://github.com/codezonediitj/pydatastructs/wiki/Issue-Policy

prshnt19 avatar Dec 07 '20 04:12 prshnt19

Thanks @prshnt19 for sharing the issue policy. @subhangi2731 I'd request you to go through https://github.com/codezonediitj/pydatastructs/wiki/Plan-of-Action-for-Adding-New-Data-Structures since requires adding a new adding new data structure from scratch.

czgdp1807 avatar Dec 07 '20 07:12 czgdp1807

Let me take up this one @czgdp1807

Arvind-raj06 avatar Jan 19 '21 14:01 Arvind-raj06

Sure.

czgdp1807 avatar Jan 19 '21 15:01 czgdp1807

suffix I believe the API design would be like the below part

__new__(cls, array_flag = 0, key=None, root_data=None)

This would initialise a new suffix tree.

Rest of the parameters will be same as other tree implementation.

insert(self, key, data) delete(self, key, **kwargs) __str__(self)

height(self) Will return the height of the tree.

Additional features

longest_common_substring This method will return the longest common substring from the string comparison with tree

longest_repeated_substring This method will return the longest repeated substring from the string comparison with tree

Arvind-raj06 avatar Jan 20 '21 07:01 Arvind-raj06

Can you pick example where suffix tree is used and show how above APIs will be used in that example? There might be an example on Wikipedia or some other website.

czgdp1807 avatar Jan 20 '21 07:01 czgdp1807

https://www.cise.ufl.edu/~sahni/dsaaj/enrich/c16/suffix.htm hope in this site we get the example with analyzation and implementation example is as https://www.geeksforgeeks.org/ukkonens-suffix-tree-construction-part-6/?ref=rp

Arvind-raj06 avatar Jan 20 '21 08:01 Arvind-raj06

As suffix tree isn't like any other binary tree i hope it has to be implemented in separate file and shall I go on with that @czgdp1807

Arvind-raj06 avatar Jan 20 '21 13:01 Arvind-raj06

Following are the questions and concerns,

  1. Is Suffix tree static or dynamic? In other words, is it possible to update it after constructing it from a string/set of strings? Or once created, it can be changed? Can you provide university lectures/papers which provide algorithms updating a suffix tree. I am unable to find anything relevant on Wikipedia or in this document.
  2. As of now we should go for implementing the applications in https://en.wikipedia.org/wiki/Suffix_tree#Applications as methods of the SuffixTree class. Later on once we suffix tree is added, more features can be added given in, https://en.wikipedia.org/wiki/Suffix_tree#Implementation

Comments on API

new(cls, array_flag = 0, key=None, root_data=None)

I feel that this is a bit complicated. Most of the literature writes about creating suffix trees from a set of strings. So, the API should be as simple as,

__new__(cls, strings)

strings should be an iterable(list, set, tuple, arrays, etc).

insert(self, key, data) delete(self, key, **kwargs)

This depends on whether we want to keep static or dynamic. If there are algorithms for updating a dynamic trees then we can go for it. However, what is meant by key? Shouldn't it be simply a string.

height(self) Will return the height of the tree.

Why do we need it? Mostly suffix trees are used to perform queries on strings, so this method may not provide very useful information. Any examples where height can be required by the end user?

longest_common_substring This method will return the longest common substring from the string comparison with tree

longest_repeated_substring This method will return the longest repeated substring from the string comparison with tree

They both should accept strings (set of strings or a single string should be decided upon). More features given in https://en.wikipedia.org/wiki/Suffix_tree#Applications should also be added.

czgdp1807 avatar Jan 21 '21 05:01 czgdp1807

As far my research on the above tree after the conversation !. It is static and cannot changed after implementation as given in reference with https://users.dcc.uchile.cl/~gnavarro/ps/cpm08.2.pdf 2. Cool then just the implementation of suffix tree

I will take those changes and reframe the API

Arvind-raj06 avatar Jan 21 '21 06:01 Arvind-raj06

suffix

__new__(cls, strings)

The implementation of several features are quite complicated I hope it should be done step by step.

Arvind-raj06 avatar Jan 22 '21 16:01 Arvind-raj06

Sounds good. Though, we should have generalised suffix tree which should be capable of storing information about a set of strings. The tricky part is to end each string in the set with a different terminal symbol.

The implementation of several features are quite complicated I hope it should be done step by step.

You can start with implementing the logic for building the suffix tree and then add features in each new commit. Please do it locally and push only when the tests you add for suffix tree are passing.

czgdp1807 avatar Jan 23 '21 05:01 czgdp1807

@czgdp1807 Cool, I'm working on that

Arvind-raj06 avatar Jan 23 '21 16:01 Arvind-raj06

If this issue is open can u please assign me

subhangi2731 avatar May 10 '21 07:05 subhangi2731