pytextrank
Documentation or Inclusion of other algorithms
The models and algorithms in https://github.com/boudinfl/pke#implemented-models are similar to TextRank but are not sped up by spaCy, so it might be a good idea to include them in PyTextRank.
PS: There are also other, non-TextRank-esque algorithms to consider when making this assessment:
- RAKE https://github.com/aneesha/RAKE and https://github.com/csurfer/rake-nltk and https://github.com/vgrabovets/multi_rake and https://github.com/chinwuDebug/RAKE_improve
- YAKE https://github.com/LIAAD/yake
- Aho–Corasick algorithm https://github.com/dav009/flash
- RaKUn https://github.com/Parsely/serpextract
Thanks for bringing pke to our attention!
This issue is similar to #78, on which we have already made great progress with 2 contributions:
- adding `PositionRank` and `BiasedRank`
- adding `BaseTextRank` and `BaseTextRankFactory` to enable integration of more flavours
Regarding the graph-based models of pke, I can see this:
- their TextRank can be achieved with our `BaseTextRank(edge_weight=0)`
- their SingleRank can be achieved with our `BaseTextRank()` or `BaseTextRank(edge_weight=1.0)`
- their PositionRank can be achieved with our `PositionRank`
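For readers unfamiliar with the TextRank / SingleRank distinction mentioned above, here is a minimal, standard-library-only sketch. It is not the code used by pytextrank or pke. It assumes the usual reading of the two papers: TextRank ranks words on an unweighted co-occurrence graph, while SingleRank weights each edge by its co-occurrence count. The function names `cooccurrence_edges` and `pagerank`, the window size, and the toy token list are all made up for illustration.

```python
from collections import Counter


def cooccurrence_edges(tokens, window=2):
    """Count how often each unordered token pair co-occurs within `window` positions."""
    counts = Counter()
    for i, left in enumerate(tokens):
        for right in tokens[i + 1 : i + 1 + window]:
            if left != right:
                counts[frozenset((left, right))] += 1
    return counts


def pagerank(edges, weighted, damping=0.85, iterations=50):
    """Power-iteration PageRank over an undirected co-occurrence graph.

    weighted=False ignores the counts (TextRank-style),
    weighted=True uses the counts as edge weights (SingleRank-style).
    """
    neighbours = {}
    for pair, count in edges.items():
        a, b = tuple(pair)
        w = float(count) if weighted else 1.0
        neighbours.setdefault(a, {})[b] = w
        neighbours.setdefault(b, {})[a] = w

    nodes = sorted(neighbours)
    rank = {n: 1.0 / len(nodes) for n in nodes}
    out_weight = {n: sum(neighbours[n].values()) for n in nodes}

    for _ in range(iterations):
        # Each node receives a share of every neighbour's rank,
        # proportional to the weight of the connecting edge.
        rank = {
            n: (1 - damping) / len(nodes)
            + damping
            * sum(rank[m] * w / out_weight[m] for m, w in neighbours[n].items())
            for n in nodes
        }
    return rank


# Toy input: "keyword" and "graph" co-occur three times, "graph" and "rank" once.
tokens = ["keyword", "graph", "keyword", "graph", "rank"]
edges = cooccurrence_edges(tokens, window=1)

unweighted = pagerank(edges, weighted=False)  # TextRank-style
weighted = pagerank(edges, weighted=True)     # SingleRank-style
```

On this toy input the unweighted variant cannot tell "keyword" and "rank" apart, because both are simply neighbours of "graph". The weighted variant ranks "keyword" higher because its edge to "graph" carries more co-occurrences.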
The following ones are missing:
- TopicRank paper by (Bougouin et al., 2013)
- TopicalPageRank article by (Sterckx et al., 2015)
- MultipartiteRank article by (Boudin, 2018)
I was not aware of these 3 papers and approaches, so thank you. Do you have experience with them in practice, and are they good? Would you be open to contributing them?
I am mainly reporting them as notes for the documentation, but if I can, I would contribute.
Also, an extra note: https://github.com/miso-belica/sumy/blob/master/docs/alternatives.md
- Bipartite HITS https://github.com/himanshujindal/Automatic-Text-Summarizer
- LexRank https://github.com/giorgosera/pythia/blob/dev/analysis/summarization/summarization.py https://github.com/kylehg/summarizer
- topic models https://github.com/bobflagg/Topic-Networks
- MEAD http://www.summarization.com/mead/
- Luhn Summ https://github.com/talha1503/Extractive_Text_Summarization/blob/master/luhn_sum.py
- SumBasic https://github.com/talha1503/Extractive_Text_Summarization/blob/master/SumBasic.ipynb
To reiterate the current algorithms that are not included:
- [ ] SumBasic by Nenkova et al. and its Repository in Python
- [ ] LexRank by Erkan et al. and its Repository in Python
- [ ] SalienceRank by Teneva et al. and its Repository in Python
- [ ] KEA by Witten et al. and its Repository in Java
- [ ] UniKeyphrase by Wu et al. and its Repository in Python
- [ ] https://github.com/boudinfl/pke#implemented-models
- [ ] TopicRank paper by (Bougouin et al., 2013)
- [ ] TopicalPageRank article by (Sterckx et al., 2015)
- [ ] MultipartiteRank article by (Boudin, 2018)
Also looking at:
- [ ] JAKE https://github.com/xcjackpan/jake
- [ ] Crackr https://github.com/anjishnu/Crackr
Also check the algorithms listed in pke (https://github.com/boudinfl/pke), which has an excellent range of implementations. FWIW, that library is GPL and not implemented as a spaCy pipeline, so there's some room for algorithm implementations both there (for research) and here (for production deployments).