Fast_Sentence_Embeddings
maintenance
Hi Oliver, you have merged a few PRs but still have not released a new version. Is there a chance of that in the near future? Or, if you are busy with other projects, could you add maintainers to your repo? Me or the guy who did the recent PR would be more than happy to contribute and help this package survive. @oborchers
Hi @mathias3! Help would be very welcome. Although I don't have much time for this repository, it's still a good thing to have and fills a niche that cannot be filled by more recent advancements in the NLP world.
I've created version 0.1.17 and fixed the most glaring issues with the repository, mainly related to Gensim and Python incompatibilities.
There is also still the `develop` branch, which contains many fixes and new features that I originally planned to implement or that are partially implemented. For example, the code for the following is fully or partially there:
- Added Hierarchical (Convolutional) Embeddings for all Models
- Added MaxPooling
- Added Features to Sentencevectors
- Added further unittests
- Workaround for Numpy memmap issue (https://github.com/numpy/numpy/issues/13172)
- SVD ram subsampling for SIF / uSIF (customizable, standard is 1 GB of RAM)
- Minor fixes for nan-handling
- Minor fixes for sentencevectors class
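To illustrate the "SVD ram subsampling" item above: a minimal sketch, assuming the idea is to cap the number of sentence vectors passed to the SVD so that the sample fits into a fixed memory budget (1 GB by default, as stated in the list). This is not the code from the `develop` branch. The function names `subsample_rows_for_svd`, `first_principal_component` and `remove_component` are hypothetical, and only plain NumPy is used.

```python
import numpy as np


def subsample_rows_for_svd(vectors, max_bytes=1024**3, seed=0):
    """Return at most as many rows of `vectors` as fit into `max_bytes`.

    Hypothetical helper, not part of fse. If the whole matrix already
    fits into the budget, it is returned unchanged.
    """
    n_rows, dim = vectors.shape
    bytes_per_row = dim * vectors.dtype.itemsize
    max_rows = max(1, max_bytes // bytes_per_row)
    if n_rows <= max_rows:
        return vectors
    rng = np.random.default_rng(seed)
    idx = rng.choice(n_rows, size=max_rows, replace=False)
    idx.sort()  # sorted indices access a memmap in file order
    return vectors[idx]


def first_principal_component(sample):
    """First right singular vector of the (uncentered) sample."""
    _, _, vt = np.linalg.svd(sample, full_matrices=False)
    return vt[0]


def remove_component(vectors, component):
    """Subtract the projection of every row onto `component`."""
    return vectors - np.outer(vectors @ component, component)
```

The component is estimated on the subsample only, but it can then be removed from all vectors, which keeps the peak memory of the SVD step bounded regardless of corpus size.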
There are a few things which might make sense to add to the roadmap:
- Newer models (I don't know, not up to date in this regard)
- Working the hierarchical op into the main averaging cython routine
- Support for a user-definable embedding class (i.e. an fse version of `BaseKeyedVectors`, to get away from the Gensim dependency)
- Different CI (Travis free mode no longer available)
- Add `pre_inference` and `post_inference` (I think I forgot this one)
- Refactoring the horribly complicated `Input` class
- Reworking the threading (at least from my last experience, the input thread is the bottleneck, not the actual computation)
- Untangling the bad design decision to actually store the `BaseKeyedVectors` from Gensim internally. If users want mmap, they can just load that and pass it.
- Edit: Approximate nearest neighbor search (i.e. by annoy support)?
- Return vectors only above a certain threshold #34
- Fix zero division error #47
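For the last two items (#34 and #47), a rough sketch of what a thresholded similarity query with a zero-norm guard could look like. The details of both issues are not described in this thread, so this assumes that "threshold" means a minimum cosine similarity and that the zero division comes from sentences with an all-zero vector. The function `similar_above_threshold` and its parameters are hypothetical and not part of the fse API.

```python
import numpy as np


def similar_above_threshold(query, matrix, threshold=0.5, eps=1e-12):
    """Return (row index, cosine similarity) pairs with similarity >= threshold.

    Hypothetical helper. Rows (or a query) with zero norm get a
    similarity of 0 instead of raising or producing NaN.
    """
    query = np.asarray(query, dtype=np.float64)
    matrix = np.asarray(matrix, dtype=np.float64)
    q_norm = np.linalg.norm(query)
    m_norms = np.linalg.norm(matrix, axis=1)
    # Clamp the denominator so all-zero vectors divide by eps, not by 0
    denom = np.maximum(q_norm * m_norms, eps)
    sims = (matrix @ query) / denom
    hits = np.flatnonzero(sims >= threshold)
    order = hits[np.argsort(-sims[hits])]  # best match first
    return [(int(i), float(sims[i])) for i in order]
```

Clamping the denominator is only one option; skipping empty sentences before the query would be another.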
Happy to work on some of the issues as well; I should have more time next year.
Who might be interested to help? @mathias3 @grantmwilliams @AlexMRuch
@mathias3: There is also a new version on PyPI: 0.1.17
Fixed / added in 0.2.0:
- Offering pretrained models and making them accessible
- Fix zero division error
- Bugfixes for Python 3.8 builds
- Code refactoring to Black style