Fast_Sentence_Embeddings

maintenance

Open mathias3 opened this issue 3 years ago • 3 comments

Hi Oliver, you have merged a few PRs but still have not released a new version. Is there a chance of that in the near future? Or, if you are busy with other projects, could you add maintainers to your repo? I or the person who made the recent PR would be more than happy to contribute and help this package survive. @oborchers

mathias3 Nov 18 '21 07:11

Hi @mathias3! Help would be very welcome. Although I don't have much time for this repository, it's still a good thing to have, and it fills a niche that more recent advancements in the NLP world cannot fill.

I've created version 0.1.17 and fixed the most glaring issues with the repository, mainly related to the Gensim and Python incompatibilities.

There is also still the develop branch, which contains many fixes and new features that I originally planned to implement or that are partially implemented. For example, the code for the following is fully or partially there:

  • Added Hierarchical (Convolutional) Embeddings for all Models
  • Added MaxPooling
  • Added Features to Sentencevectors
  • Added further unittests
  • Workaround for Numpy memmap issue (https://github.com/numpy/numpy/issues/13172)
  • SVD ram subsampling for SIF / uSIF (customizable, standard is 1 GB of RAM)
  • Minor fixes for nan-handling
  • Minor fixes for sentencevectors class
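For readers unfamiliar with the pooling items in this list, here is a minimal NumPy sketch of how max pooling differs from the averaging that fse models are built around. It is not the code from the develop branch: the function names `average_pool` and `max_pool` and the toy vectors are hypothetical, and returning zeros for an empty sentence is just one assumed way to avoid a division by a word count of zero.

```python
import numpy as np


def average_pool(word_vectors: np.ndarray) -> np.ndarray:
    """Mean over the word axis; returns zeros for an empty sentence."""
    if word_vectors.shape[0] == 0:
        # Assumed guard: avoid dividing by a word count of zero.
        return np.zeros(word_vectors.shape[1], dtype=word_vectors.dtype)
    return word_vectors.mean(axis=0)


def max_pool(word_vectors: np.ndarray) -> np.ndarray:
    """Element-wise maximum over the word axis; zeros for an empty sentence."""
    if word_vectors.shape[0] == 0:
        return np.zeros(word_vectors.shape[1], dtype=word_vectors.dtype)
    return word_vectors.max(axis=0)


# Toy sentence of three words with 4-dimensional vectors (made-up values).
sentence = np.array(
    [
        [1.0, 0.0, 2.0, -1.0],
        [3.0, 1.0, 0.0, -2.0],
        [2.0, 2.0, 1.0, 0.0],
    ],
    dtype=np.float32,
)

avg_vec = average_pool(sentence)
max_vec = max_pool(sentence)
empty_vec = average_pool(np.empty((0, 4), dtype=np.float32))
```

Averaging blends every word into each dimension, while max pooling keeps only the strongest value per dimension, so the two can rank the same sentences quite differently.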

There are a few things which might make sense to add to the roadmap:

  • Newer models (I don't know, not up to date in this regard)
  • Working the hierarchical op into the main averaging cython routine
  • Support for a user-definable embedding class (i.e. an fse version of BaseKeyedVectors to get away from the Gensim dependency)
  • Different CI (the Travis free tier is no longer available)
  • Add pre_inference and post_inference (I think I forgot this one)
  • Refactoring the horribly complicated Input class
  • Reworking the threading (at least from my last experience the input thread is the bottleneck, not the actual computation)
  • Untangling the bad design decision to actually store the BaseKeyedVectors from Gensim internally. If users want mmap, they can just load that and pass it.
  • Edit: Approximate nearest neighbor search (e.g. via Annoy support)?
  • Return vectors only above a certain threshold #34
  • Fix zero division error #47
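To make the nearest-neighbor and threshold items more concrete, here is a rough brute-force sketch in NumPy. It assumes that #34 refers to a cosine-similarity cutoff on query results, which the thread does not confirm. The function `most_similar`, its `threshold` parameter, and the toy vectors are hypothetical and are not part of the fse API. An approximate index such as Annoy would replace the exhaustive scan with a faster, inexact lookup.

```python
import numpy as np


def most_similar(query, matrix, threshold=0.5):
    """Return (row index, cosine similarity) pairs with similarity >= threshold."""
    matrix = np.asarray(matrix, dtype=np.float64)
    query = np.asarray(query, dtype=np.float64)
    # Replace zero norms by 1 so all-zero vectors score 0 instead of nan.
    row_norms = np.linalg.norm(matrix, axis=1)
    row_norms[row_norms == 0.0] = 1.0
    query_norm = np.linalg.norm(query) or 1.0
    sims = (matrix @ query) / (row_norms * query_norm)
    # Sort best match first, then drop everything below the cutoff.
    order = np.argsort(-sims)
    return [(int(i), float(sims[i])) for i in order if sims[i] >= threshold]


# Toy "sentence vectors" (made-up values); the last row is all zeros.
vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]])
hits = most_similar([1.0, 0.0], vectors, threshold=0.5)
```

The zero-norm guard is there because an all-zero sentence vector would otherwise trigger a division by zero during normalization.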

Happy to work on some of the issues as well. I should have more time next year.

Who might be interested to help? @mathias3 @grantmwilliams @AlexMRuch

oborchers Nov 27 '21 13:11

@mathias3: There is also a new version on PyPI: 0.1.17

oborchers Nov 27 '21 14:11

Fixed / added in 0.2.0:

  • Offering pretrained models and making them accessible
  • Fix zero division error
  • Bugfixes for Python 3.8 builds
  • Code refactoring to black style

oborchers Dec 03 '21 10:12