Need for maintained fastText project
I noticed the official fastText repository was archived on Mar 19, 2024 and is thus no longer maintained.
Annif already installs fastText from the fasttext-wheel PyPI project, which builds code from https://github.com/messense/fasttext-wheel (Annif initially switched to that because of this installation issue).
At some point, installation from the fasttext-wheel project will probably stop working. On the fastText mailing list someone has already reported a compatibility issue with NumPy 2.0. Indeed, when I now try to install the fasttext backend dependencies of Annif on Python 3.12 on my laptop, Poetry complains `Unable to find installation candidates for fasttext-wheel (0.9.2)` (two weeks ago, after the NumPy 2.0 release, CI/CD still installed fastText successfully).
If we want to retain the fasttext backend in Annif for the long term, we need to find an alternative provider of fastText. There are a number of more or less active fastText forks.
An interesting candidate is floret, available on PyPI.
It could be more memory-efficient than fastText:
> In order to store word and subword vectors in a more compact format, we turn to an algorithm that's been used by spaCy all along: Bloom embeddings. Bloom embeddings (also called the "hashing trick", or known as HashEmbed within spaCy's ML library thinc) can be used to store distinct representations in a compact table by hashing each entry into multiple rows in the table. By representing each entry as the sum of multiple rows, where it's unlikely that two entries will collide on multiple hashes, most entries will end up with a distinct representation.
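The mechanism described in the quote can be sketched in a few lines. This is a minimal illustration of the idea, not spaCy's or floret's actual implementation; the table size, dimensionality, and number of hashes are made up:

```python
# Minimal sketch of Bloom embeddings (the "hashing trick" quoted above).
# Instead of one table row per vocabulary entry, each entry is hashed into
# several rows of a small table, and its vector is the sum of those rows.
import numpy as np

rng = np.random.default_rng(0)
table_rows, dim, num_hashes = 1000, 8, 2   # far fewer rows than entries
table = rng.normal(size=(table_rows, dim))

def embed(entry: str) -> np.ndarray:
    """Sum the table rows this entry hashes to (one hash per seed)."""
    rows = [hash((seed, entry)) % table_rows for seed in range(num_hashes)]
    return table[rows].sum(axis=0)

# Two entries are unlikely to collide on *all* hashes, so their summed
# representations almost always stay distinct:
assert not np.allclose(embed("apple"), embed("orange"))
```

Because the table size is fixed regardless of vocabulary size, memory use no longer grows with the number of distinct words and subwords, which is where the memory savings over plain fastText come from.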
It should provide the same functionality as fastText:
> floret supports all existing fasttext commands and does not modify any fasttext defaults.
Edit: Switching to the floret-provided fastText algorithm needs only very minimal changes: https://github.com/NatLibFi/Annif/commit/c6d23d8c4e6909b1b3d725498f133396a01e5f38
InseeFrLab/torch-fastText is a fastText implementation using PyTorch by the National Institute of Statistics and Economic Studies (INSEE) of France.
This is getting more urgent. We had to switch to another fasttext fork (#890) to make it work with NumPy 2. But there are mysterious "Encountered NaN" issues, somehow related to spaCy (?). See https://github.com/NatLibFi/Annif/pull/890#issuecomment-3244794054
I think we will have to get rid of the original fastText soon after the 1.4 release and switch to something else, perhaps the PyTorch reimplementation mentioned above.
> InseeFrLab/torch-fastText is a fastText implementation using PyTorch by the National Institute of Statistics and Economic Studies (INSEE) of France.
torch-fastText has lately been renamed/rebranded as torchTextClassifiers, discussion in this issue: https://github.com/InseeFrLab/torch-fastText/issues/50
Interesting! If we want to go this way, I think it would make sense to create a new backend (perhaps called ttc) and drop the current fastText backend. Of course this needs some testing to confirm that torchTextClassifiers provides performance comparable to fastText for our usage scenarios.
Also, we probably should drop the TensorFlow dependency first, see #895 .