
[Feature] Consider newer open language identification models

Open · shahrokhDaijavad opened this issue 5 months ago · 3 comments

Search before asking

  • [x] I searched the issues and found no similar issues.

Component

transforms/lang_id

Feature

The current transform uses https://huggingface.co/facebook/fasttext-language-identification, a model that supports 157 languages and is about 1.2 GB.

There must be newer open-source models that we should consider. We should investigate these models, based on their size and accuracy, either as a replacement for or as an additional option alongside the current model.
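For reference, loading and querying this model typically looks roughly like the sketch below (assuming the `fasttext` and `huggingface_hub` Python packages; this is illustrative only, not the transform's actual code):

```python
# Minimal, illustrative sketch (not the lang_id transform's actual code).
# Assumes: pip install fasttext huggingface_hub
import fasttext
from huggingface_hub import hf_hub_download

# Downloads (and caches) the ~1.2 GB model.bin from the Hugging Face Hub.
model_path = hf_hub_download(
    repo_id="facebook/fasttext-language-identification",
    filename="model.bin",
)
model = fasttext.load_model(model_path)

# Top-1 prediction; labels look like "__label__fra_Latn".
labels, scores = model.predict("Bonjour tout le monde", k=1)
print(labels[0], float(scores[0]))
```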

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

shahrokhDaijavad · Jul 09 '25 19:07

Hi @shahrokhDaijavad and team,

I’ve done a quick survey comparing the current FastText-based LangID model against newer transformer and lightweight alternatives. The main downside of the existing facebook/fasttext-language-identification (1.18 GB, 157 langs) is its lower accuracy on very short, ambiguous inputs and inability to handle inline code-switching.

Below is an approximate comparison of LangID models:

| Feature | facebook/fasttext (2017) | papluca/xlm-roberta-base-language-detection (2021) | cis-lmu/glotlid (2024) | juliensimon/xlm-v-base-language-id (2023) | google/cld3 (2019) | langdetect (2014) |
|---|---|---|---|---|---|---|
| Type | Linear classifier | Transformer | Transformer | Transformer | Hybrid classifier | Linear classifier |
| Homepage | https://huggingface.co/facebook/fasttext-language-identification | https://huggingface.co/papluca/xlm-roberta-base-language-detection | https://huggingface.co/cis-lmu/glotlid | https://huggingface.co/juliensimon/xlm-v-base-language-id | https://github.com/google/cld3 | https://pypi.org/project/langdetect/ |
| Weight File | 1.18 GB | 1.14 GB | 1.69 GB | 3.11 GB | N/A (C++ library; no weight file) | N/A (pure Python; no weight file) |
| Languages | 157 | 20 | 100+ | 102 (fine-tuned from FastText) | 47 | ~55 |
| Accuracy | Baseline | High | Very high | High | Good | Lower |
| Speed | Very fast | Moderate | Moderate | Moderate | Extremely fast | Very fast |
| Context Handling | Low | High | High | High | Medium | Low |
| Code-Switch Handling | Poor | Good | Improved | Good | Poor | Poor |
| Integration Ease | Very easy | Easy | Easy | Easy | Medium | Very easy |
| Best Use Case | Batch pipelines | Modern NLP workflows | Context-sensitive LID | Mid-sized multilingual LID | Embedded/browser use | Lightweight/non-critical tasks |

Next steps:

  1. Benchmark these on an internal LangID test set (short, ambiguous, and mixed-language samples)
  2. Select the top 1–2 candidates for an optional transformer-based transform (a rough usage sketch follows these steps)
  3. Open a PR to integrate and document the new model(s)
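For step 2, one possible shape of an optional transformer-based backend is the Hugging Face `transformers` text-classification pipeline. A minimal sketch, using papluca/xlm-roberta-base-language-detection purely as an illustration, not a selection:

```python
# Illustrative sketch of trying a transformer-based LangID candidate.
# Assumes: pip install transformers torch
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
)

# This model returns ISO 639-1 style labels such as "en", "fr", "ru".
for text in ["Hello, how are you?", "Bonjour tout le monde", "ここはどこですか"]:
    top = detector(text, top_k=1)[0]
    print(text, "->", top["label"], round(top["score"], 3))
```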

MaryamZahiri · Jul 11 '25 08:07

Hi, @MaryamZahiri. Thank you. Great comparison table! Yes, the next steps are in the right direction.

shahrokhDaijavad · Jul 11 '25 14:07

Hello team and @shahrokhDaijavad,
I tested the models that @MaryamZahiri mentioned on some internal and add-on tests (I can share those if needed :] ). My test results:

| Model | Accuracy | Total Time (s) | Avg/sample (s) |
|---|---|---|---|
| papluca/xlm-roberta-base-language-detection | 55.2% | 1.638 | 0.056 |
| facebook/fasttext-language-identification | 69.0% | 0.001 | 0.000 |
| juliensimon/xlm-v-base-language-id | 65.5% | 4.419 | 0.152 |
| langdetect | 55.2% | 0.214 | 0.007 |

I mainly tested for (a rough harness sketch follows this list):

  1. Accuracy – Whether the model returns the correct ISO language code.

  2. Speed – Both total runtime and average time per sample.

  3. Short/Ambiguous Inputs – How well it handles single words or unclear tokens.

  4. Context Awareness – Whether it uses full sentence context or just token frequencies.

  5. Code-switching – Ability to handle mixed-language sentences in the same input.

  6. Non-language Input Handling – How it treats numbers, punctuation, symbols, etc.

  7. Named Entities & Borrowed Words – Detection of names, emojis, and words from other languages.
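A rough sketch of what such an accuracy/timing harness can look like; the samples and the per-model `detect` wrappers below are placeholders, not the actual internal test set:

```python
# Rough harness sketch; samples and wrappers are placeholders, not the real test set.
import time

from langdetect import detect as langdetect_detect  # pip install langdetect

# (text, expected ISO code) pairs; replace with the real short/ambiguous/mixed samples.
samples = [
    ("Hello, how are you?", "en"),
    ("Bonjour tout le monde", "fr"),
    ("¿Dónde está la biblioteca?", "es"),
]

def benchmark(name, detect):
    """detect(text) must return a predicted ISO language code."""
    correct = 0
    start = time.perf_counter()
    for text, expected in samples:
        if detect(text) == expected:
            correct += 1
    total = time.perf_counter() - start
    print(f"{name}: acc={correct / len(samples):.1%}, "
          f"total={total:.3f}s, avg={total / len(samples):.3f}s")

# Example wrapper: langdetect already returns codes like "en"/"fr"/"es" directly.
benchmark("langdetect", langdetect_detect)
```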

TL;DR: facebook/fasttext, although old, is still remarkable given its speed and size. juliensimon/xlm-v-base-language-id comes close on accuracy, performing similarly or slightly worse on a few tests, but facebook/fasttext is in a different league on speed, and juliensimon/xlm-v-base-language-id is also much larger.

So far facebook/fasttext is pretty much dominating the tests, with remarkable speed.

I have yet to test cis-lmu/glotlid and google/cld3 due to an installation issue; I will check again. I think google/cld3 in particular will be interesting to see, as it is a C++ library. Will update soon :)
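For reference on those two: as far as I can tell, cis-lmu/glotlid is distributed as a fastText-format model.bin on the Hugging Face Hub, and google/cld3 is commonly used from Python via the `gcld3` bindings, so loading them would look roughly like the following (unverified sketch):

```python
# Unverified sketch for the two remaining candidates.
import fasttext
import gcld3  # pip install gcld3 (Python bindings wrapping the C++ cld3 library)
from huggingface_hub import hf_hub_download

# GlotLID ships as a fastText model file on the Hugging Face Hub.
glotlid_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
glotlid = fasttext.load_model(glotlid_path)
print(glotlid.predict("Bonjour tout le monde", k=1))  # labels look like "__label__fra_Latn"

# CLD3 usage through the gcld3 bindings.
detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)
result = detector.FindLanguage(text="Bonjour tout le monde")
print(result.language, result.probability, result.is_reliable)
```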

Also, let me know if there are any other models I should test.

ShiroYasha18 · Jul 11 '25 18:07