[Feature] Consider newer open language identification models
Search before asking
- [x] I searched the issues and found no similar issues.
Component
transforms/lang_id
Feature
The current transform uses https://huggingface.co/facebook/fasttext-language-identification, a model that supports 157 languages and is about 1.2 GB.
There must be newer open-source models that we should consider. We should investigate these, based on their size and accuracy, either as a replacement for the current model or as an additional option alongside it.
Are you willing to submit a PR?
- [ ] Yes I am willing to submit a PR!
Hi @shahrokhDaijavad and team,
I’ve done a quick survey comparing the current FastText-based LangID model against newer transformer and lightweight alternatives. The main downside of the existing facebook/fasttext-language-identification (1.18 GB, 157 langs) is its lower accuracy on very short, ambiguous inputs and inability to handle inline code-switching.
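For reference, the current model can be run directly with the fastText API; the sketch below follows the model card (the `model.bin` filename and the `__label__xxx_Yyyy` label format come from there) and is not the lang_id transform's actual code.

```python
# Sketch of invoking the current model directly, based on the model card;
# not the lang_id transform's actual code.
import fasttext
from huggingface_hub import hf_hub_download

model_path = hf_hub_download(
    repo_id="facebook/fasttext-language-identification",
    filename="model.bin",
)
model = fasttext.load_model(model_path)

# Top-1 prediction; fastText returns labels like "__label__fra_Latn" plus scores.
labels, scores = model.predict("Bonjour tout le monde", k=1)
print(labels[0].replace("__label__", ""), round(float(scores[0]), 3))
```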
Below is an approximate comparison of LangID models:
| Feature | facebook/fasttext (2017) | papluca/xlm-roberta-base-language-detection (2021) | cis-lmu/glotlid (2024) | juliensimon/xlm-v-base-language-id (2023) | google/cld3 (2019) | langdetect (2014) |
|---|---|---|---|---|---|---|
| Type | Linear classifier | Transformer | Linear classifier (fastText-based) | Transformer | Hybrid classifier | Linear classifier |
| Homepage | https://huggingface.co/facebook/fasttext-language-identification | https://huggingface.co/papluca/xlm-roberta-base-language-detection | https://huggingface.co/cis-lmu/glotlid | https://huggingface.co/juliensimon/xlm-v-base-language-id | https://github.com/google/cld3 | https://pypi.org/project/langdetect/ |
| Weight File | 1.18 GB | 1.14 GB | 1.69 GB | 3.11 GB | N/A (C++ library; no weight) | N/A (pure-Python; no weights) |
| Languages | 157 | 20 | 100+ | 102 (fine-tuned on google/fleurs) | 47 | ~55 |
| Accuracy | Baseline | High | Very high | High | Good | Lower |
| Speed | Very fast | Moderate | Moderate | Moderate | Extremely fast | Very fast |
| Context Handling | Low | High | High | High | Medium | Low |
| Code-Switch Handling | Poor | Good | Improved | Good | Poor | Poor |
| Integration Ease | Very easy | Easy | Easy | Easy | Medium | Very easy |
| Best Use Case | Batch pipelines | Modern NLP workflows | Context-sensitive LID | Mid-sized multilingual LID | Embedded/browser use | Lightweight/non-critical tasks |
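For the transformer candidates, here is a minimal sketch of how one of them can be queried; the model name is from its card, and the rest is generic transformers text-classification pipeline usage rather than code from this repo.

```python
# Sketch: language ID with one transformer candidate via the standard
# transformers text-classification pipeline. This model returns ISO 639-1
# labels (e.g. "en", "es") for its 20 supported languages.
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
)

for text in ["Hello world", "Hola mundo", "Bonjour tout le monde"]:
    top = detector(text)[0]  # top-1 prediction as {"label": ..., "score": ...}
    print(f"{text!r} -> {top['label']} ({top['score']:.3f})")
```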
Next steps:
- Benchmark these models on an internal LangID test set (short, ambiguous, and mixed-language samples)
- Select top 1–2 candidates for an optional transformer-based transform
- Open a PR to integrate and document the new model(s)
Hi, @MaryamZahiri. Thank you. Great comparison table! Yes, the next steps are in the right direction.
Hello team and @shahrokhDaijavad,
I tested the models @MaryamZahiri mentioned on some internal and add-on tests (I can share those if needed :] ).
My test results:
| Model | Accuracy | Total Time (s) | Avg/sample (s) |
|---|---|---|---|
| papluca/xlm-roberta-base-language-detection | 55.2% | 1.638 | 0.056 |
| facebook/fasttext-language-identification | 69.0% | 0.001 | 0.000 |
| juliensimon/xlm-v-base-language-id | 65.5% | 4.419 | 0.152 |
| langdetect | 55.2% | 0.214 | 0.007 |
I mainly tested for the following (a rough sketch of this kind of harness follows the list):
- Accuracy – Whether the model returns the correct ISO language code.
- Speed – Both total runtime and average time per sample.
- Short/Ambiguous Inputs – How well it handles single words or unclear tokens.
- Context Awareness – Whether it uses full sentence context or just token frequencies.
- Code-switching – Ability to handle mixed-language sentences in the same input.
- Non-language Input Handling – How it treats numbers, punctuation, symbols, etc.
- Named Entities & Borrowed Words – Detection of names, emojis, and words from other languages.
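Here is a rough sketch of the kind of accuracy/timing harness behind the table above; the labeled samples and the `predict_lang()` hook are placeholders for illustration, not my actual test set or code.

```python
# Rough sketch of an accuracy/timing harness; the sample list and the
# predict_lang() hook are placeholders, not the actual test data.
import time

# Hypothetical labeled samples: (text, expected ISO 639-1 code)
SAMPLES = [
    ("Hello, how are you?", "en"),
    ("¿Dónde está la biblioteca?", "es"),
    ("Je ne sais pas.", "fr"),
]

def benchmark(predict_lang, samples=SAMPLES):
    """predict_lang: callable mapping a text to a language code."""
    correct = 0
    start = time.perf_counter()
    for text, expected in samples:
        if predict_lang(text) == expected:
            correct += 1
    total = time.perf_counter() - start
    return {
        "accuracy": correct / len(samples),
        "total_time_s": round(total, 3),
        "avg_per_sample_s": round(total / len(samples), 3),
    }

# Example with langdetect (pip install langdetect):
# from langdetect import detect
# print(benchmark(detect))
```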
TL;DR: facebook/fasttext, although old, is still insane for the time it takes and the size it has. juliensimon/xlm-v-base-language-id comes close on accuracy, performing similarly or slightly worse on a few tests, but facebook/fasttext's speed is in a different league, and juliensimon/xlm-v-base-language-id is also considerably larger.
So far facebook/fasttext is pretty much dominating the tests, with mind-blowing speed.
I have yet to test cis-lmu/glotlid and google/cld3 due to some installation issues; I will check again. I think google/cld3 especially will be interesting to see, as it is a C++ library. Will update soon :)
Also, let me know if there are any other models you would like tested.
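In case it helps with the cis-lmu/glotlid setup: per its model card, GlotLID is distributed as a fastText model on the Hub, so it should load with the same fasttext/huggingface_hub API as the current model. A minimal sketch, assuming the `model.bin` filename from the card:

```python
# Sketch only: cis-lmu/glotlid is a fastText model per its model card,
# so it loads with the same API as the current model (filename assumed).
import fasttext
from huggingface_hub import hf_hub_download

glotlid = fasttext.load_model(
    hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
)
# k=3 shows a few top candidates from its much larger label set.
print(glotlid.predict("Bonjour tout le monde", k=3))
```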