
[Feature] Consider newer open language identification models

Open · shahrokhDaijavad opened this issue 5 months ago · 3 comments

Search before asking

  • [x] I searched the issues and found no similar issues.

Component

transforms/lang_id

Feature

The current transform uses https://huggingface.co/facebook/fasttext-language-identification, a model that supports 157 languages and is about 1.2 GB.

There must be newer open-source models that we should consider. We should investigate these models, based on their size and accuracy, either as a replacement for or as an additional option alongside the current model.
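For reference, loading and querying this model typically looks roughly like the sketch below (assuming the `fasttext` and `huggingface_hub` Python packages; this is illustrative only, not the transform's actual code):

```python
# Minimal, illustrative sketch (not the lang_id transform's actual code).
# Assumes: pip install fasttext huggingface_hub
import fasttext
from huggingface_hub import hf_hub_download

# Downloads (and caches) the ~1.2 GB model.bin from the Hugging Face Hub.
model_path = hf_hub_download(
    repo_id="facebook/fasttext-language-identification",
    filename="model.bin",
)
model = fasttext.load_model(model_path)

# Top-1 prediction; labels look like "__label__fra_Latn".
labels, scores = model.predict("Bonjour tout le monde", k=1)
print(labels[0], float(scores[0]))
```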

Are you willing to submit a PR?

  • [ ] Yes I am willing to submit a PR!

shahrokhDaijavad · Jul 09 '25 19:07

Hi @shahrokhDaijavad and team,

I’ve done a quick survey comparing the current FastText-based LangID model against newer transformer and lightweight alternatives. The main downside of the existing facebook/fasttext-language-identification (1.18 GB, 157 langs) is its lower accuracy on very short, ambiguous inputs and inability to handle inline code-switching.

Below is an approximate comparison of LangID models:

| Feature | facebook/fasttext (2017) | papluca/xlm-roberta-base-language-detection (2021) | cis-lmu/glotlid (2024) | juliensimon/xlm-v-base-language-id (2023) | google/cld3 (2019) | langdetect (2014) |
|---|---|---|---|---|---|---|
| Type | Linear classifier | Transformer | Transformer | Transformer | Hybrid classifier | Linear classifier |
| Homepage | https://huggingface.co/facebook/fasttext-language-identification | https://huggingface.co/papluca/xlm-roberta-base-language-detection | https://huggingface.co/cis-lmu/glotlid | https://huggingface.co/juliensimon/xlm-v-base-language-id | https://github.com/google/cld3 | https://pypi.org/project/langdetect/ |
| Weight File | 1.18 GB | 1.14 GB | 1.69 GB | 3.11 GB | N/A (C++ library; no weight file) | N/A (pure Python; no weight file) |
| Languages | 157 | 20 | 100+ | 102 (fine-tuned from FastText) | 47 | ~55 |
| Accuracy | Baseline | High | Very high | High | Good | Lower |
| Speed | Very fast | Moderate | Moderate | Moderate | Extremely fast | Very fast |
| Context Handling | Low | High | High | High | Medium | Low |
| Code-Switch Handling | Poor | Good | Improved | Good | Poor | Poor |
| Integration Ease | Very easy | Easy | Easy | Easy | Medium | Very easy |
| Best Use Case | Batch pipelines | Modern NLP workflows | Context-sensitive LID | Mid-sized multilingual LID | Embedded/browser use | Lightweight/non-critical tasks |

Next steps:

  1. Benchmark these on an internal LangID test set (short, ambiguous, and mixed-language samples)
  2. Select the top 1–2 candidates for an optional transformer-based transform (a rough usage sketch follows these steps)
  3. Open a PR to integrate and document the new model(s)
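For step 2, one possible shape of an optional transformer-based backend is the Hugging Face `transformers` text-classification pipeline. A minimal sketch, using papluca/xlm-roberta-base-language-detection purely as an illustration, not a selection:

```python
# Illustrative sketch of trying a transformer-based LangID candidate.
# Assumes: pip install transformers torch
from transformers import pipeline

detector = pipeline(
    "text-classification",
    model="papluca/xlm-roberta-base-language-detection",
)

# This model returns ISO 639-1 style labels such as "en", "fr", "ru".
for text in ["Hello, how are you?", "Bonjour tout le monde", "ここはどこですか"]:
    top = detector(text, top_k=1)[0]
    print(text, "->", top["label"], round(top["score"], 3))
```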

MaryamZahiri · Jul 11 '25 08:07

Hi, @MaryamZahiri. Thank you. Great comparison table! Yes, the next steps are in the right direction.

shahrokhDaijavad · Jul 11 '25 14:07

Hello team and @shahrokhDaijavad,
I tested the models that @MaryamZahiri mentioned on some internal and add-on tests (I can share those if needed :] ). My test results:

| Model | Accuracy | Total Time (s) | Avg/sample (s) |
|---|---|---|---|
| papluca/xlm-roberta-base-language-detection | 55.2% | 1.638 | 0.056 |
| facebook/fasttext-language-identification | 69.0% | 0.001 | 0.000 |
| juliensimon/xlm-v-base-language-id | 65.5% | 4.419 | 0.152 |
| langdetect | 55.2% | 0.214 | 0.007 |

I mainly tested for (a rough harness sketch follows this list):

  1. Accuracy – Whether the model returns the correct ISO language code.

  2. Speed – Both total runtime and average time per sample.

  3. Short/Ambiguous Inputs – How well it handles single words or unclear tokens.

  4. Context Awareness – Whether it uses full sentence context or just token frequencies.

  5. Code-switching – Ability to handle mixed-language sentences in the same input.

  6. Non-language Input Handling – How it treats numbers, punctuation, symbols, etc.

  7. Named Entities & Borrowed Words – Detection of names, emojis, and words from other languages.
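A rough sketch of what such an accuracy/timing harness can look like; the samples and the per-model `detect` wrappers below are placeholders, not the actual internal test set:

```python
# Rough harness sketch; samples and wrappers are placeholders, not the real test set.
import time

from langdetect import detect as langdetect_detect  # pip install langdetect

# (text, expected ISO code) pairs; replace with the real short/ambiguous/mixed samples.
samples = [
    ("Hello, how are you?", "en"),
    ("Bonjour tout le monde", "fr"),
    ("¿Dónde está la biblioteca?", "es"),
]

def benchmark(name, detect):
    """detect(text) must return a predicted ISO language code."""
    correct = 0
    start = time.perf_counter()
    for text, expected in samples:
        if detect(text) == expected:
            correct += 1
    total = time.perf_counter() - start
    print(f"{name}: acc={correct / len(samples):.1%}, "
          f"total={total:.3f}s, avg={total / len(samples):.3f}s")

# Example wrapper: langdetect already returns codes like "en"/"fr"/"es" directly.
benchmark("langdetect", langdetect_detect)
```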

TL;DR: facebook/fasttext, although old, is still remarkable given its speed and size. juliensimon/xlm-v-base-language-id comes close on accuracy, performing similarly or slightly worse on a few tests, but facebook/fasttext is in a different league on speed, and juliensimon/xlm-v-base-language-id is also much larger.

So far facebook/fasttext is pretty much dominating the tests, with remarkable speed.

I have yet to test cis-lmu/glotlid and google/cld3 due to an installation issue; I will check again. I think google/cld3 in particular will be interesting to see, as it is a C++ library. Will update soon :)
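For reference on those two: as far as I can tell, cis-lmu/glotlid is distributed as a fastText-format model.bin on the Hugging Face Hub, and google/cld3 is commonly used from Python via the `gcld3` bindings, so loading them would look roughly like the following (unverified sketch):

```python
# Unverified sketch for the two remaining candidates.
import fasttext
import gcld3  # pip install gcld3 (Python bindings wrapping the C++ cld3 library)
from huggingface_hub import hf_hub_download

# GlotLID ships as a fastText model file on the Hugging Face Hub.
glotlid_path = hf_hub_download(repo_id="cis-lmu/glotlid", filename="model.bin")
glotlid = fasttext.load_model(glotlid_path)
print(glotlid.predict("Bonjour tout le monde", k=1))  # labels look like "__label__fra_Latn"

# CLD3 usage through the gcld3 bindings.
detector = gcld3.NNetLanguageIdentifier(min_num_bytes=0, max_num_bytes=1000)
result = detector.FindLanguage(text="Bonjour tout le monde")
print(result.language, result.probability, result.is_reliable)
```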

Also, let me know if there are any other models I should test.

ShiroYasha18 · Jul 11 '25 18:07