fast-langdetect

Text length warning does not consider language, flags valid English sentences

Open JackyHe398 opened this issue 7 months ago • 6 comments

Currently, in `infer.py` (`LangDetector._preprocess_text()`), a "text too long" warning is raised whenever the text exceeds 100 characters.

In English or Korean, however, 100 characters often corresponds to only a short or medium-length sentence. This makes the warning misleading, since many valid sentences trigger it.

Suggestion:
Perform language detection first, then apply length thresholds appropriate to each language. This would ensure that the "text too long" warning is triggered only when the text is genuinely too long for the detected language.
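A minimal sketch of this suggestion, assuming detection has already produced a language code. The threshold values and the `should_warn` helper are illustrative assumptions, not part of fast-langdetect:

```python
# Hypothetical per-language length thresholds (illustrative values only):
# character-counted scripts saturate at far fewer characters than
# space-delimited ones, so they get a lower limit.
LONG_TEXT_THRESHOLDS = {
    "zh": 50,   # Chinese: characters carry whole words
    "ja": 50,   # Japanese
    "en": 200,  # English: 100 chars is often just one sentence
}
DEFAULT_THRESHOLD = 100  # current behavior as the fallback

def should_warn(text: str, detected_lang: str) -> bool:
    """Warn about length only after detection, using a per-language limit."""
    limit = LONG_TEXT_THRESHOLDS.get(detected_lang, DEFAULT_THRESHOLD)
    return len(text) > limit
```

This keeps the warning, but only fires it when the text is long relative to the detected language.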

JackyHe398 avatar Sep 17 '25 09:09 JackyHe398

This is indeed a good question. However, this length can be difficult to determine. We know that overly long texts can affect prediction accuracy, but we don't know the exact length.

Logically, I shouldn't do this truncation.

sudoskys avatar Sep 17 '25 12:09 sudoskys

Throwing a warning is actually a good way to draw users' attention. My initial idea was to use an if clause right after obtaining the result, but you might have a better suggestion for integrating it more elegantly; that's why I decided to open an issue rather than a PR. Instead of removing the warning, I'd actually prefer introducing a minimum threshold that triggers it: in my testing, the model tends to confuse CJK languages when dealing with short texts.

I have a couple of thoughts on how to approach this. Neither we nor the users need a precise or absolute number. The effect is more of a continuous curve, and identifying a clear turning point is inherently difficult; even a tiny change in the input can shift the curve considerably. Similar to how a poverty line is defined, we might take the median sentence length (based on linguistic and literary statistics) and multiply it by a coefficient to define what qualifies as "long". Alternatively, we could use a percentile-based approach—say, the 75th or 85th percentile—as the threshold.
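Both threshold definitions are a one-liner over a sample of sentence lengths. The sample data and the coefficient below are purely illustrative assumptions:

```python
import statistics

# Hypothetical sentence-length samples for one language (illustrative data,
# not real corpus statistics).
sample_lengths = [12, 18, 22, 25, 30, 34, 40, 55, 70, 90]

# Median-times-coefficient approach; the coefficient is an assumed tuning knob.
median_threshold = statistics.median(sample_lengths) * 2.5

# Percentile-based alternative: take e.g. the 85th percentile as "long".
percentile_threshold = statistics.quantiles(sample_lengths, n=100)[84]
```

Either number would replace the hard-coded 100 for that language.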

Implementation

  1. Sino-Tibetan languages and the Japonic-Ryukyuan languages, on the other hand, can be measured by the number of characters present in the sentence.
  2. Most modern Germanic languages, and even Indo-European languages in general, can be measured by the number of spaces present in the sentence. Languages like Māori and many African languages often use romanized writing systems, so space-counting should also be effective for them. Unless the lib is designed to deal with ancient languages like Persian and Sanskrit, this approach will be more than sufficient.

So the direction is fairly clear:

  • If the language belongs to Group 1, use len(text)
  • If it belongs to Group 2, count the number of spaces
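The two-group rule above can be sketched in a few lines. The Unicode ranges and the dominance heuristic are my own rough assumptions, not the library's code:

```python
import re

# Rough Unicode ranges for scripts that do not separate words with spaces:
# CJK ideographs, Japanese kana, and Hangul syllables (an approximation;
# a real implementation would enumerate the script codes more carefully).
NO_SPACE_SCRIPTS = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff\uac00-\ud7af]")

def effective_length(text: str) -> int:
    """Estimate sentence length in 'units' rather than raw characters.

    Group 1 (no-space scripts): count the script characters themselves.
    Group 2 (space-delimited scripts): count whitespace-separated words.
    """
    script_chars = NO_SPACE_SCRIPTS.findall(text)
    # If the text is dominated by no-space scripts, use the character count.
    if len(script_chars) > len(text) / 2:
        return len(script_chars)
    # Otherwise fall back to counting words.
    return len(text.split())
```

For example, a 100-character English sentence and a 100-character Chinese sentence would get very different effective lengths under this metric.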

JackyHe398 avatar Sep 17 '25 16:09 JackyHe398

I would be happy to create a multi-language word counting library to deal with it. It could be useful beyond this project as well.

JackyHe398 avatar Sep 17 '25 16:09 JackyHe398

I think a small test will help us understand the real relationship between input length and prediction score.

Generally speaking, I don't want libraries to be overly complex; truncation is designed for edge cases.

However, we can still provide this tool independently, provided we have benchmarks to prove its effectiveness.

sudoskys avatar Sep 18 '25 01:09 sudoskys

https://github.com/JackyHe398/len-sentence It's actually pretty lightweight—just simple regular-expression logic. The only complicated part is enumerating all the writing-system codes.

JackyHe398 avatar Sep 18 '25 06:09 JackyHe398

I might add this use case to the README to guide users on how to sample more effectively. I don’t plan to build it in; instead, I'll give users more flexibility in handling it, allowing for a more transparent process.

sudoskys avatar Sep 18 '25 07:09 sudoskys