Text length warning does not consider language, flags valid English sentences
Currently, in infer.py#class LangDetector#def _preprocess_text(), a "text too long" warning is raised whenever the text exceeds 100 characters.
In English or Korean, however, 100 characters often correspond to only a short or medium-length sentence. This makes the warning misleading, since many valid sentences trigger it.
Suggestion:
Perform language detection first, then apply length thresholds appropriate to each language. This would ensure that the "text too long" warning is triggered only when the text is genuinely too long for the detected language.
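One possible shape of that flow, sketched with hypothetical names and purely illustrative thresholds (none of this is the library's actual API):

```python
# Hypothetical sketch: detect the language first, then apply a
# per-language threshold. The function name and all numbers below
# are assumptions for illustration only.

# Rough "long text" thresholds in characters (illustrative values).
LONG_TEXT_THRESHOLDS = {
    "zh": 100,   # dense scripts: 100 chars already carry a lot of content
    "ja": 100,
    "ko": 200,
    "en": 300,   # alphabetic scripts: 100 chars is a short sentence
}
DEFAULT_THRESHOLD = 200

def warn_if_too_long(text: str, lang: str) -> bool:
    """Warn only if the text is long *for the detected language*."""
    threshold = LONG_TEXT_THRESHOLDS.get(lang, DEFAULT_THRESHOLD)
    if len(text) > threshold:
        print(f"warning: text too long for '{lang}' ({len(text)} > {threshold})")
        return True
    return False
```

With thresholds like these, a 150-character English sentence would no longer trigger the warning, while 150 characters of Chinese still would.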
This is indeed a good question. However, this length can be difficult to determine. We know that overly long texts can affect prediction accuracy, but we don't know the exact length at which that happens.
Logically, I shouldn't be doing this truncation at all.
Throwing a warning is actually a good way to draw users' attention. My initial idea was to use an `if` clause right after obtaining the result, but you might have a better suggestion for integrating it more elegantly. That's why I decided to leave this as an issue rather than a PR. Instead of removing the warning, I'd actually prefer introducing a minimum threshold that triggers it: in my testing, the model tends to confuse CJK languages when dealing with short texts.
I have a couple of thoughts on how to approach this. Neither we nor the users need a precise or absolute number. The effect is more of a continuous curve, and identifying a clear turning point is inherently difficult; even a tiny change in the input can shift the curve significantly. Similar to how a poverty line is defined, we might take the median sentence length (based on linguistic and literary statistics) and multiply it by a certain coefficient to define what qualifies as "long". Alternatively, we could use a percentile-based approach, say the 75th or 85th percentile, as the threshold.
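Both options can be sketched in a few lines. This is a rough illustration, assuming we have a sample of observed sentence lengths for a language; the coefficient and percentile values are placeholders, not recommendations:

```python
import statistics

def median_based_threshold(sample_lengths, coefficient=3.0):
    """'Long' cutoff defined as coefficient * median sample length,
    analogous to deriving a poverty line from median income.
    The coefficient here is an illustrative assumption."""
    return statistics.median(sample_lengths) * coefficient

def percentile_threshold(sample_lengths, pct=85):
    """Alternative: take e.g. the 85th percentile of observed lengths
    as the cutoff (nearest-rank style, for simplicity)."""
    ordered = sorted(sample_lengths)
    idx = round(pct / 100 * (len(ordered) - 1))
    return ordered[idx]
```

Either way, the threshold would be precomputed per language from corpus statistics, not recalculated at inference time.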
Implementation
- Sino-Tibetan and Japonic (Japanese-Ryukyuan) languages, on the other hand, can be measured by the number of characters present in the sentence.
- Most modern Germanic languages, and even Indo-European languages more broadly, can be measured by the number of spaces present in the sentence. Languages like Māori and many African languages often use romanized writing systems, so space counting should also be effective for them. Unless the lib is designed to deal with ancient languages like Persian and Sanskrit, this approach will be more than sufficient.
So the direction is fairly clear:
- If the language belongs to Group 1, use `len(text)`
- If it belongs to Group 2, count the number of spaces
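A minimal sketch of that dispatch, assuming a deliberately incomplete set of Unicode ranges for the space-less scripts (a real implementation would enumerate many more writing-system blocks):

```python
import re

# Illustrative ranges only: CJK Unified Ideographs plus Japanese kana.
SPACELESS_SCRIPT = re.compile(r"[\u4e00-\u9fff\u3040-\u30ff]")

def effective_length(text: str) -> int:
    """Group 1 (space-less scripts): count characters.
    Group 2 (space-delimited scripts): count space-separated words."""
    if SPACELESS_SCRIPT.search(text):
        return len(text)
    return text.count(" ") + 1
```

For example, an 11-character English phrase of two words would count as 2, while a 4-character Chinese phrase would count as 4.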
I would be happy to create a multi-language word-counting library to handle this. It should be useful beyond this project as well.
I think a small test will help us understand the real relationship between input length and score.
Generally speaking, I don't want libraries to be overly complex; truncation is designed for edge cases.
However, we can still provide this tool independently, provided we have benchmarks to prove its effectiveness.
https://github.com/JackyHe398/len-sentence It's actually pretty lightweight: just simple regular-expression logic. The only complicated part is enumerating all the writing-system codes.
I might add this use case to the README to guide users on how to sample more effectively. I don’t plan to build it in; instead, I'll give users more flexibility in handling it, allowing for a more transparent process.