Improve language detection
Language detection works poorly in Chromium-based browsers.
https://github.com/user-attachments/assets/93782cf4-9454-4cad-8963-3cd1c861b024
Let's look for an alternative to the detector embedded in the browser.
The bug mostly reproduces when the user selects one or a few words, so the solution must primarily handle this use case.
We could run a contest between the browser API and the alternative solution and pick the language that gets the best score.
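A minimal sketch of such a contest, assuming each detector returns a list of candidates with a score normalized to the 0..1 range. The `Candidate` shape and the `pickBestCandidate` helper are hypothetical names for illustration, not part of any existing API, and scores from different detectors may need calibration before they are really comparable.

```typescript
// One candidate language reported by a detector.
// Field names are illustrative, not taken from any specific library.
interface Candidate {
  language: string; // e.g. "en"
  score: number; // assumed to be normalized to the 0..1 range
}

// Hypothetical helper: given the candidates from several detectors,
// return the single candidate with the highest score,
// or null if no detector reported anything.
function pickBestCandidate(resultsPerDetector: Candidate[][]): Candidate | null {
  let best: Candidate | null = null;
  for (const candidates of resultsPerDetector) {
    for (const candidate of candidates) {
      if (best === null || candidate.score > best.score) {
        best = candidate;
      }
    }
  }
  return best;
}
```

Here the first inner array would hold the browser API results and the second the results of the alternative detector. On a tie, the candidate seen first wins.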
@vitonsky
Yes, I have been dealing with this problem for a long time and wanted to open an issue about it myself. Then I stumbled on this one, which has a good demonstration of the behavior.
I did a little research on alternative auto-detection tools and didn't find any suitable open source ones :(
The ones that exist have at least one of these problems:
- a) They don't work with short text.
- b) They produce inaccurate results, like the one being reproduced here.
- c) They look interesting but are hard to evaluate. They can only be assessed through the automated tests in their source code, which are very limited and cover only the best cases, so it is difficult to draw firm conclusions from a quick review.
Here is what I looked at:
- https://laszlopandy.github.io/ts-language-detection/
- https://developer.mozilla.org/en-US/docs/Web/API/Translator_and_Language_Detector_APIs/Using#result
- https://komodojp.github.io/tinyld
- https://github.com/nitotm/efficient-language-detector-js
- https://github.com/fabiospampinato/lande
- And some others from the list: https://www.npmjs.com/search?page=0&q=language%20detection&sortBy=score
While doing this small research, I thought about something else: maybe we could try to build our own language detection solution? One based on the algorithms used in open source translators, such as LibreTranslate (https://github.com/LibreTranslate/LibreTranslate)?
Of course, it sounds like a complex and time-consuming task that goes beyond the scope of a browser translation extension. But so far, I don't see any other options.
@Steindvart yeah, it would be great to implement a language detection solution for short texts. If you are interested in doing it, you could implement it as an npm package, and then I would integrate it into Linguist.
As a backup plan, we could rethink when to guess the language. Maybe we should always use the page language when the detector can't detect the language precisely enough. Of course, that only works if the Chrome API reports honest scores.
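A rough sketch of that fallback, assuming the detector returns a language together with a score in the 0..1 range. The `Detection` shape, the `resolveLanguage` helper, and the `0.7` threshold are all hypothetical and would need tuning against real selections.

```typescript
// Best guess from a detector. Field names are illustrative.
interface Detection {
  language: string; // e.g. "en"
  score: number; // assumed to be in the 0..1 range
}

// Hypothetical helper: trust the detector only when it is confident enough,
// otherwise fall back to the language of the page.
// The default threshold is an arbitrary placeholder, not a measured value.
function resolveLanguage(
  detection: Detection | null,
  pageLanguage: string,
  minScore: number = 0.7,
): string {
  if (detection !== null && detection.score >= minScore) {
    return detection.language;
  }
  return pageLanguage;
}
```

For a one-word selection on an English page, a low-confidence guess such as `{ language: "nl", score: 0.3 }` would be ignored and `"en"` used instead.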