Improve language detection

Open vitonsky opened this issue 8 months ago • 4 comments

Language detection works poorly in Chromium-based browsers.

https://github.com/user-attachments/assets/93782cf4-9454-4cad-8963-3cd1c861b024

Let's look for an alternative to the detector built into the browser.

vitonsky avatar Jul 27 '25 20:07 vitonsky

The bug mostly reproduces when the user selects one or a few words, so the solution must primarily handle this use case.

We could run a contest between the browser API and an alternative solution, and choose the language that gets the best score.
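
A minimal sketch of that contest, just to make the idea concrete. The `Guess` shape, the `Detector` type and `pickBestGuess` are hypothetical names, not part of Linguist or of any browser API. It also assumes the scores from different detectors are comparable on a 0..1 scale, which may not hold in practice.

```typescript
// A candidate language with a confidence score (assumed to be in the range 0..1).
interface Guess {
  language: string; // language tag, e.g. "en"
  confidence: number; // higher means more certain
}

// Hypothetical detector signature: returns zero or more guesses for a text.
// A real browser API would likely be async; this is kept sync for brevity.
type Detector = (text: string) => Guess[];

// Run every detector and return the single most confident guess,
// or null when no detector returns anything.
function pickBestGuess(text: string, detectors: Detector[]): Guess | null {
  let best: Guess | null = null;
  for (const detect of detectors) {
    for (const guess of detect(text)) {
      if (best === null || guess.confidence > best.confidence) {
        best = guess;
      }
    }
  }
  return best;
}
```

If the scores turn out not to be comparable, each detector's output would need to be normalized before this comparison.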

vitonsky avatar Jul 27 '25 20:07 vitonsky

@vitonsky

Yes, I have been dealing with this problem for a long time and wanted to open an issue about it myself. Then I stumbled on this one, which has a good demonstration of the behavior.

I did a little research on alternative auto-detection tools and didn't find any suitable open source ones :(

The ones that exist have at least one of these problems:

  • a) They don't work with short text.
  • b) They produce inaccurate results, like the ones currently being reproduced.
  • c) They look interesting but are hard to try out. They can only be evaluated through the automated tests in their source code, which are very limited and cover only the best cases, so it is difficult to draw definitive conclusions about their effectiveness from a quick review.
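
One way to compare candidates on short text more fairly than by reading their own tests would be a tiny accuracy check over hand-labeled short selections. This is only a sketch: `Sample`, `DetectFn` and `shortTextAccuracy` are hypothetical names, and each library would need a small adapter to fit the `DetectFn` shape.

```typescript
// A short text with the language tag we expect for it.
interface Sample {
  text: string;
  expected: string;
}

// Hypothetical adapter shape: the detector's top language, or null if it gives up.
type DetectFn = (text: string) => string | null;

// Fraction of samples for which the detector's top answer matches the label.
// Returns 0 for an empty sample list.
function shortTextAccuracy(detect: DetectFn, samples: Sample[]): number {
  if (samples.length === 0) return 0;
  let correct = 0;
  for (const sample of samples) {
    if (detect(sample.text) === sample.expected) correct++;
  }
  return correct / samples.length;
}

// Illustrative one-word samples, the case where the bug mostly reproduces.
const oneWordSamples: Sample[] = [
  { text: "hello", expected: "en" },
  { text: "bonjour", expected: "fr" },
  { text: "danke", expected: "de" },
  { text: "gracias", expected: "es" },
];
```

Running every candidate over the same sample list would give one comparable number per library.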

Here is what I looked at:

  • https://laszlopandy.github.io/ts-language-detection/
  • https://developer.mozilla.org/en-US/docs/Web/API/Translator_and_Language_Detector_APIs/Using#result
  • https://komodojp.github.io/tinyld
  • https://github.com/nitotm/efficient-language-detector-js
  • https://github.com/fabiospampinato/lande
  • And some others from the list: https://www.npmjs.com/search?page=0&q=language%20detection&sortBy=score

steindvart avatar Nov 01 '25 09:11 steindvart

While doing this small research, I thought about something else: maybe we could try to build our own language detection solution? One based on the algorithms used in open source translators, such as LibreTranslate (https://github.com/LibreTranslate/LibreTranslate)?

Of course, it sounds like a complex and time-consuming task that goes beyond the scope of a browser translation extension. But so far, I don't see any other options.

steindvart avatar Nov 01 '25 09:11 steindvart

@Steindvart yeah, it would be great to implement a language detection solution for short texts. If you are interested in doing it, you could publish it as an npm package and I would then integrate it into Linguist.

As a backup plan, we could rethink when to guess the language. Maybe we should always use the page language when the detector can't detect the language confidently enough. That only works if the Chrome API's scores are honest, of course.
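
A rough sketch of that fallback rule, under a few assumptions: `DetectionResult` and `resolveLanguage` are hypothetical names, the confidence is on a 0..1 scale, and the `0.7` threshold is an arbitrary placeholder that would need tuning.

```typescript
// Top result from a detector (hypothetical shape, confidence assumed 0..1).
interface DetectionResult {
  language: string;
  confidence: number;
}

// Prefer the detector when it is confident enough, otherwise fall back
// to the page language. If the page declares no language, use the
// low-confidence guess as a last resort, or null when there is nothing.
function resolveLanguage(
  guess: DetectionResult | null,
  pageLanguage: string | null,
  minConfidence = 0.7, // assumed threshold, not a measured value
): string | null {
  if (guess !== null && guess.confidence >= minConfidence) {
    return guess.language;
  }
  if (pageLanguage !== null && pageLanguage.trim() !== "") {
    return pageLanguage.trim();
  }
  return guess !== null ? guess.language : null;
}
```

In a content script, the page language could be read from `document.documentElement.lang`, which is an empty string when the page does not declare one.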

vitonsky avatar Nov 01 '25 10:11 vitonsky