Support Language Detection

Open PJ-Finlay opened this issue 4 years ago • 13 comments

The plan for this was to train a model, using the existing infrastructure, that maps input text to a language code. This would require adding a way to generate this data in the training scripts, plus what is hopefully a small code change to support it. I'm optimistic that this would work well out of the box, but it may take some tweaking.
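
As a rough sketch of the data-generation step, assuming the detector is trained like any other translation pair with the language code as the target "sentence": the function name and the in-memory corpus format below are hypothetical and are not part of the existing training scripts.

```python
import random


def build_detection_pairs(corpora, seed=0):
    """Turn {language_code: [sentences]} into shuffled (source, target) pairs.

    The target side is just the language code, so a sequence-to-sequence
    model would learn to "translate" text into its code.
    """
    pairs = [
        (sentence.strip(), code)
        for code, sentences in corpora.items()
        for sentence in sentences
        if sentence.strip()  # skip blank lines
    ]
    random.Random(seed).shuffle(pairs)  # deterministic shuffle for reproducibility
    return pairs


# Tiny illustrative corpus; real training data would be far larger.
corpora = {
    "en": ["Hello, how are you?", "The weather is nice today."],
    "es": ["Hola, ¿cómo estás?", "Hoy hace buen tiempo."],
}
pairs = build_detection_pairs(corpora)
```

Each pair could then be written to the source and target files that the training scripts already consume.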

PJ-Finlay avatar Dec 21 '20 23:12 PJ-Finlay

This would be pretty useful for any automated translation mechanism!

pierotofy avatar Dec 23 '20 03:12 pierotofy

Interesting. I think using the same pipeline would be a good long-term solution, but this could be something to do in the meantime. One issue with using the pipeline is that as soon as we add a new language we also have to retrain the detector. This would probably also be lighter weight than a 100MB model file. The main interest for this is currently from LibreTranslate, so if someone wants to extend the Python API to use this, that would be welcome, and the API could then be reimplemented in the future if it makes sense.

PJ-Finlay avatar Jan 12 '21 13:01 PJ-Finlay

Some support was added to LibreTranslate in https://github.com/uav4geo/LibreTranslate/pull/12

thomas536 avatar Jan 18 '21 19:01 thomas536

Recently I saw an article comparing language detection tools. FastText could be a viable alternative to langdetect because it is a lot faster.

We have another option that can be quite accurate for longer texts: n-grams. There are predetermined n-gram lists for all supported languages, and it is easy to generate new ones. The advantages of this approach are that the models are really small, the implementation is easy, and it does not need any extra library. In any case, if help is needed, I can implement this.
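
A minimal sketch of the n-gram idea using only the standard library. It uses the rank-based "out-of-place" comparison of character trigram profiles (the Cavnar and Trenkle approach), which is one common way to do this and not necessarily the method meant above. The sample sentences and function names are assumptions for illustration; real profiles would be built from much larger corpora.

```python
from collections import Counter

TOP = 300  # keep only the most frequent n-grams per profile


def ngram_profile(text, n=3, top=TOP):
    """Return the text's most frequent character n-grams, most common first."""
    text = " " + " ".join(text.lower().split()) + " "  # normalize whitespace
    counts = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return [gram for gram, _ in counts.most_common(top)]


def out_of_place(doc_profile, lang_profile, max_penalty=TOP):
    """Sum of rank differences; n-grams missing from the language cost max_penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    return sum(
        abs(i - rank[gram]) if gram in rank else max_penalty
        for i, gram in enumerate(doc_profile)
    )


def detect(text, profiles):
    """Return the language code whose profile is closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda code: out_of_place(doc, profiles[code]))


# Toy training samples (illustrative only).
SAMPLES = {
    "en": "the quick brown fox jumps over the lazy dog and the cat sleeps in the sun",
    "es": "el rápido zorro marrón salta sobre el perro perezoso y el gato duerme al sol",
}
profiles = {code: ngram_profile(sample) for code, sample in SAMPLES.items()}

guess = detect("the dog sleeps in the sun", profiles)
```

The profiles are plain lists of strings, so they could be shipped as small text or JSON files, one per language.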

hollorol avatar Oct 30 '21 07:10 hollorol

@hollorol If you can do this with just the Python standard library, a pull request would be appreciated.

PJ-Finlay avatar Oct 30 '21 21:10 PJ-Finlay

@PJ-Finlay, I'll do it only for the CLI because I don't use the GUI part of the program, but I guess adapting it to the GUI afterwards will be easy.

hollorol avatar Oct 31 '21 10:10 hollorol

That sounds good. It should probably be its own file/module that can be integrated into the CLI.

PJ-Finlay avatar Oct 31 '21 12:10 PJ-Finlay

Lingua might be useful for this. Lingua is written in Python, works with short strings, works offline, and is licensed under Apache-2.0.

TechnologyClassroom avatar Jan 11 '22 21:01 TechnologyClassroom

LibreTranslate already has a system for language detection, so this hasn't been a priority. My plan was to use CTranslate2 models to map input text to a language code, but I'm open to suggestions.

PJ-Finlay avatar Jan 11 '22 23:01 PJ-Finlay

Not everyone uses LibreTranslate.

TechnologyClassroom avatar Jan 12 '22 14:01 TechnologyClassroom

The way Argos Translate currently works, it would be a breaking change to add this, but I'm planning to add it in the next major version. It would also be possible to add language detection to the GUI (which is in a separate repo) using a third-party library like Lingua.

PJ-Finlay avatar Jan 13 '22 00:01 PJ-Finlay

I could see it being used like a special input that would trigger the language detection. Syntax could be something like this:

echo "Text to translate" | argos-translate --from-lang auto-detect --to-lang en
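
A small sketch of how such a special value might be handled, assuming a simple argparse front end. This is not the actual Argos Translate CLI code: the `AUTO_DETECT` constant, `resolve_from_lang`, and the stand-in `fake_detect` function are all hypothetical.

```python
import argparse

AUTO_DETECT = "auto-detect"  # hypothetical sentinel value for --from-lang


def resolve_from_lang(from_lang, text, detect):
    """Return a concrete language code, running detection only for the sentinel."""
    if from_lang == AUTO_DETECT:
        return detect(text)
    return from_lang


def parse_args(argv):
    parser = argparse.ArgumentParser(prog="argos-translate")
    parser.add_argument("--from-lang", required=True)
    parser.add_argument("--to-lang", required=True)
    return parser.parse_args(argv)


def fake_detect(text):
    """Stand-in detector for the example; a real one would inspect the text."""
    return "es"


args = parse_args(["--from-lang", "auto-detect", "--to-lang", "en"])
source_code = resolve_from_lang(args.from_lang, "Hola mundo", fake_detect)
```

Keeping the check in one helper means explicit language codes pass through unchanged, so existing invocations would behave as before.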

TechnologyClassroom avatar Jan 13 '22 15:01 TechnologyClassroom

This is the way to do it for core Argos Translate. The only thing I might change is using "detect" instead of "auto-detect".

PJ-Finlay avatar Jan 14 '22 00:01 PJ-Finlay