dl-translate
Detect source language with langdetect package
The langdetect package has worked well for me in the past for language detection problems. How would you feel about allowing users to pass 'auto' as an option for source? I could see some pros and cons:
Pros
- Users don't need to be able to recognize a language to translate
- Eliminates pre-classification of languages if your dataset contains multiple languages
Cons
- Adds another dependency
- langdetect detects these 55 languages only
I'm a little new to open source but I would love to contribute 🙂 Of course, if you feel this doesn't fit this package's mission that's totally understandable.
Hey, langdetect is cool! However, it seems there are many options for language detection, including fasttext and langid.py. Each option has its own accuracy (none of them is 100%) and speed, so I feel it might be difficult to choose for the end user.
Also, since we are now using m2m100 by default, it might create confusion for users who try to auto-detect a language that's not available with the chosen detection algorithm (but is available in m2m100).
I think a good option would be to start with a section in the user guide showing how to use any (or all) of the language detection libraries. Then from there, we could build a util function along the lines of:
src = dlt.lang.detect(source_text, backend="fasttext") # or backend="langdetect" or backend="langid"
mt.translate(source_text, source=src,...)
This would throw an error asking the user to install the corresponding library if they want to use a specific backend.
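As a rough illustration of the idea above, here is a minimal sketch of a backend-dispatching `detect` helper with lazy imports. It is not part of dl-translate: the names `detect`, `_load_backend`, `_BACKENDS`, and the `fasttext_model_path` parameter are hypothetical. It assumes the public entry points `langdetect.detect`, `langid.classify`, and `fasttext.load_model` / `model.predict`, and that the fasttext user has already downloaded a language-identification model such as `lid.176.bin`.

```python
import importlib

# Hypothetical helper, not part of dl-translate: maps each backend name
# to the third-party module it needs.
_BACKENDS = {"langdetect": "langdetect", "langid": "langid", "fasttext": "fasttext"}


def _load_backend(backend):
    """Import the module for `backend`, or explain how to install it."""
    if backend not in _BACKENDS:
        raise ValueError(f"Unknown backend {backend!r}; choose from {sorted(_BACKENDS)}")
    try:
        # Imported lazily so that none of the libraries is a hard dependency.
        return importlib.import_module(_BACKENDS[backend])
    except ImportError as exc:
        raise ImportError(
            f"backend={backend!r} requires the '{_BACKENDS[backend]}' package; "
            f"try `pip install {_BACKENDS[backend]}`"
        ) from exc


def detect(text, backend="langdetect", fasttext_model_path=None):
    """Return the language code predicted by the chosen backend."""
    module = _load_backend(backend)
    if backend == "langdetect":
        return module.detect(text)
    if backend == "langid":
        lang, _score = module.classify(text)
        return lang
    # Remaining case: fasttext, which needs a pretrained language-ID model file.
    if fasttext_model_path is None:
        raise ValueError("fasttext needs a path to a language-ID model such as lid.176.bin")
    model = module.load_model(fasttext_model_path)
    # fasttext predicts one line at a time, so newlines are replaced first.
    labels, _probs = model.predict(text.replace("\n", " "))
    return labels[0].replace("__label__", "")
```

The codes returned by each backend are not guaranteed to match what the translation model expects (langdetect, for example, can return region-tagged codes such as "zh-cn"), so a real implementation would probably need a mapping step before passing the result as source.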
Those are some good points. I agree it would be confusing to have the library detect a language but not translate it. I'll take a look into writing something that could potentially be put into the user guide.
Thank you. Once we have something in the user guide I'd welcome another PR that'd update dlt.utils or dlt.lang as well, if you wish!
Hi, any updates on this issue? Is there any hint for making the source language auto-detected?
@banyous Feel free to contribute a section in the user guide about using language detection, and from there, if we feel a wrapper around fasttext would make life easier, then I'm happy to welcome a PR to add language detection to dlt.utils or dlt.lang.
I think this is a decent starting point: https://fasttext.cc/docs/en/language-identification.html