langdetect
Port of Google's language-detection library to Python.
Hi, I sometimes get a LangDetectException which tells me: 'Need to load profiles.' Is there a way to check if all languages have been loaded before calling the detect method?...
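One way to cope with the 'Need to load profiles.' error is to catch the exception instead of checking the loading state up front. The following is a minimal sketch, not a way to query whether profiles are loaded: `safe_detect` is a hypothetical helper, and the detector and exception types are passed in as arguments so the snippet runs on its own.

```python
from typing import Callable, Optional, Tuple, Type


def safe_detect(
    text: str,
    detect_fn: Callable[[str], str],
    error_types: Tuple[Type[BaseException], ...],
) -> Optional[str]:
    """Return detect_fn(text), or None if one of error_types is raised.

    Hypothetical helper: it does not inspect whether profiles are loaded,
    it only turns a detection failure into a None result.
    """
    try:
        return detect_fn(text)
    except error_types:
        return None


# Assumed usage with langdetect installed:
#   from langdetect import detect
#   from langdetect.lang_detect_exception import LangDetectException
#   lang = safe_detect("Bonjour tout le monde", detect, (LangDetectException,))
```

Passing the exception types explicitly keeps unrelated errors (for example a `TypeError` from bad input) visible instead of silently mapping them to `None`.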
When I use `detect` on French text, the language detection is way off. For example: `bonjour` -> 'hr' (Croatian), `je m'appelle` -> 'sl' (Slovenian).
In networking there is a time constraint, so only a limited time x is available for detection. Not only does `timeit.timeit(lambda: detect("War doesn't s left."), number=1000)` take 34 s, it also returns 'nl' instead of English...
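The 34 s figure is the total for 1000 calls, i.e. roughly 34 ms per call. A per-call measurement can be sketched as below. The helper names `mean_seconds_per_call` and `fits_budget` are hypothetical, and the warm-up call rests on the assumption that the first call may include one-time setup such as profile loading.

```python
import timeit
from typing import Callable


def mean_seconds_per_call(fn: Callable[[], object], number: int = 1000) -> float:
    """Average wall-clock seconds per call of fn over `number` runs."""
    fn()  # warm-up call, excluded from the measurement
    return timeit.timeit(fn, number=number) / number


def fits_budget(fn: Callable[[], object], budget_seconds: float, number: int = 100) -> bool:
    """True if the average call time of fn is within budget_seconds."""
    return mean_seconds_per_call(fn, number) <= budget_seconds


# Assumed usage with langdetect installed:
#   from langdetect import detect
#   per_call = mean_seconds_per_call(lambda: detect("some sample text"))
```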
The library is unable to detect the language of basic English words and hence produces inaccurate results, as shown below. `detect("sunday")` => 'id', whereas 'sunday' in Indonesian is clearly 'minggu'...
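Single words give the detector very little text to work with, so a caller may prefer to reject low-confidence answers rather than trust them. A minimal sketch, assuming the candidates expose `.lang` and `.prob` as the objects returned by langdetect's `detect_langs` do; `confident_language` and the 0.9 threshold are hypothetical choices, and this does not help when a wrong answer is reported with high probability.

```python
from typing import Iterable, Optional


def confident_language(candidates: Iterable, min_prob: float = 0.9) -> Optional[str]:
    """Return the top candidate's language code, or None if it is below min_prob.

    Each candidate is expected to expose `.lang` and `.prob`.
    """
    best = max(candidates, key=lambda c: c.prob, default=None)
    if best is None or best.prob < min_prob:
        return None
    return best.lang


# Assumed usage with langdetect installed:
#   from langdetect import detect_langs
#   lang = confident_language(detect_langs("sunday"))
```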
Added a list of languages as a restriction for `load_profiles`. Also implemented in `detect(text, languages=[])` and `detect_langs(text, languages=[])`. The `_factory` is reloaded automatically when the language selection changes.
The original langdetect in C++ has a very nice "early abort" efficiency optimization. Could "detect" accept some form of lazy-loading (I'd suggest being able to pass a python file object),...
E.g. if I only want the confidence for English, as in `detect(text, 'en')`, is this possible? I may just fork and add this feature. I realize it is a non-deterministic, possibly softmax output but...
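Without forking, the confidence for a single language can be read from the candidate list that `detect_langs` returns. This is a sketch under assumptions: `probability_of` is a hypothetical helper, and it treats a language missing from the list as probability 0.0, since, as far as I know, `detect_langs` only returns the candidates it considers likely.

```python
from typing import Iterable


def probability_of(candidates: Iterable, lang: str) -> float:
    """Probability assigned to `lang`, or 0.0 if it is not among the candidates.

    Each candidate is expected to expose `.lang` and `.prob`.
    """
    for candidate in candidates:
        if candidate.lang == lang:
            return candidate.prob
    return 0.0


# Assumed usage with langdetect installed:
#   from langdetect import detect_langs
#   p_en = probability_of(detect_langs("This is an English sentence."), "en")
```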
In: `langdetect.detect(u'就了快速大幅')` Out: 'ko'. But the string is definitely Chinese. The problem is that there are many Chinese characters in profiles/ko, so I removed them using the script fix-ko.py.
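The idea of removing Chinese characters from a profile can be sketched as follows. This is not the fix-ko.py script itself. It assumes a profile is a JSON object with a `freq` mapping of n-grams to counts and an `n_words` list of per-length totals, and it treats only the CJK Unified Ideographs block (U+4E00 to U+9FFF) as "Chinese characters". `strip_han_ngrams` is a hypothetical name.

```python
import re
from typing import Any, Dict

# CJK Unified Ideographs block; Hangul syllables lie outside this range.
_HAN = re.compile(r"[\u4e00-\u9fff]")


def strip_han_ngrams(profile: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of profile without n-grams that contain a Han character.

    Assumed layout: profile["freq"] maps n-gram -> count, and
    profile["n_words"][k] is the total count of n-grams of length k + 1.
    """
    freq = {}
    n_words = list(profile["n_words"])
    for ngram, count in profile["freq"].items():
        if _HAN.search(ngram):
            index = len(ngram) - 1
            if 0 <= index < len(n_words):
                n_words[index] -= count  # keep the per-length totals consistent
        else:
            freq[ngram] = count
    return {**profile, "freq": freq, "n_words": n_words}
```

With a real profile file, the dictionary would be read with `json.load` and written back with `json.dump(..., ensure_ascii=False)`. Whether the totals in `n_words` should be adjusted is a design choice in this sketch, not something confirmed by the original script.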
Profiles generated from Wikipedia abstracts ([kuwiki-20170520-abstract.xml](https://dumps.wikimedia.org/kuwiki/20170520/kuwiki-20170520-abstract.xml) and [ckbwiki-20170520-abstract.xml](https://dumps.wikimedia.org/ckbwiki/20170520/ckbwiki-20170520-abstract.xml) for Sorani Kurdish (ku) and Central Kurdish (ckb) languages respectively)
Added a si (Sinhalese) profile to langdetect. Tested for functionality.