langid.py

Stand-alone language identification system

Results 29 langid.py issues

Hello, I get the following error when I run IGweight.py: `computing information gain Traceback (most recent call last): File "/home/motaz/tmp/langid.py/langid/train/IGweight.py", line 246, in ig = compute_IG(bucketlist, features, dist, args.binarize, suffix,...

Hi, I found that the pre-trained model gives bad results on normalized Hindi, e.g. print(identifier.rank("तुम कहाँ जा रहे हो")) # this one is correct [('hi', 0.5811032824612302), ('ne', 0.41881502578401597), ('mr', 8.169175475378807e-05)....] print(identifier.rank("tum...

I'm noticing something very peculiar with all-caps strings. Running `langid -n`: ``` >>> ceci est une phrase française ('fr', 0.9999966296917099) >>> CECI EST UNE PHRASE FRANÇAISE ('pt', 0.4985860132092562) >>> this...

When processing full-width letters, it returns "Chinese" as result: `>>> import langid` `>>> langid.classify('ABC')` `('zh', 0.9668056948707975)`
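A possible pre-processing step for this report, offered only as a sketch: fold full-width characters to their ordinary forms with Unicode NFKC normalization before classification. The helper name `fold_width` is hypothetical, the example assumes the original input used full-width Latin letters (U+FF21 to U+FF23), and whether this changes langid's prediction is not verified here.

```python
import unicodedata


def fold_width(text: str) -> str:
    """Map full-width and other compatibility characters to their canonical forms (NFKC)."""
    return unicodedata.normalize("NFKC", text)


# Assumed input: full-width "ABC" written with escapes so the code points are unambiguous.
folded = fold_width("\uff21\uff22\uff23")
print(folded)
```

The folded string could then be passed to `langid.classify` in place of the raw input.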

When I input strings like 'hello world hello world hello world', _langid_ can't identify it as English text. `>>> import langid` `>>> langid.classify('hello world hello world hello world')` `('af', 0.683057652874482)`

Knowing that most lang id systems perform worse on short strings, I have been experimenting with normalising the length: ``` MIN_LEN = 30 id = langid.rank(s)[0] print langid.rank(s)[0] while len(s)...
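The loop in this snippet is cut off, so the poster's exact approach is unknown. As a rough illustration of the general idea of length normalisation, the sketch below repeats a short string until it reaches a minimum length. The helper `pad_by_repetition` and the choice of a space as separator are assumptions, not the poster's code; only `MIN_LEN = 30` comes from the snippet.

```python
MIN_LEN = 30  # value taken from the snippet above


def pad_by_repetition(s: str, min_len: int = MIN_LEN) -> str:
    """Repeat s, separated by spaces, until the result has at least min_len characters."""
    if not s:
        return s  # nothing to repeat; returning early avoids an infinite loop
    padded = s
    while len(padded) < min_len:
        padded += " " + s
    return padded


print(pad_by_repetition("hello world"))
```

The padded string could be passed to `langid.rank` and compared with the ranking for the unpadded input. Repetition alone is not guaranteed to help, so the comparison is the point of the experiment.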

Hello, I have some sample texts, which originate in PDFs, with my goal being to classify the language automatically. I've extracted the text content with **pdfminer** and whilst langid works...

I have noticed that 'wunderbar' is classified as Chinese, but only sometimes. Well, you see why: ``` >>> langid.rank(' wunderbar') [('de', 0.9778415187189662), ('ms', 0.010616691993507496), ('rw', 0.005629123117595187), ('jv', 0.002381279333979642), ('en', 0.0012907605583217631),...

The gzip problem reported in issue #20 still persists after cloning a fresh copy of langid.py. Added the following workaround for NBtrain.py to call open() in common.py instead of gzip.open()
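The poster's actual patch is not shown in this listing. As a generic illustration of the same kind of fallback, and not the code from langid.py, the sketch below checks a file's first two bytes for the gzip magic number and picks `gzip.open` or the built-in `open` accordingly. The helper name `open_maybe_gzip` is hypothetical.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def open_maybe_gzip(path, mode="rb"):
    """Open path with gzip.open if it starts with the gzip magic bytes, else with open."""
    with open(path, "rb") as probe:
        is_gzip = probe.read(2) == GZIP_MAGIC
    opener = gzip.open if is_gzip else open
    return opener(path, mode)
```

A caller would use it like the built-in, for example `with open_maybe_gzip(path) as f: data = f.read()`, and receive the decompressed bytes whether or not the file was compressed.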