langid.py issues

error in running IGweight.py

2

Hello, I have the following error when I run IGweight.py: `computing information gain Traceback (most recent call last): File "/home/motaz/tmp/langid.py/langid/train/IGweight.py", line 246, in <module> ig = compute_IG(bucketlist, features, dist, args.binarize, suffix,...

motazsaad

normalization in hindi

Hi, I found that the pre-trained model gives bad results on normalized Hindi, e.g. print(identifier.rank("तुम कहाँ जा रहे हो")) # this one is correct [('hi', 0.5811032824612302), ('ne', 0.41881502578401597), ('mr', 8.169175475378807e-05)....] print(identifier.rank("tum...

godkillok

Different classification for upper/lower-case sentences

I'm noticing something very peculiar with all-caps strings. Running `langid -n`:

```
>>> ceci est une phrase française
('fr', 0.9999966296917099)
>>> CECI EST UNE PHRASE FRANÇAISE
('pt', 0.4985860132092562)
>>> this...
```

plu97
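A workaround that could be tried for this kind of case sensitivity is to lowercase the input before classifying it. The sketch below is only an illustration: `classify_lowercased` is a hypothetical helper, not part of langid.py, and it takes the classifier as a callable (for example `langid.classify`) so that nothing about the library's internals is assumed. The report does not establish whether lowercasing actually improves results.

```python
from typing import Callable, Tuple


def classify_lowercased(
    text: str,
    classify: Callable[[str], Tuple[str, float]],
) -> Tuple[str, float]:
    """Lowercase `text`, then pass it to `classify`.

    `classify` is any callable returning (language, score),
    such as langid.classify. This wrapper is hypothetical.
    """
    # str.lower() handles non-ASCII letters too, e.g. 'Ç' -> 'ç'.
    return classify(text.lower())
```

With langid installed, this could be called as `classify_lowercased("CECI EST UNE PHRASE FRANÇAISE", langid.classify)`. Lowercasing discards information in scripts where case is meaningful, so it is best treated as an experiment rather than a fix.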

Detection error when processing full-width letters

1

When processing full-width letters, langid returns Chinese as the result: `>>> import langid` `>>> langid.classify('ＡＢＣ')` `('zh', 0.9668056948707975)`

joewong826
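One preprocessing step that may help with full-width input is Unicode NFKC normalization, which maps full-width Latin letters to their ASCII counterparts. The sketch below uses only the standard library; `fold_width` is a hypothetical helper name, and whether the normalized string is then classified correctly is not something the report confirms.

```python
import unicodedata


def fold_width(text: str) -> str:
    """Return `text` in NFKC form.

    NFKC replaces compatibility characters with their canonical
    equivalents, so full-width 'Ａ' (U+FF21) becomes ASCII 'A'.
    """
    return unicodedata.normalize("NFKC", text)
```

The result could then be passed on, for example `langid.classify(fold_width('ＡＢＣ'))`. NFKC also rewrites other compatibility characters (ligatures, half-width katakana, and so on), so it can change text beyond full-width letters.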

Repetition of words causes detection error

2

When I input strings like 'hello world hello world hello world', _langid_ can't identify them as English text: `>>> import langid` `>>> langid.classify('hello world hello world hello world')` `('af', 0.683057652874482)`

joewong826

Repeating string yields different results

Knowing that most lang id systems perform worse on short strings, I have been experimenting with normalising the length:

```
MIN_LEN = 30
id = langid.rank(s)[0]
print langid.rank(s)[0]
while len(s)...
```

bittlingmayer

Seeking advice regarding classification problem only present with Chinese

4

Hello, I have some sample texts, which originate in PDFs, with my goal being to classify the language automatically. I've extracted the text content with **pdfminer** and whilst langid works...

nmstoker

some quotes ("️) causes classification as Chinese

1

I have noticed that 'wunderbar' is classified as Chinese, but only sometimes. Well, you see why:

```
>>> langid.rank(' wunderbar')
[('de', 0.9778415187189662), ('ms', 0.010616691993507496), ('rw', 0.005629123117595187), ('jv', 0.002381279333979642), ('en', 0.0012907605583217631),...
```

bittlingmayer
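If the trigger is an invisible character attached to the quote, one thing to try is removing Unicode variation selectors before classification. This is an assumption: the issue title appears to contain U+FE0F after the quote mark, but the excerpt does not confirm which character is responsible. `strip_variation_selectors` is a hypothetical helper, not part of langid.py.

```python
def strip_variation_selectors(text: str) -> str:
    """Remove variation selectors U+FE00 through U+FE0F from `text`.

    These code points are invisible and only alter how the
    preceding character is rendered (e.g. emoji presentation).
    """
    return "".join(ch for ch in text if not ("\ufe00" <= ch <= "\ufe0f"))
```

The cleaned string could then be ranked as usual, for example `langid.rank(strip_variation_selectors(s))`. The helper deliberately leaves other invisible characters alone, since some of them (such as zero-width joiners) are meaningful in certain scripts.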

Fixing gzip problem in NBtrain.py.

1

The gzip problem reported in issue #20 still persists after cloning a fresh copy of langid.py. I added the following workaround so that NBtrain.py calls open() in common.py instead of gzip.open().

gr33ndata
