
Seeking advice regarding classification problem only present with Chinese

Open nmstoker opened this issue 9 years ago • 4 comments

Hello,

I have some sample texts, which originate in PDFs, and my goal is to classify their language automatically. I've extracted the text content with pdfminer, and whilst langid works excellently with my samples in a variety of languages, it has problems when I run it on Chinese (I have samples in both simplified and traditional characters): it always suggests 'en'.
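(For concreteness, the extract-then-classify step is roughly the usual pdfminer recipe sketched below; 'sample.pdf' is a placeholder path and the exact pdfminer calls may differ between versions.)

```python
# -*- coding: utf-8 -*-
# Sketch of the extract-then-classify pipeline (classic pdfminer API,
# Python 2.7). 'sample.pdf' is a placeholder path.
from cStringIO import StringIO

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage
import langid

def pdf_to_text(path):
    rsrcmgr = PDFResourceManager()
    buf = StringIO()
    device = TextConverter(rsrcmgr, buf, codec='utf-8', laparams=LAParams())
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    with open(path, 'rb') as fp:
        for page in PDFPage.get_pages(fp):
            interpreter.process_page(page)
    device.close()
    return buf.getvalue().decode('utf-8')  # return a unicode string

print(langid.classify(pdf_to_text('sample.pdf')))
```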

Does anyone have any advice on how I should approach investigating what the problem might be?

Are there any standard example documents that I could try that would confirm there isn't something quirky with my PDF extraction?

I could be wrong, but I don't think it's necessarily a UTF-8 encoding issue as I have managed to get it working with other non-Latin texts (eg Cyrillic).

The languages that I've found to work with my samples, so far, are: en, it, de, ru. I will be checking pt, fr, pl and ja ones shortly.

There is a tiny portion of English in the header section, but that does not throw off the language detection for the other samples, and I have tried focusing on pages where the body text is entirely Chinese and present in significantly larger quantities than the header.

It also makes no difference if I preselect the languages (unfortunately English, the falsely suggested language, has to stay in the list, as there are likely to be English samples present):

langid.set_languages(['en','es','pt','fr','ru','pl','de','it','ja', 'zh'])

Even if I take English out, it merely suggests a different wrong language (eg German), although the confidence level is fairly low (typically 0.16 to 0.25, whether it guesses English or German).
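(For reference, a minimal sketch of how the normalised confidences and the full ranking can be inspected with langid's LanguageIdentifier; the Chinese string below is just placeholder text, not one of my samples.)

```python
# -*- coding: utf-8 -*-
# Inspect the full ranking with normalised probabilities rather than just
# the top guess. The Chinese string is placeholder text, not a real sample.
from langid.langid import LanguageIdentifier, model

identifier = LanguageIdentifier.from_modelstring(model, norm_probs=True)
identifier.set_languages(['en', 'es', 'pt', 'fr', 'ru', 'pl', 'de', 'it', 'ja', 'zh'])

sample = u"这是一段用来测试语言识别的中文文本。"
print(identifier.classify(sample))        # top language and its probability
for lang, prob in identifier.rank(sample)[:5]:
    print(lang, prob)                     # near-ties hint at feature problems
```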

My setup is Windows 7 with Python 2.7 (needed due to PDFMiner, although I could try Python 3.5 if that were thought to solve the issue).

Many thanks, Neil

nmstoker, Mar 01 '16 13:03

Are you sure the documents are in UTF-8? Windows software would often default to UTF-16 (if not some legacy code page).
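One quick way to check, assuming the extracted text has been saved to a file first (the path below is a placeholder), is to look for a byte-order mark and try a strict UTF-8 decode:

```python
# Rough check of what encoding the extracted file is actually in.
# 'extracted.txt' is a placeholder path.
import codecs

with open('extracted.txt', 'rb') as f:
    raw = f.read()

if raw.startswith(codecs.BOM_UTF8):
    print('UTF-8 with BOM')
elif raw.startswith(codecs.BOM_UTF16_LE) or raw.startswith(codecs.BOM_UTF16_BE):
    print('UTF-16')
else:
    try:
        raw.decode('utf-8')               # strict decode
        print('decodes cleanly as UTF-8')
    except UnicodeDecodeError:
        print('not valid UTF-8 -- maybe UTF-16 without a BOM or a legacy code page')
```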

tripleee, Mar 01 '16 13:03

This definitely sounds like an encoding issue on the document side. When we trained langid.py we tried to include a representative sample of encodings, but I think the coverage for Chinese might be pretty poor. It's possible to retrain langid.py, but this requires a bit of effort and training data. As @tripleee points out, Windows often uses UTF-16, and quite a bit of the langid.py training data is in UTF-8. The easiest thing to try might be to transcode all documents to UTF-8 (perhaps PDFMiner supports this directly? I'm not familiar) and try again. Hope that helps!
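A rough sketch of that transcoding step, if the extracted text does turn out to be UTF-16 (the filenames and the source encoding here are assumptions):

```python
# -*- coding: utf-8 -*-
# Decode a (possibly UTF-16) extracted text file and re-classify it.
# 'extracted.txt', 'extracted_utf8.txt' and the 'utf-16' source encoding
# are assumptions; adjust to whatever the real files/encodings are.
import io
import langid

with io.open('extracted.txt', 'r', encoding='utf-16') as f:
    text = f.read()                        # now a unicode string

with io.open('extracted_utf8.txt', 'w', encoding='utf-8') as f:
    f.write(text)                          # UTF-8 copy, if a file is needed

# langid also accepts the unicode string directly:
print(langid.classify(text))
```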

saffsd, Mar 08 '16 21:03

For what it's worth, I see the opposite issue: a bias towards Chinese. For example:

- ¡No! (only 24%)
- ‪#‎WCIT
- Tʻagavorn apracʻ kenna
- ՏԵՍԱՆՅՈՒԹ
- #Cizre
- #MustRead (only 77%)
- Աֆրիկա (2nd, only 14%)

All are identified as Chinese, generally with > 98% probability.

Perhaps the Chinese data is actually all in the Latin alphabet? This should be the easiest language to keep separate, so it reeks of a fundamental bug or a preprocessing issue.

bittlingmayer, Mar 10 '16 12:03

Pardon, it looks like in most cases it is the result of invisible characters in dirty data. (But ՏԵՍԱՆՅՈՒԹ and ¡No! are clean, and the ʻ in Tʻagavorn apracʻ kenna is not so exotic.)
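A simple cleanup pass along these lines (the input string is just illustrative) strips Unicode format characters such as zero-width marks and bidi controls before classification:

```python
# -*- coding: utf-8 -*-
# Drop invisible Unicode "format" characters (category Cf: zero-width
# spaces/joiners, bidi marks, etc.) before handing text to langid.
import unicodedata
import langid

def strip_invisible(text):
    return u''.join(ch for ch in text if unicodedata.category(ch) != 'Cf')

dirty = u'\u202a#\u200eWCIT'               # hashtag with embedded format marks
print(langid.classify(dirty))
print(langid.classify(strip_invisible(dirty)))
```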

bittlingmayer, Mar 10 '16 13:03