
Translating 1- or 2-character Chinese words

Open NijntjePluis opened this issue 9 years ago • 2 comments

Thank you for making this amazing tool!!

I have an easy issue. I use TextBlob to translate Chinese. In Translate.py, the detect method (the one with the docstring """Detect the source text's language.""") requires a minimum length of 3 characters; otherwise it throws an exception. This is not so handy for Chinese characters. For example, 好 means "it is good".

Maybe there is a way to change this without losing the effectiveness of detecting the correct language. If you can detect whether the input contains Chinese characters, you could drop the minimum length requirement for those inputs.
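To make the idea concrete, here is a minimal sketch in Python 3. It is not TextBlob's code: `has_cjk`, `long_enough`, and the `min_length` parameter are hypothetical names, and the check only covers the basic CJK Unified Ideographs block (U+4E00 to U+9FFF), which Chinese shares with Japanese kanji.

```python
import re

# Basic CJK Unified Ideographs block only (an assumption made for brevity);
# the extension blocks and other scripts are deliberately left out.
_CJK = re.compile(r"[\u4e00-\u9fff]")


def has_cjk(text):
    """Return True if `text` contains at least one basic CJK ideograph."""
    return bool(_CJK.search(text))


def long_enough(text, min_length=3):
    """Hypothetical length check: skip the minimum when CJK ideographs are present."""
    return has_cjk(text) or len(text) >= min_length
```

With this check, `long_enough("好")` passes even though the string is a single character, while `long_enough("f")` is still rejected.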

Thanks and keep up the good work :)

NijntjePluis avatar Feb 10 '16 09:02 NijntjePluis

Interesting - the Google Translate API returns 500 Internal Server Error on strings less than 3 characters long, which is presumably the reason for this limitation in the code:

>>> from urllib import urlencode
>>> from urllib2 import urlopen, Request
>>> headers={'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_6_8) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168 Safari/535.19'}
>>> url='http://translate.google.com/translate_a/t'
>>> data=urlencode({'text': 'f', 'oe': 'UTF-8', 'client': 'p', 'ie': 'UTF-8'}).encode('utf-8')
>>> data
'text=f&oe=UTF-8&client=p&ie=UTF-8'
>>> urlopen(Request(url=url, headers=headers, data=data))
urllib2.HTTPError: HTTP Error 500: Internal Server Error

However, your example works fine:

>>> data=urlencode({'text': '好', 'oe': 'UTF-8', 'client': 'p', 'ie': 'UTF-8'}).encode('utf-8')
>>> data
'text=%E5%A5%BD&oe=UTF-8&client=p&ie=UTF-8'
>>> urlopen(Request(url=url, headers=headers, data=data))
<addinfourl at 140466353800976 whose fp = <socket._fileobject object at 0x7fc0e15d6850>>

We can probably just remove the restriction, and raise if we get a 500 Error from the API. I'll test this a bit more, and make the change if possible.
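A rough sketch of that approach, written for Python 3 rather than the Python 2 session above. Everything here is an assumption for illustration: `TranslatorError` is a local stand-in for the library's error type, `fetch` is a hypothetical callable that performs the request, and `fake_fetch` is an offline stub that only imitates the two responses shown above (the language code it returns is made up).

```python
from urllib.error import HTTPError


class TranslatorError(Exception):
    """Hypothetical stand-in for the library's own error type."""


def detect(text, fetch):
    """Ask `fetch` for the language of `text` without any minimum-length check.

    `fetch` is a hypothetical callable that sends the request and returns
    the detected language code; it may raise HTTPError.
    """
    try:
        return fetch(text)
    except HTTPError as err:
        if err.code == 500:
            # Turn the API's 500 into a clearer, library-level error.
            raise TranslatorError(
                "Translation API could not process {!r}".format(text)
            ) from err
        raise  # Any other HTTP error is passed through unchanged.


def fake_fetch(text):
    """Offline stub imitating the session above: 'f' fails, anything else succeeds."""
    if text == "f":
        raise HTTPError("http://example.invalid", 500,
                        "Internal Server Error", None, None)
    return "zh-CN"  # Illustrative value only, not taken from a real response.
```

Passing the request function in as an argument keeps the sketch testable without network access; the real method would presumably call the API directly.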

jschnurr avatar Feb 11 '16 02:02 jschnurr

For now I have done the same to bypass the issue. Maybe the Google API counts the encoded form of the Chinese character (%E5%A5%BD in the request above, i.e. three bytes in UTF-8), which would make it a three-character-long word.

But it is great that you are looking into it. Thanks!
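The byte count behind that guess can be checked locally. This is a small Python 3 sketch; it only shows how 好 is encoded and says nothing about what the Google API actually does with it.

```python
from urllib.parse import quote

word = "好"
encoded = word.encode("utf-8")  # raw UTF-8 bytes of the character

char_count = len(word)     # number of Unicode code points
byte_count = len(encoded)  # number of UTF-8 bytes
quoted = quote(word)       # percent-encoded form, as sent in the request body

print(char_count, byte_count, quoted)
```

The character is one code point but three UTF-8 bytes, and its percent-encoded form matches the `text=%E5%A5%BD` value in the request above.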

NijntjePluis avatar Feb 11 '16 10:02 NijntjePluis