
error: input contains invalid UTF-8

Open lizisepul opened this issue 5 years ago • 2 comments

Hi, I am adding spacy-cld as a component of a spaCy pipeline, and I get the following error with the string 'Conanza'.

lib/python3.6/site-packages/spacy_cld/spacy_cld.py", line 20, in detect_languages
    _, _, results = detect(text.text)
pycld2.error: input contains invalid UTF-8 around byte 3 (of 8)

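The word looks like the above when you print it, but encoding it shows a hidden \x7f (DEL) byte between 'Con' and 'anza', which is the byte the error points to:

print('Conanza'.encode(encoding='utf-8'))
b'Con\x7fanza'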

Thanks

lizisepul avatar Sep 27 '18 16:09 lizisepul

@lizisepul it's hard to tell from your comment, but is the original string proper UTF-8? If not, CLD won't be able to handle it, and I'd recommend filtering out those characters before feeding the text to this package.
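One way to do that filtering, as a minimal sketch: it assumes the DEL control character (\x7f) is what CLD rejects here, as the byte offset in the error suggests. The helper name `strip_control_chars` is hypothetical and not part of spacy-cld or pycld2.

```python
import unicodedata


def strip_control_chars(text: str) -> str:
    """Drop Unicode control characters (category "Cc"), keeping common whitespace.

    Hypothetical helper for illustration; not part of spacy-cld or pycld2.
    """
    keep = {"\n", "\r", "\t"}
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


# The string from the report, with the hidden DEL character written explicitly.
cleaned = strip_control_chars("Con\x7fanza")
print(cleaned.encode("utf-8"))  # b'Conanza'
```

The cleaned string can then be passed to the pipeline in place of the raw input. Other character classes may also trip CLD, so this covers only the control-character case seen in this report.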

nickdavidhaynes avatar Oct 04 '18 21:10 nickdavidhaynes

One problem is that the PyPI package still suffers from this issue: https://github.com/nickdavidhaynes/spacy-cld/issues/1. Can the package receive an update? Or can you make a pre-release on GitHub? If the spaCy pipeline doesn't crash on Unicode errors, it's easier to handle these cases.
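Until then, calling code that uses the detector directly could guard against the crash itself. A rough sketch of that idea follows; `detect_or_none` is a hypothetical wrapper, and the detector and exception types are passed in so the snippet does not depend on any particular library being installed.

```python
from typing import Any, Callable, Optional, Tuple, Type


def detect_or_none(
    detect: Callable[[str], Any],
    text: str,
    errors: Tuple[Type[BaseException], ...] = (Exception,),
) -> Optional[Any]:
    """Call detect(text) and return None instead of raising on the given errors.

    Hypothetical wrapper for illustration; not part of spacy-cld or pycld2.
    """
    try:
        return detect(text)
    except errors:
        return None
```

With pycld2 installed, this would be called as `detect_or_none(pycld2.detect, text, (pycld2.error,))`, so a string like the one in this issue yields None rather than stopping the run.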

fako avatar Oct 23 '18 07:10 fako