spacy-cld
error: input contains invalid UTF-8
Hi, I am adding spacy-cld as a component of a spaCy pipeline, and I get the following error with the string 'Conanza':
lib/python3.6/site-packages/spacy_cld/spacy_cld.py", line 20, in detect_languages
    _, _, results = detect(text.text)
pycld2.error: input contains invalid UTF-8 around byte 3 (of 8)
The word prints as 'Conanza', but encoding it reveals a hidden \x7f byte: print('Conanza'.encode(encoding='utf-8')) gives b'Con\x7fanza'
Thanks
@lizisepul It's hard to tell from your comment, but is the original string proper UTF-8? If not, CLD won't be able to handle it, and I'd recommend filtering out those characters before feeding the text to this package.
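As a rough illustration of the filtering suggested above, here is a minimal sketch that removes control characters such as \x7f before the text reaches the detector. The helper name `strip_control_chars` is hypothetical and not part of spacy-cld or pycld2, and treating Unicode category Cc as the set of characters to drop is an assumption; the exact set of bytes CLD rejects may differ.

```python
import unicodedata


def strip_control_chars(text: str) -> str:
    """Hypothetical helper: drop control characters (Unicode category Cc).

    Tab, newline and carriage return are kept because they are ordinary
    whitespace in most text. Which characters CLD actually rejects is an
    assumption here, not something taken from the pycld2 documentation.
    """
    keep = {"\t", "\n", "\r"}
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


# The string from the report, with the hidden DEL character written out.
cleaned = strip_control_chars("Con\x7fanza")
print(cleaned.encode("utf-8"))  # b'Conanza'
```

The cleaned string would then be passed to the pipeline in place of the raw input.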
One problem is that the PyPI package still suffers from this issue: https://github.com/nickdavidhaynes/spacy-cld/issues/1 Can the package receive an update, or can you make a pre-release on GitHub? If the spaCy pipeline doesn't crash on Unicode errors, it's easier to handle these cases.
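Until a fixed release is available, one possible workaround on the caller's side is to catch the detector's error and treat the text as "language unknown". The sketch below is only an assumption about how that could look: `detect_or_none` is a hypothetical helper, and the detector and its exception type are passed in as parameters so the example runs without pycld2 installed.

```python
from typing import Any, Callable, Optional, Type


def detect_or_none(
    text: str,
    detect: Callable[[str], Any],
    error_type: Type[BaseException],
) -> Optional[Any]:
    """Hypothetical wrapper: return the detector's result, or None on failure.

    Only `error_type` is caught, so unrelated bugs still surface.
    """
    try:
        return detect(text)
    except error_type:
        return None


# With pycld2 installed, the call might look like this (not executed here):
#   import pycld2
#   result = detect_or_none(doc.text, pycld2.detect, pycld2.error)
```

Returning None rather than raising lets the rest of the pipeline continue and leaves the decision about such documents to the caller.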