chromium-compact-language-detector
Unable to handle utf-8 characters that python can handle?
I'm trying to use cld2 on some scraped web data, and I am running into some encoding issues. The text is scraped with Beautiful Soup into a unicode string, with the from-encoding specified to Beautiful Soup as utf-8. The HTML of the document declared that it was in utf-8. Below is an example of one of the strings, anonymized with some filler text.
When I try to encode or decode this text, python does not have any issues. When I try to run it through cld2, however, I get errors.
>>> test=u"\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n"
>>> test.encode('utf8')
'\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xc2\xa325 more filler.\nadditilnal filler.\n\nyet more\xc2\xa0still more\xc2\xa0filler.\n\n\xc2\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n'
>>> test.encode('utf8').decode('utf8')
u'\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n'
>>> cld2.detect(test)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 52: ordinal not in range(128)
>>> cld2.detect(test.encode('utf8'))
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
cld2.error: input contains invalid UTF-8 around byte 158 (of 278)
>>> test.encode('utf8')[158:168]
'\x03\n\t\t\t\t\t\t '
Am I not using this correctly? The characters appear to be legitimate, but cld2 is giving me a hard time.
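For what it's worth, the character at the reported byte offset is a C0 control character; a quick check with the standard unicodedata module shows its category:

```python
import unicodedata

ch = u"\x03"  # the character just before the reported byte offset
print(hex(ord(ch)), unicodedata.category(ch))
# -> 0x3 Cc  ("Other, control")
```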
Hi,
I agree with DataNeel; I've seen the code blow up on some of the Unicode control characters (https://en.wikipedia.org/wiki/C0_and_C1_control_codes).
E.g. the sample text: "is Able", which is made up of the following characters:
idx  char      ord  category
  0  i         105  Ll
  1  s         115  Ll
  2  (U+0085)  133  Cc
  3  (space)    32  Zs
  4  A          65  Lu
  5  b          98  Ll
  6  l         108  Ll
  7  e         101  Ll
The third column is the ordinal value of the char, and the final column is the Unicode category as given by unicodedata.category(char).
Where the funky character after "is" is the python char u"\u0085" (http://www.fileformat.info/info/unicode/char/85/index.htm) - a valid Unicode character.
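The table above can be reproduced with the standard unicodedata module (a minimal sketch):

```python
import unicodedata

# Dump index, char, ordinal, and Unicode category for each character
for i, ch in enumerate(u"is\u0085 Able"):
    print(i, repr(ch), ord(ch), unicodedata.category(ch))
```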
Passing this to the latest version of the language detector yields the error:
error: input contains invalid UTF-8 around byte 4 (of 9)
I'm also experiencing this. Any workarounds?
This problem is also happening with me. Has any progress been made?
Ditto - any updates?
Hello, Any news?
Nope. I just updated my code to ignore those errors – it was only .05% of my data :)
My workaround for this error: when this happens I just clean my HTML with something like:
printable_str = ''.join(x for x in html_str if x in string.printable)
then re-run detect on the cleaned string.
It's fine for me since it only happens rarely.
Thanks!
@lcalem Just a note: string.printable only contains ASCII printable characters. When dealing with multiple languages, that can be a major limitation (e.g. it will remove all Chinese characters from a string in Chinese).
In Python 3, it's possible to use the isprintable() string method like this:
printable_str = ''.join(x for x in html_str if x in x.isprintable())
Just a minor correction: the in is not needed in the if statement. It should be:
printable_str = ''.join(x for x in html_str if x.isprintable())
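A quick comparison of the two filters on a made-up sample: str.isprintable() keeps non-ASCII letters and drops only unprintable characters, while the string.printable filter also drops everything outside ASCII:

```python
import string

html_str = u"你好\x85 world"  # hypothetical sample: Chinese text plus a C1 control char

ascii_only = "".join(x for x in html_str if x in string.printable)
printable = "".join(x for x in html_str if x.isprintable())

print(repr(ascii_only))  # -> ' world'      (the Chinese characters are gone too)
print(repr(printable))   # -> '你好 world'  (only the control character is gone)
```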
A better workaround is:
text = ''.join([l for l in text if unicodedata.category(unicode(l))[0] not in ('S', 'M', 'C')])
which omits only the undesired Unicode characters; see http://www.fileformat.info/info/unicode/category/index.htm
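On Python 3 the unicode() builtin is gone and strings are already Unicode, so the same idea can be sketched as (the function name here is made up):

```python
import unicodedata

def strip_categories(text, prefixes=("S", "M", "C")):
    # Drop characters whose major Unicode category is Symbol, Mark, or Other
    return "".join(ch for ch in text if unicodedata.category(ch)[0] not in prefixes)

print(strip_categories(u"is\u0085 Able"))  # -> is Able (the NEL control char is removed)
```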
It's actually only the Cc and Cs Unicode categories that throw this error as far as I can tell. Using the regex module to remove them as suggested here should do the trick.
import regex

RE_BAD_CHARS = regex.compile(r"\p{Cc}|\p{Cs}")

def remove_bad_chars(text):
    return RE_BAD_CHARS.sub("", text)

remove_bad_chars("A\x96 bad char")  # Cc category
# 'A bad char'
I brute-forced each unicode character through polyglot
on py38 (ref https://github.com/aboSamoor/polyglot/issues/71#issuecomment-707997790):
Brute-force script
import sys
import unicodedata
from collections import defaultdict
unicode_characters_per_category = defaultdict(list)
for c in map(chr, range(sys.maxunicode + 1)):
    unicode_characters_per_category[unicodedata.category(c)].append(c)
all_categories = [
"Cc", # Control 65
"Cf", # Format 161
"Co", # Private Use 0
"Cs", # Surrogate 0
"Ll", # Lowercase Letter 2,151
"Lm", # Modifier Letter 259
"Lo", # Other Letter 121,414
"Lt", # Titlecase Letter 31
"Lu", # Uppercase Letter 1,788
"Mc", # Spacing Mark 429
"Me", # Enclosing Mark 13
"Mn", # Nonspacing Mark 1,826
"Nd", # Decimal Number 630
"Nl", # Letter Number 236
"No", # Other Number 888
"Pc", # Connector Punctuation 10
"Pd", # Dash Punctuation 24
"Pe", # Close Punctuation 73
"Pf", # Final Punctuation 10
"Pi", # Initial Punctuation 12
"Po", # Other Punctuation 588
"Ps", # Open Punctuation 75
"Sc", # Currency Symbol 62
"Sk", # Modifier Symbol 121
"Sm", # Math Symbol 948
"So", # Other Symbol 6,160
"Zl", # Line Separator 1
"Zp", # Paragraph Separator 1
"Zs", # Space Separator 17
]
from polyglot.text import Text
error_cats = set()
for cat in all_categories:
    for char in unicode_characters_per_category[cat]:
        try:
            Text(char).words
        except:
            error_cats.add(cat)

# all categories that errored
print(error_cats)