
Unable to handle utf-8 characters that python can handle?

Open DataNeel opened this issue 9 years ago • 12 comments

I'm trying to use cld2 on some scraped web data, and I'm running into encoding issues. The text is scraped with Beautiful Soup into a unicode string, with the source encoding specified to Beautiful Soup as utf-8; the document's HTML also declared utf-8. Below is an example of one of the strings, anonymized with some filler text.

When I try to encode or decode this text, python does not have any issues. When I try to run it through cld2, however, I get errors.

>>> test=u"\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t    almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n"
>>> test.encode('utf8')
'\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xc2\xa325 more filler.\nadditilnal filler.\n\nyet more\xc2\xa0still more\xc2\xa0filler.\n\n\xc2\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t    almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n'
>>> test.encode('utf8').decode('utf8')
u'\n\n\n\n\n\n\n\n\n\n\n\n\n\nHi <<First Name>>\nthis is filler text \xa325 more filler.\nadditilnal filler.\n\nyet more\xa0still more\xa0filler.\n\n\xa0\n\n\n\n\nmore\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nfiller.\x03\n\t\t\t\t\t\t    almost there \n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\nthe end\n\n\n\n\n\n\n\n\n\n\n\n\n'
>>> cld2.detect(test)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xa3' in position 52: ordinal not in range(128)
>>> cld2.detect(test.encode('utf8'))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
cld2.error: input contains invalid UTF-8 around byte 158 (of 278)
>>> test.encode('utf8')[158:168]
'\x03\n\t\t\t\t\t\t  '

Am I not using this correctly? The characters appear to be legitimate, but cld2 is giving me a hard time.
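For reference, the byte cld2 flags (byte 158) is the \x03 (ETX) control character in the string above, which Python accepts but cld2 rejects. A minimal sketch of one possible workaround, stripping control characters before detection (the cleanup step is an assumption on my part, not part of cld2's API):

```python
import unicodedata

test = u"this is filler text \xa325 more filler.\x03 almost there"

# \x03 (ETX) is a C0 control character: valid in a Python string,
# but rejected by cld2's UTF-8 validation.
print(unicodedata.category(u"\x03"))  # 'Cc' (Other, control)

# Drop all 'Cc' characters before calling cld2.detect().
# Note this also drops \n and \t; replacing them with a space
# instead may be safer if word boundaries matter.
cleaned = u"".join(ch for ch in test if unicodedata.category(ch) != "Cc")
```
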

DataNeel avatar Aug 28 '15 17:08 DataNeel

Hi,

I agree with DataNeel; I've seen the code blow up on some of the Unicode control characters (https://en.wikipedia.org/wiki/C0_and_C1_control_codes).

E.g. the sample text: "is… Able", which is made up of the following characters:

0  i  105  Ll
1  s  115  Ll
2  …  133  Cc
3      32  Zs
4  A   65  Lu
5  b   98  Ll
6  l  108  Ll
7  e  101  Ll

The second number is the ordinal value of the char and the final column is the unicode category as given by unicodedata.category(char).

Where the funky character after "is" is the Python char u"\u0085" (http://www.fileformat.info/info/unicode/char/85/index.htm) - a character that is perfectly valid when encoded as UTF-8.

Passing this to the latest version of the language detector yields the error:

error: input contains invalid UTF-8 around byte 4 (of 9)
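For what it's worth, the character in question can be checked directly with the standard library, confirming it is a control character (category Cc) rather than malformed UTF-8:

```python
import unicodedata

ch = u"\u0085"  # NEL (Next Line), a C1 control code
print(hex(ord(ch)), unicodedata.category(ch))  # 0x85 Cc
```
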

smithsimonj avatar Dec 10 '15 14:12 smithsimonj

I'm also experiencing this. Any workarounds?

carlosdubus avatar Apr 12 '16 19:04 carlosdubus

This problem is also happening with me. Has any progress been made?

matheusportela avatar May 27 '16 14:05 matheusportela

Ditto - any updates?

ZacharyST avatar Dec 02 '16 18:12 ZacharyST

Hello, Any news?

motazsaad avatar Dec 03 '16 05:12 motazsaad

Nope. I just updated my code to ignore those errors – it was only .05% of my data :)


ZacharyST avatar Dec 04 '16 18:12 ZacharyST

My workaround for this error: when it happens I just clean my HTML with something like `printable_str = ''.join(x for x in html_str if x in string.printable)` and then re-run the detect on the result.

It's fine for me since it only happens rarely.

lcalem avatar Dec 05 '16 15:12 lcalem

Thanks!


ZacharyST avatar Dec 05 '16 16:12 ZacharyST

@lcalem Just a note: string.printable only contains ASCII printable characters. When dealing with multiple languages, that can be a major limitation (e.g. it will remove all Chinese characters from a string in Chinese).
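A quick sketch of that limitation (string.printable is ASCII-only, so membership testing silently drops all non-ASCII characters):

```python
import string

s = u"你好 hello"
# Only ASCII characters survive the filter; the Chinese characters vanish.
print(u"".join(x for x in s if x in string.printable))  # ' hello'
```
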

In Python 3, it's possible to use the isprintable() string method like this:

printable_str = ''.join(x for x in html_str if x in x.isprintable())

ales-t avatar Jan 05 '18 10:01 ales-t

Just a minor correction: `in` is not needed in the `if` clause. It should be:

printable_str = ''.join(x for x in html_str if x.isprintable())

andreoua avatar Oct 29 '18 13:10 andreoua

A better workaround is:

text = ''.join([l for l in text if unicodedata.category(unicode(l))[0] not in ('S', 'M', 'C')])

omitting only the undesired Unicode categories; see http://www.fileformat.info/info/unicode/category/index.htm. (Note that `unicode()` is Python 2; on Python 3 each `l` is already a `str`.)
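On Python 3 the same idea looks like this (dropping the S/M/C major categories is this thread's heuristic, not anything cld2 documents; S and M also remove legitimate symbols and combining marks):

```python
import unicodedata

def strip_categories(text, prefixes=("S", "M", "C")):
    """Drop characters whose Unicode major category starts with one of prefixes."""
    return "".join(ch for ch in text if unicodedata.category(ch)[0] not in prefixes)

# \x03 is Cc (Control) and the star is So (Other Symbol); both are removed.
print(strip_categories("fine\x03 text ★"))
```
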

gilko1981 avatar Nov 05 '18 14:11 gilko1981

It's actually only the Cc and Cs Unicode categories that throw this error, as far as I can tell. Using regex to remove them as suggested here should do the trick.

import regex

RE_BAD_CHARS = regex.compile(r"\p{Cc}|\p{Cs}")

def remove_bad_chars(text):
    return RE_BAD_CHARS.sub("", text)

remove_bad_chars("A\x96 bad char")  # Cc category
# 'A bad char'

I brute-forced each unicode character through polyglot on py38 (ref https://github.com/aboSamoor/polyglot/issues/71#issuecomment-707997790):

Brute-force script
import sys
import unicodedata
from collections import defaultdict

unicode_characters_per_category = defaultdict(list)
for c in map(chr, range(sys.maxunicode + 1)):
    unicode_characters_per_category[unicodedata.category(c)].append(c)

all_categories = [
    "Cc",  # Control 65
    "Cf",  # Format  161
    "Co",  # Private Use 0
    "Cs",  # Surrogate   0
    "Ll",  # Lowercase Letter    2,151
    "Lm",  # Modifier Letter 259
    "Lo",  # Other Letter    121,414
    "Lt",  # Titlecase Letter    31
    "Lu",  # Uppercase Letter    1,788
    "Mc",  # Spacing Mark    429
    "Me",  # Enclosing Mark  13
    "Mn",  # Nonspacing Mark 1,826
    "Nd",  # Decimal Number  630
    "Nl",  # Letter Number   236
    "No",  # Other Number    888
    "Pc",  # Connector Punctuation   10
    "Pd",  # Dash Punctuation    24
    "Pe",  # Close Punctuation   73
    "Pf",  # Final Punctuation   10
    "Pi",  # Initial Punctuation 12
    "Po",  # Other Punctuation   588
    "Ps",  # Open Punctuation    75
    "Sc",  # Currency Symbol 62
    "Sk",  # Modifier Symbol 121
    "Sm",  # Math Symbol 948
    "So",  # Other Symbol    6,160
    "Zl",  # Line Separator  1
    "Zp",  # Paragraph Separator 1
    "Zs",  # Space Separator 17
]

from polyglot.text import Text

error_cats = set()
for cat in all_categories:
    for char in unicode_characters_per_category[cat]:
        try:
            Text(char).words
        except:
            error_cats.add(cat)

# all categories that errored
print(error_cats)

ddelange avatar Oct 13 '20 20:10 ddelange