Feature: detect mixups between two single-byte encodings

Open rspeer opened this issue 11 years ago • 16 comments

There is apparently a fair amount of Spanish text out there that contains a mix-up between Windows-1252 and MacRoman before being encoded in UTF-8.

Because mistaking Latin-1 for Windows-1252 is the only single-byte mix-up we detect, we assume that's what happened, and get text that looks like: "PrevŽn diputados inaugurar periodo de sesiones con c—digo penal".

This is not a false positive, because the encoding is in fact incorrect (it's actually got the UTF-8 encoding of the wrong characters in it), and ftfy is trying to fix it. It's in fact using the same fix that any web browser would use. However, the resulting text makes no sense, because it's not the correct fix.

This mixup is apparently common enough that it would be worth fixing as another special case.
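As a rough illustration of what such a special case would have to do (a sketch using Python's built-in codecs, not ftfy's actual implementation), the headline above can be repaired by re-encoding it as Windows-1252 and decoding the resulting bytes as MacRoman. The variable names are only for illustration:

```python
# Text whose bytes were MacRoman but were read as Windows-1252.
garbled = "PrevŽn diputados inaugurar periodo de sesiones con c—digo penal"

# Undo the wrong decoding to recover the original bytes...
raw_bytes = garbled.encode("windows-1252")

# ...then decode them with the codec that was actually intended.
fixed = raw_bytes.decode("mac_roman")

print(fixed)
```

The hard part for ftfy is not this two-step conversion but deciding when it is the right one to apply, since the same bytes also have a plausible (if nonsensical) Windows-1252 reading.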

rspeer avatar Jan 29 '14 22:01 rspeer

Is this the same issue or a new one?

>>> s = u'Radio centralï¿½ï¿½ï¿½Hazme un Instrumento de tu pazï¿½ï¿½ï¿½ï¿½91.9.FM La radio de paz.ï¿½,'
>>> print ftfy.fix_text_segment(s)
Radio central���Hazme un Instrumento de tu paz����91.9.FM La radio de paz.�,

Source: http://184.107.166.66:8114/status.xsl

martinblech avatar Sep 30 '14 20:09 martinblech

That one's not an issue. Beneath the mojibake, that's exactly what the text says.

ï¿½ in Windows-1252 is 0xEF 0xBF 0xBD, the UTF-8 encoding of �, aka U+FFFD REPLACEMENT CHARACTER. Whatever actual Unicode the string was supposed to contain has already been lost.
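A quick way to check this relationship with Python's standard codecs (a minimal sketch; the variable names are made up for the example): encoding U+FFFD as UTF-8 gives the three bytes 0xEF 0xBF 0xBD, and reading those bytes as Windows-1252 gives the three-character sequence ï¿½.

```python
replacement = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# The UTF-8 encoding of U+FFFD is three bytes long.
utf8_bytes = replacement.encode("utf-8")

# Reading those bytes as Windows-1252 yields three separate characters.
mojibake = utf8_bytes.decode("windows-1252")

# Reversing the mix-up only gets the replacement character back,
# not whatever text it originally stood in for.
restored = mojibake.encode("windows-1252").decode("utf-8")

print(utf8_bytes, mojibake, restored)
```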

rspeer avatar Sep 30 '14 20:09 rspeer

There are several open issues that are really the same thing. I'm merging them all into this issue.

rspeer avatar Oct 02 '14 16:10 rspeer

@rspeer Cool! Let me know whether you'd like me to keep posting examples as I find them. I want to help but I don't want to spam :)

martinblech avatar Oct 03 '14 22:10 martinblech

The examples are helpful! I can use them as test cases.

rspeer avatar Oct 12 '14 05:10 rspeer

Related: the mixup of "v3/43/4r" (ASCII-printed high-byte characters) coming from "v¾¾r" (CP850) coming from "vóór" (Windows-1252). See http://stackoverflow.com/questions/17654898/which-encoding-failure-did-encode-v%C3%B3%C3%B3r-into-v3-43-4r

jpluimers avatar Jul 29 '15 18:07 jpluimers

Man. That's an unfortunate mix-up. But it's not one ftfy should fix, because pure ASCII is not something to be messed with.

I should, however, look into "the infamous CP850" and whether ftfy should consider it as a possibility, so that for example it could decode UTF-8 re-interpreted as CP850.
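For reference, "UTF-8 re-interpreted as CP850" would look roughly like the following. This is only a sketch with Python's built-in `cp850` codec to show the round trip, not a statement about how ftfy would implement the check:

```python
original = "vóór"

# Simulate the mix-up: UTF-8 bytes displayed by software that assumes CP850.
mojibake = original.encode("utf-8").decode("cp850")

# The candidate fix is the inverse: recover the bytes via CP850,
# then decode them as UTF-8.
restored = mojibake.encode("cp850").decode("utf-8")

print(repr(mojibake), repr(restored))
```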

rspeer avatar Jul 29 '15 20:07 rspeer

What about this:

a = '''LiĂ¨ge Avenue de l'HĂ´pital'''  # French sentence
print(ftfy.fix_text(a.decode('utf8')))

# Out: LiĂ¨ge Avenue de l'HĂ´pital, no change from input, where it should be: Liège Avenue de l'Hôpital

Does this fit into this issue? I could not find any way to correct this (using ftfy or any other method).

lrq3000 avatar Feb 13 '17 16:02 lrq3000

It's been encoded in UTF-8 and decoded in Windows-1250. Here's the code that specifically fixes it (written in a way that should work in Python 2 or 3):

>>> text = u"LiĂ¨ge Avenue de l'HĂ´pital"
>>> print(text.encode('windows-1250').decode('utf-8'))
Liège Avenue de l'Hôpital

So this is within the scope of ftfy, it's just not a possibility that it currently checks for. I'm aware that Windows-1250 is used somewhat frequently in Eastern Europe, and it's probably a bias in my data collection that I haven't seen many examples of it.

I will open a new issue for this.

rspeer avatar Feb 13 '17 17:02 rspeer

If we have something like this, there is no problem:

>>> print(ftfy.fix_text('ünicode'))
ünicode

But if we mix encodings, like this:

>>> print(ftfy.fix_text('Hi to ℙℽ☂ℌϕℿ ünicode'))
Hi to ℙℽ☂ℌϕℿ ünicode

Expected: Hi to ℙℽ☂ℌϕℿ ünicode

Why is this happening? Is this something that this library cannot handle?

Veki2808 avatar Sep 21 '17 13:09 Veki2808

ftfy makes kind of arbitrary decisions about how to handle mixed encodings: it allows the encoding to change at line breaks, and it also decodes the most common mojibake sequences like â€¢ even when they're inconsistent with the surrounding line.

Encoding a combining umlaut as Ìˆ isn't common enough to fall into that second case.
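One way to see why mixed text is awkward, sketched with plain Python codecs rather than ftfy's own logic: the mojibake piece can be repaired on its own, but the line as a whole cannot be re-encoded as Windows-1252, because characters like ℙ have no Windows-1252 byte. The input below is an assumption reconstructed from this discussion (a plain u followed by Ìˆ):

```python
# Assumed input: 'u' followed by the Windows-1252 reading of the UTF-8
# bytes 0xCC 0x88, which encode U+0308 COMBINING DIAERESIS.
piece = "u\u00cc\u02c6nicode"

# On its own, the piece round-trips cleanly.
fixed_piece = piece.encode("windows-1252").decode("utf-8")

# The full line also contains characters that Windows-1252 cannot represent,
# so a whole-string encode step fails.
mixed = "Hi to \u2119\u213d\u2602\u210c\u03d5\u213f " + piece
try:
    mixed.encode("windows-1252")
    whole_string_encodable = True
except UnicodeEncodeError:
    whole_string_encodable = False

print(repr(fixed_piece), whole_string_encodable)
```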

rspeer avatar Sep 21 '17 15:09 rspeer

> Man. That's an unfortunate mix-up. But it's not one ftfy should fix, because pure ASCII is not something to be messed with.
>
> I should, however, look into "the infamous CP850" and whether ftfy should consider it as a possibility, so that for example it could decode UTF-8 re-interpreted as CP850.

Do you want me to make a separate issue out of https://github.com/LuminosoInsight/python-ftfy/issues/18#issuecomment-126046344 and point back to https://github.com/LuminosoInsight/python-ftfy/issues/18 ?

jpluimers avatar Jun 28 '21 12:06 jpluimers

@rspeer I tracked down an offending document for "v3/43/4r" in the Wayback Machine: g428-1.pdf

Do you want me to make a separate issue out of it?

jpluimers avatar Apr 24 '22 18:04 jpluimers

@jpluimers I can't tell what the Unicode issue you're linking is -- what's the mojibaked text, and can you tell which encodings were mixed up?

rspeer avatar Apr 25 '22 14:04 rspeer

trying to follow this: do you have a place where it says "v¾¾r" and hasn't been flattened into "v3/43/4r"?

rspeer avatar Apr 25 '22 14:04 rspeer

> trying to follow this: do you have a place where it says "v¾¾r" and hasn't been flattened into "v3/43/4r"?

Didn't have time to dig for this earlier, as there were too many "5-minute" things that ended up taking loads more time.

I should have rescheduled, as https://www.google.com/search?q=%22v%C2%BE%C2%BEr%22 was indeed less than a "5 minute thing". A few of the results:

  • https://commons.wikimedia.org/wiki/File:Overdracht_van_goederen_voor_Griekenland._Het_verhuisbedrijf_Beck_v%C2%BE%C2%BEr_het_stadhuis_jan._1951_GN15237.jpeg (in the title)
  • https://disco.datawonen.nl/variable?versieid=WOON15V2&variablecode=ZelfdeHH (in one of the first headings)
  • https://www.kroonint.nl/bakta-tabletten-6x250-stuks (in the body text)

Then came the hard part, trying to make sense of what ftfy can do (:

For all the above texts, https://ftfy.vercel.app/ wrongly comes up with something like this (note I pasted v¾¾r)

s = 'v¾¾r'
s = s.encode('latin-1')
s = s.decode('utf-8')
print(s)

You can verify the wrong handling using, for instance, https://ftfy.vercel.app/?s=v%C2%BE%C2%BEr

I verified this at https://www.python.org/shell/:

Python 3.9.5 (default, May 27 2021, 19:45:35) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.system('sh')
$ pip3 install ftfy
Defaulting to user installation because normal site-packages is not writeable
Looking in links: /usr/share/pip-wheels
Collecting ftfy
  Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
     |████████████████████████████████| 53 kB 2.8 MB/s 
Requirement already satisfied: wcwidth>=0.2.5 in /usr/local/lib/python3.9/site-packages (from ftfy) (0.2.5)
Installing collected packages: ftfy
Successfully installed ftfy-6.1.1
$ python
Python 3.9.5 (default, May 27 2021, 19:45:35) 
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ftfy
>>> s = 'v¾¾r'
>>> s = s.encode('latin-1')
>>> s = s.decode('utf-8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbe in position 1: invalid start byte
>>> print(s)
b'v\xbe\xber'
>>> 

On the console, it does not get recognised either in the same Python session:

>>> ftfy.fix_and_explain("v¾¾r")
ExplainedText(text='v¾¾r', explanation=[])
>>> 

But it is indeed "the infamous CP850", as when I continue this in the same Python session:

>>> s = 'v¾¾r'
>>> s = s.encode('cp850')
>>> s = s.decode('latin-1')
>>> print (s)
vóór
>>> 

It would help if ftfy could search for 3/43/4, treat it like ¾¾, and recognise it as a potential CP850 problem, especially in Dutch texts.
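To make the request concrete, here is a minimal sketch of that idea. `guess_cp850_origin` is a hypothetical helper written for this comment, not an ftfy function, and it deliberately only touches "3/4" runs that sit between two letters, since rewriting ASCII in general would be risky:

```python
import re

# One or more "3/4" groups with a letter directly before and after them.
FLATTENED = re.compile(r"(?<=[^\W\d_])(?:3/4)+(?=[^\W\d_])")

def guess_cp850_origin(text):
    """Hypothetical helper: undo '3/4' flattening, then the CP850 mix-up."""
    # Step 1: turn each flattened "3/4" back into the single character ¾.
    restored = FLATTENED.sub(lambda m: "¾" * (len(m.group(0)) // 3), text)
    try:
        # Step 2: recover the bytes as CP850 would have displayed them,
        # then read those bytes as Latin-1.
        return restored.encode("cp850").decode("latin-1")
    except UnicodeEncodeError:
        # The text has characters outside CP850, so this guess doesn't apply.
        return None

print(guess_cp850_origin("v3/43/4r"))
```

A real detector would also need some evidence that the result is more plausible than the input (for example a Dutch word list), which this sketch does not attempt.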

Even for non-Dutch texts, ¾¾ is often an encoding problem. I browsed through the first 10 pages of results for https://www.google.com/search?q=%22%C2%BE%C2%BE%22 and most of them were encoding problems. One English example is https://steffenthomas.org/about/about-the-museum/grant-to-green/ where the ¾¾ seem to be bullet points. This might even be a missing font or melon emoji used as bullet point.

Hope this clears a few things up.

jpluimers avatar Jul 05 '22 13:07 jpluimers