Feature: detect mixups between two single-byte encodings
There is apparently a fair amount of Spanish text out there that went through a mix-up between Windows-1252 and MacRoman before being encoded in UTF-8.
Because Latin-1 for Windows-1252 is the only single-byte mixup we detect, we assume that's what happened, and get text that looks like: "PrevŽn diputados inaugurar periodo de sesiones con c—digo penal".
This is not a false positive, because the encoding is in fact incorrect (it's actually got the UTF-8 encoding of the wrong characters in it), and ftfy is trying to fix it. It's in fact using the same fix that any web browser would use. However, the resulting text makes no sense, because it's not the correct fix.
This mixup is apparently common enough that it would be worth fixing as another special case.
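For illustration, the example above can be undone by hand by re-encoding the text as Windows-1252 and decoding it as MacRoman. This is only a sketch using Python's standard codecs (`windows-1252` and `mac_roman`); it is not how ftfy would detect the mix-up, just a demonstration of what the correct fix looks like once you know which encodings were involved:

```python
# The garbled headline, with the two odd characters written as escapes:
# U+017D is the Z with caron, U+2014 is the long dash.
garbled = "Prev\u017dn diputados inaugurar periodo de sesiones con c\u2014digo penal"

# Undo the wrong decoding: recover the original bytes via Windows-1252,
# then read those bytes as MacRoman, which is what they really were.
fixed = garbled.encode("windows-1252").decode("mac_roman")
print(fixed)
```

Both odd characters map back to accented vowels, which is why the MacRoman reading is the one that makes sense for Spanish text.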
Is this the same issue or a new one?
>>> s = u'Radio centralï¿½ï¿½ï¿½Hazme un Instrumento de tu pazï¿½ï¿½ï¿½ï¿½91.9.FM La radio de paz.ï¿½,'
>>> print ftfy.fix_text_segment(s)
Radio central���Hazme un Instrumento de tu paz����91.9.FM La radio de paz.�,
Source: http://184.107.166.66:8114/status.xsl
That one's not an issue. Beneath the mojibake, that's exactly what the text says.
ï¿½ in Windows-1252 is 0xEF 0xBF 0xBD, the UTF-8 encoding of �, aka U+FFFD REPLACEMENT CHARACTER. Whatever actual Unicode the string was supposed to contain has already been lost.
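The byte claim is easy to check with the standard codecs. A minimal sketch (the variable names are just for illustration):

```python
replacement = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER

# Its UTF-8 encoding is the three bytes EF BF BD.
utf8_bytes = replacement.encode("utf-8")
print(utf8_bytes)

# Reading those three bytes as Windows-1252 gives the familiar
# three-character mojibake sequence.
mojibake = utf8_bytes.decode("windows-1252")
print(mojibake)
```

Decoding the mojibake correctly only gets you back to U+FFFD, so there is nothing further to recover.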
There are several open issues that are really the same thing. I'm merging them all into this issue.
@rspeer Cool! Let me know whether you'd like me to keep posting examples as I find them. I want to help but I don't want to spam :)
The examples are helpful! I can use them as test cases.
Related: the mixup of "v3/43/4r" (ASCII-printed high-byte characters) coming from "v¾¾r" (CP850) coming from "vóór" (Windows-1252). See http://stackoverflow.com/questions/17654898/which-encoding-failure-did-encode-v%C3%B3%C3%B3r-into-v3-43-4r
Man. That's an unfortunate mix-up. But it's not one ftfy should fix, because pure ASCII is not something to be messed with.
I should, however, look into "the infamous CP850" and whether ftfy should consider it as a possibility, so that for example it could decode UTF-8 re-interpreted as CP850.
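To make the chain concrete, here is a small sketch of how "vóór" could end up as "v3/43/4r". The first step uses Python's standard `windows-1252` and `cp850` codecs. The second step is an assumption: some ASCII-flattening pass that spells the fraction character out as "3/4", modelled here with a plain `str.replace`:

```python
original = "v\u00f3\u00f3r"  # "vóór"

# Step 1: bytes written as Windows-1252 are read back as CP850,
# so each 0xF3 byte turns into the three-quarters sign.
as_cp850 = original.encode("windows-1252").decode("cp850")
print(as_cp850)

# Step 2 (assumed): something flattens the text to ASCII and spells
# the three-quarters sign out as "3/4".
flattened = as_cp850.replace("\u00be", "3/4")
print(flattened)
```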
What about this:
a = '''LiĂ¨ge Avenue de l'HĂ´pital''' # french sentence
print(ftfy.fix_text(a.decode('utf8')))
# Out: LiĂ¨ge Avenue de l'HĂ´pital, no change from input, where it should be: Liège Avenue de l'Hôpital
Does this fit into this issue? I could not find any way to correct this (using ftfy or any other method).
It's been encoded in UTF-8 and decoded in Windows-1250. Here's the code that specifically fixes it (written in a way that should work in Python 2 or 3):
>>> text = u"LiĂ¨ge Avenue de l'HĂ´pital"
>>> print(text.encode('windows-1250').decode('utf-8'))
Liège Avenue de l'Hôpital
So this is within the scope of ftfy, it's just not a possibility that it currently checks for. I'm aware that Windows-1250 is used somewhat frequently in Eastern Europe, and it's probably a bias in my data collection that I haven't seen many examples of it.
I will open a new issue for this.
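In the meantime, a brute-force way to find out which single-byte encoding was involved is to try several and keep the ones that round-trip to valid UTF-8. This is a rough sketch, not what ftfy does internally; `candidate_fixes` and its default encoding list are made up for this example:

```python
def candidate_fixes(text, encodings=("latin-1", "windows-1252", "windows-1250",
                                     "cp850", "mac_roman")):
    """Return {encoding: repaired_text} for each encoding that round-trips.

    Hypothetical helper: it re-encodes the text in each single-byte
    encoding and keeps the results that decode cleanly as UTF-8.
    """
    results = {}
    for enc in encodings:
        try:
            fixed = text.encode(enc).decode("utf-8")
        except (UnicodeEncodeError, UnicodeDecodeError):
            continue  # this encoding cannot explain the text
        if fixed != text:
            results[enc] = fixed
    return results


# The mojibake from the example above, written with escapes.
garbled = "Li\u0102\u00a8ge Avenue de l'H\u0102\u00b4pital"
print(candidate_fixes(garbled))
```

For this input only Windows-1250 survives, because the A-with-breve character does not exist in the other four encodings. On real data several encodings can succeed at once, so some heuristic would still be needed to pick among them.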
If we have something like this, there's no problem:
>>> print(ftfy.fix_text('uÌˆnicode'))
ünicode
But if we mix encodings, like this:
>>> print(ftfy.fix_text('Hi to ℙℽ☂ℌϕℿ uÌˆnicode'))
Hi to ℙℽ☂ℌϕℿ uÌˆnicode
Expected: Hi to ℙℽ☂ℌϕℿ ünicode
Why is this happening? Is this something that this library cannot handle?
ftfy makes kind of arbitrary decisions about how to handle mixed encodings: it allows the encoding to change at line breaks, and it also decodes the most common mojibake sequences like â€¢ even when they're inconsistent with the surrounding line.
Encoding a combining umlaut as Ìˆ isn't common enough to fall into that second case.
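Here is a sketch of where those two sequences come from, and of why the line as a whole can't simply be re-encoded. It uses only the standard library; the targeted repair at the end is a hypothetical workaround for this one sequence, not something ftfy does:

```python
import unicodedata


def mojibake(s):
    """Encode as UTF-8, then misread the bytes as Windows-1252."""
    return s.encode("utf-8").decode("windows-1252")


bullet = mojibake("\u2022")  # the common three-character bullet sequence
umlaut = mojibake("\u0308")  # the rarer combining-umlaut sequence

# A line mixing real Unicode with one mojibake sequence.
line = "Hi to \u2119\u213d\u2602 u" + umlaut + "nicode"

# The whole line can't be re-encoded as Windows-1252, because the
# correctly decoded symbols are not in that encoding.
try:
    line.encode("windows-1252")
    whole_line_ok = True
except UnicodeEncodeError:
    whole_line_ok = False

# Hypothetical targeted repair: replace just the known sequence,
# then compose "u" + combining umlaut into a single character.
repaired = unicodedata.normalize("NFC", line.replace(umlaut, "\u0308"))
print(repaired)
```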
Do you want me to make a separate issue out of https://github.com/LuminosoInsight/python-ftfy/issues/18#issuecomment-126046344 and point back to https://github.com/LuminosoInsight/python-ftfy/issues/18 ?
@rspeer I found an offending document for "v3/43/4r" again, on the Wayback Machine: g428-1.pdf
Do you want me to make a separate issue out of it?
@jpluimers I can't tell what the Unicode issue you're linking is -- what's the mojibaked text, and can you tell which encodings were mixed up?
trying to follow this: do you have a place where it says "v¾¾r" and hasn't been flattened into "v3/43/4r"?
Didn't have time to dig for this earlier, as there were too many "5-minute" things that ended up taking loads more time.
I should have rescheduled, as https://www.google.com/search?q=%22v%C2%BE%C2%BEr%22 was indeed less than a "5 minute thing". A few of the results:
- https://commons.wikimedia.org/wiki/File:Overdracht_van_goederen_voor_Griekenland._Het_verhuisbedrijf_Beck_v%C2%BE%C2%BEr_het_stadhuis_jan._1951_GN15237.jpeg (in the title)
- https://disco.datawonen.nl/variable?versieid=WOON15V2&variablecode=ZelfdeHH (in one of the first headings)
- https://www.kroonint.nl/bakta-tabletten-6x250-stuks (in the body text)
Then came the hard part, trying to make sense of what ftfy can do (:
For all the above texts, https://ftfy.vercel.app/ wrongly comes up with something like this (note I pasted v¾¾r)
s = 'v¾¾r'
s = s.encode('latin-1')
s = s.decode('utf-8')
print(s)
You can verify the wrong handling using for instance https://ftfy.vercel.app/?s=v%C2%BE%C2%BEr
I verified this at https://www.python.org/shell/:
Python 3.9.5 (default, May 27 2021, 19:45:35)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import os
>>> os.system('sh')
$ pip3 install ftfy
Defaulting to user installation because normal site-packages is not writeable
Looking in links: /usr/share/pip-wheels
Collecting ftfy
Downloading ftfy-6.1.1-py3-none-any.whl (53 kB)
|████████████████████████████████| 53 kB 2.8 MB/s
Requirement already satisfied: wcwidth>=0.2.5 in /usr/local/lib/python3.9/site-packages (from ftfy) (0.2.5)
Installing collected packages: ftfy
Successfully installed ftfy-6.1.1
$ python
Python 3.9.5 (default, May 27 2021, 19:45:35)
[GCC 9.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import ftfy
>>> s = 'v¾¾r'
>>> s = s.encode('latin-1')
>>> s = s.decode('utf-8')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xbe in position 1: invalid start byte
>>> print(s)
b'v\xbe\xber'
>>>
On the console, it does not get recognised either in the same Python session:
>>> ftfy.fix_and_explain("v¾¾r")
ExplainedText(text='v¾¾r', explanation=[])
>>>
But it is indeed "the infamous CP850", as when I continue this in the same Python session:
>>> s = 'v¾¾r'
>>> s = s.encode('cp850')
>>> s = s.decode('latin-1')
>>> print (s)
vóór
>>>
It would be great if ftfy could search for 3/43/4, treat it like ¾¾, and recognise it as a potential CP850 problem, especially in Dutch texts.
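Roughly what I have in mind, as a sketch only: `unflatten_cp850` and the regex are made up for this example and are not part of ftfy. It only touches runs of "3/4" that sit directly between letters, so an ordinary fraction like "3/4 cup" is left alone:

```python
import re

# Runs of "3/4" glued between two ASCII letters, as in "v3/43/4r".
FLATTENED_RUN = re.compile(r"(?<=[A-Za-z])(?:3/4)+(?=[A-Za-z])")


def unflatten_cp850(text):
    """Hypothetical repair: treat each glued "3/4" as a three-quarters sign
    that came from a CP850 misreading, and undo that misreading."""
    def repair(match):
        count = len(match.group(0)) // 3  # each "3/4" is 3 characters
        signs = "\u00be" * count
        return signs.encode("cp850").decode("latin-1")
    return FLATTENED_RUN.sub(repair, text)


print(unflatten_cp850("v3/43/4r het stadhuis"))
print(unflatten_cp850("add 3/4 cup"))
```

This still rewrites pure ASCII, which is exactly the objection raised earlier, so it would have to be opt-in or limited to text known to be Dutch.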
Even for non-Dutch texts, ¾¾ is often an encoding problem. I browsed through the first 10 pages of results for https://www.google.com/search?q=%22%C2%BE%C2%BE%22 and most of them were encoding problems. One English example is https://steffenthomas.org/about/about-the-museum/grant-to-green/ where the ¾¾ seem to be bullet points. This might even be a missing font or melon emoji used as bullet point.
Hope this clears a few things up.