zbar
Wrong text encoding assumption
This image contains the text "Il était une fois, un noël radieiux et un gros test. Manchmal sind wir über freundlich."
but ZBar returns "Il 矇tait une fois, un no禱l radieiux et un gros test. Manchmal sind wir 羹ber freundlich.".
I was in this section of the code hacking on issue #237. I can see two things:
- False-positive in text_is_big5()
- The order of encodings in enc_list[] puts UTF-8 at the end, and somehow your string decodes without error as SJIS.
Disabling text_is_big5 and reordering enc_list to put utf8_cd at the front yields a version that gives your desired result. Feel free to do that on your local machine. This needs attention from someone who knows more about character encoding than me, before any code changes get pushed to zbar master.
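To illustrate why the order of candidates matters, here is a minimal Python sketch. It is not the logic in qrdectxt.c (which uses iconv plus heuristics such as text_is_big5()); the helper name decode_first_valid is made up for this example. It only shows that a "first encoding that decodes without error wins" strategy gives different results depending on where UTF-8 sits in the list.

```python
def decode_first_valid(data, encodings):
    """Return (encoding, text) for the first candidate that decodes cleanly.

    Hypothetical helper for illustration only; zbar's real detection
    is more involved than a plain try-in-order loop.
    """
    for name in encodings:
        try:
            return name, data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding could decode the data")


payload = "Zürich".encode("utf-8")

# Big5 tried first: the UTF-8 bytes happen to be valid Big5, so it wins.
print(decode_first_valid(payload, ["big5", "utf-8"]))

# UTF-8 tried first: the text comes back as intended.
print(decode_first_valid(payload, ["utf-8", "big5"]))
```

The trade-off is that strict UTF-8 validation rarely accepts text that was really written in a legacy multi-byte encoding, whereas the reverse is common, which is one argument for trying UTF-8 early.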
It looks like the same bug was already reported several years ago in https://sourceforge.net/p/zbar/bugs/73/ for the special case of accented characters.
I came across this bug after checking a QR code created from a vCard which used the German character ß twice. The QR code had been created this way:
cat "my_Vcard_with_GPG_fingerprint.vcf" | qrencode -s 3 -v 10 -o q.png
The created QR code was OK (according to my mobile phone's app BarcodeScanner version 4.7.8 and other code scanner apps).
zbarimg q.png, however, displayed the German letter ß as the CJK characters テ歹. The OP already showed the bug for some French accented vowels and ligatures and for the German lower-case umlaut ü. I would not be surprised if many or even all country-specific characters, e.g.
- the Danish Ø, the bolle-å from various Scandinavian languages,
- characters with diaeresis or trema, macron, accent grave, accent aigu, cedille,
- guillemets (French quotation marks),
- Czech haceks, Polish ogoneks,
- long vowels in the Hungarian language,
- Spanish ñ, ¿ and ¡ for Spanish and Portuguese
all go wrong.
For some strange reason, zbarimg seems to use non-UTF-8 output for characters other than pure ASCII characters. zbarimg should output UTF-8 by default, or at least should be given an option to do so.
i just ran into this as well, trying to verify QR codes i had created myself. they contain PGP signed meta data, and the non-UTF-8 decoding of umlaut characters now invalidates these signatures. a barcode reader app on my smartphone correctly decodes the QR codes, these signatures are valid as expected.
i've noticed that this issue also affects the GUI QtQR, as it relies on the python library.
i'm not sure autodetection of encodings can be done reliably at all. at least, UTF-8 should probably be the default, and there should be an encoding parameter to manually set the desired encoding (e.g., anything from iconv -l) to override autodetection in case it fails.
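A rough Python sketch of that proposal, for discussion only: UTF-8 is tried first by default, and an explicit encoding argument skips any guessing. The function name decode_payload, its parameters, and the Latin-1 fallback are assumptions made up for this example, not an existing zbar option or API.

```python
def decode_payload(data, encoding=None, fallbacks=("utf-8", "latin-1")):
    """Decode raw barcode bytes to text.

    encoding:  if given, it is used as-is and no guessing happens
               (the proposed manual override).
    fallbacks: tried in order otherwise; UTF-8 first, then Latin-1 as an
               assumed last resort (Latin-1 accepts any byte sequence).
    """
    if encoding is not None:
        return data.decode(encoding)
    for name in fallbacks:
        try:
            return data.decode(name)
        except UnicodeDecodeError:
            continue
    raise ValueError("none of the fallback encodings could decode the data")


# Default: valid UTF-8 is returned unchanged.
print(decode_payload("über".encode("utf-8")))

# Override: the caller knows the data is Big5 and says so.
print(decode_payload("Z羹rich".encode("big5"), encoding="big5"))
```

With an override available, a wrong guess stops being fatal: the caller can always state the encoding explicitly when autodetection fails.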
I'm glad someone is looking at this. Y'all are probably the "someone who knows more about character encoding than me" mentioned above :-)
If you actually try to modify the code in qrdectxt.c to fix encoding bugs, I humbly suggest you start with the version I made that's hanging out in MR #241. That copy fixes one easy-ish bug, and is much better formatted for maintenance.
This brings up a key point: are the project owners still around, so someone can actually accept merge requests into master?
This is (probably) the same issue in gnome Decoder. Summary:
- Setting ZBAR_CFG_BINARY seems to be a (bad?) workaround. (At least by setting the "binary" checkbox in zbarcam_qt.)
- Looks like the underlying encoding problem (Stackoverflow) is pretty ugly. (Some guessing is required.)
This Python session reproduces the mistake:
In [5]: 'Zürich'.encode('UTF8').decode('BIG5')
Out[5]: 'Z羹rich'
In [6]: 'Il était une fois, un noël'.encode('UTF8').decode('BIG5')
Out[6]: 'Il 矇tait une fois, un no禱l'
So the issue seems to be that it prefers Big5 over UTF-8. (I haven't understood the logic in qrdectxt.c yet.) I'm not sure I like that assumption, but as per the link above, it's possible that this is the correct order in some parts of the world. (Certainly not in Zürich, though.)
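If you already have output that was mangled this way, the mistake can usually be reversed by re-encoding the text as Big5 and decoding the bytes as UTF-8. A small sketch, assuming the mojibake really came from a UTF-8 to Big5 mix-up and that every character round-trips through Python's big5 codec (the helper name repair_big5_mojibake is made up here):

```python
def repair_big5_mojibake(text):
    """Undo a UTF-8 payload that was wrongly decoded as Big5.

    Raises UnicodeError if the text does not round-trip, which suggests
    a different encoding mix-up than the one assumed here.
    """
    return text.encode("big5").decode("utf-8")


print(repair_big5_mojibake("Z羹rich"))
```

This is only a stopgap for existing output; it does not help when the wrong guess was a different encoding such as SJIS.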
> - Setting ZBAR_CFG_BINARY seems to be a (bad?) workaround. (At least by setting the "binary" checkbox in zbarcam_qt.)
It is a good workaround. I implemented the binary decoding option by bypassing the built-in character encoding conversion. It just returns the data as-is so it can be decoded separately.
Care must be taken to decode every QR code individually though. Otherwise, you won't be able to tell where each QR code begins or ends.
My concern is that it may be worse for the case where the QR code actually has an encoding set. In that case the library could always convert the data to text correctly, provided it does the conversion itself.
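For anyone who wants to try the binary workaround from a script, here is a hedged Python sketch that scans one image at a time and decodes the raw bytes itself. The zbarimg flags used (--raw, --oneshot, -Sbinary) are my assumption of how the binary option is exposed on the command line; check zbarimg --help on your build. The helper names are made up for this example.

```python
import subprocess


def zbarimg_binary_command(image_path):
    """Build the zbarimg call (flags are an assumption, see note above)."""
    return ["zbarimg", "--raw", "--oneshot", "-Sbinary", image_path]


def read_single_code(image_path, encoding="utf-8"):
    """Scan one image and decode its payload outside of zbar.

    One image per call, so the payload boundaries are unambiguous.
    """
    result = subprocess.run(
        zbarimg_binary_command(image_path),
        capture_output=True,
        check=True,  # raise CalledProcessError if zbarimg fails
    )
    # Strict decode: raises UnicodeDecodeError instead of guessing.
    return result.stdout.decode(encoding)


if __name__ == "__main__":
    print(read_single_code("q.png"))
```

Decoding strictly on the caller's side means a wrong encoding shows up as an error rather than as silently mangled text.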