zbar Wrong text encoding assumption

This image contains the text "Il était une fois, un noël radieiux et un gros test. Manchmal sind wir über freundlich."

but ZBar returns "Il 矇tait une fois, un no禱l radieiux et un gros test. Manchmal sind wir 羹ber freundlich.".

Jan 07 '22 01:01 hongquan

I was in this section of the code hacking on issue #237. I can see two things:

A false positive in text_is_big5().
The order of encodings in enc_list[] puts UTF-8 at the end, and somehow your string works OK as SJIS.

Disabling text_is_big5 and reordering enc_list to put utf8_cd at the front yields a version that gives your desired result. Feel free to do that on your local machine. This needs attention from someone who knows more about character encoding than me, before any code changes get pushed to zbar master.
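To illustrate why the order of candidates matters, here is a minimal Python sketch of a "first codec that decodes cleanly wins" loop. It is not ZBar's code: the function name `decode_first_match` and the `CANDIDATES` tuple are hypothetical, and the real enc_list[] in qrdectxt.c may contain different entries.

```python
# Hypothetical candidate order; it mirrors the idea only, not ZBar's enc_list[].
CANDIDATES = ("utf-8", "shift_jis", "big5", "iso-8859-1")


def decode_first_match(raw, candidates=CANDIDATES):
    """Return (text, codec) for the first codec that decodes raw without error."""
    for codec in candidates:
        try:
            return raw.decode(codec), codec
        except UnicodeDecodeError:
            continue
    # iso-8859-1 accepts every byte, so this is only reached with a custom list.
    raise ValueError("no candidate codec could decode the payload")


payload = "Il était une fois, un noël".encode("utf-8")

# With UTF-8 tried first, the text comes back intact.
print(decode_first_match(payload))

# With Big5 tried first, the same bytes also decode without error, but wrongly.
print(decode_first_match(payload, ("big5", "utf-8")))
```

Strict UTF-8 decoding rejects most byte strings that were not written as UTF-8, which is the argument for trying it early; permissive legacy codecs accept far more input and so are more likely to produce a false match.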

Nov 25 '22 01:11 ldoolitt

It looks like the same bug was already reported several years ago in https://sourceforge.net/p/zbar/bugs/73/ for the special case of accented characters.

I came across this bug after checking a QR code created from a vCard that used the German character ß twice. The QR code had been created this way:

cat "my_Vcard_with_GPG_fingerprint.vcf" | qrencode -s 3 -v 10 -o q.png

The created QR code was OK (according to my mobile phone's app Barcode Scanner version 4.7.8 and other code scanner apps).

zbarimg q.png

however displayed the German letter ß as the characters ﾃ歹. The OP already showed the bug for some French accented vowels and ligatures and for the German lower-case umlaut ü. I would not be surprised if many or even all country-specific characters, e.g.

the Danish Ø, the bolle-å from various Scandinavian languages,
characters with diaeresis or trema, macron, accent grave, accent aigu, cedille,
guillemets (French quotation marks),
Czech haceks, Polish ogoneks,
long vowels in the Hungarian language,
Spanish ñ, ¿ and ¡ for Spanish and Portuguese

all go wrong.

For some strange reason, zbarimg seems to produce non-UTF-8 output for anything other than pure ASCII characters. zbarimg should output UTF-8 by default, or at least offer an option to do so.
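The suspicion about other accented characters can be checked outside ZBar. The sketch below takes a hand-picked sample list (my own choice, not from the report), encodes each character as UTF-8, and decodes the bytes with Big5 and Shift_JIS, the two legacy encodings mentioned earlier in this thread. Each of these characters becomes a two-byte UTF-8 sequence whose first byte lies in 0xC2..0xC5, a range that both legacy codecs also assign a meaning to. The helper name `misread` is hypothetical.

```python
# Hand-picked samples covering several of the languages listed above.
SAMPLES = ["ß", "ü", "é", "Ø", "å", "ñ", "ł", "ő", "«"]


def misread(text, wrong_codec):
    """Encode text as UTF-8, then decode the bytes with the wrong codec.

    Undecodable bytes are replaced with U+FFFD instead of raising.
    """
    return text.encode("utf-8").decode(wrong_codec, errors="replace")


for ch in SAMPLES:
    raw = ch.encode("utf-8")
    # Print the character, its UTF-8 bytes, and both misreadings.
    print(ch, raw.hex(), repr(misread(ch, "big5")), repr(misread(ch, "shift_jis")))
```

None of the samples survives either misreading, which is consistent with the guess that the problem is not limited to ß and ü.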

Mar 24 '23 14:03 melolontha-melolontha

i just ran into this as well, trying to verify QR codes i had created myself. they contain PGP signed meta data, and the non-UTF-8 decoding of umlaut characters now invalidates these signatures. a barcode reader app on my smartphone correctly decodes the QR codes, these signatures are valid as expected.

i've noticed that this issue also affects the GUI QtQR, as it relies on the python library.

i'm not sure autodetection of encodings can be done reliably at all. at least, UTF-8 should probably be the default, and there should be an encoding parameter to manually set the desired encoding (e.g., anything from iconv -l) to override autodetection in case it fails.
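A rough sketch of what such an interface could look like, written in Python rather than against ZBar's API: the function `decode_payload` and its `encoding` parameter are hypothetical, and Python codec names are used here, which only partly overlap with the names printed by iconv -l.

```python
import codecs
from typing import Optional


def decode_payload(raw: bytes, encoding: Optional[str] = None) -> str:
    """Decode raw as UTF-8 by default, or with an explicitly requested codec.

    No guessing happens: a wrong or unknown encoding raises instead of
    silently returning mangled text.
    """
    if encoding is not None:
        codecs.lookup(encoding)  # raises LookupError for an unknown codec name
        return raw.decode(encoding)
    return raw.decode("utf-8")


print(decode_payload("Zürich".encode("utf-8")))
print(decode_payload("Zürich".encode("iso-8859-1"), encoding="iso-8859-1"))
```

Failing loudly matters for the signature use case: a decode error is easier to notice than text that looks plausible but no longer matches what was signed.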

Mar 24 '23 15:03 unDocUMeantIt

I'm glad someone is looking at this. Y'all are probably the "someone who knows more about character encoding than me" mentioned above :-) If you actually try to modify the code in qrdectxt.c to fix encoding bugs, I humbly suggest you start with the version I made that's hanging out in MR #241. That copy fixes one easy-ish bug, and is much better formatted for maintenance.

This brings up a key point: are the project owners still around, so someone can actually accept merge requests into master?

Mar 24 '23 15:03 ldoolitt

This is (probably) the same issue in GNOME Decoder. Summary:

Setting ZBAR_CFG_BINARY seems to be a (bad?) workaround. (At least by setting the "binary" checkbox in zbarcam_qt.)
Looks like the underlying encoding problem (Stack Overflow) is pretty ugly. (Some guessing is required.)

This Python session reproduces the mistake:

In [5]: 'Zürich'.encode('UTF8').decode('BIG5')
Out[5]: 'Z羹rich'
In [6]: 'Il était une fois, un noël'.encode('UTF8').decode('BIG5')
Out[6]: 'Il 矇tait une fois, un no禱l'

So, the issue seems to be that it prefers Big5 over UTF-8. (I haven't understood the logic in qrdectxt.c yet.) I'm not sure I like that assumption, but as per the link above, it's possible that it's the correct order in some places of the world. (Certainly not in Zürich, though.)

Apr 03 '23 16:04 martinxyz

> Setting ZBAR_CFG_BINARY seems to be a (bad?) workaround. (At least by setting the "binary" checkbox in zbarcam_qt.)

It is a good workaround. I implemented the binary decoding option by bypassing the built-in character encoding conversion. It just returns the data as-is so it can be decoded separately.

Care must be taken to decode every QR code individually though. Otherwise, you won't be able to tell where each QR code begins or ends.
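The point about boundaries can be shown with a small Python sketch. It assumes you already have one `bytes` object per decoded symbol from binary mode (how you obtain them is not shown here), and the per-symbol encodings are made up for the example.

```python
# Assumed input: one raw payload per QR code, each paired with the encoding
# the caller knows (or has decided) it uses. Both entries are invented.
payloads = [
    ("Zürich".encode("utf-8"), "utf-8"),
    ("Genève".encode("iso-8859-1"), "iso-8859-1"),
]

# Decoding each symbol on its own works, because each keeps its own codec.
texts = [raw.decode(codec) for raw, codec in payloads]
print(texts)

# Once the payloads are concatenated, the boundary is gone and no single
# codec fits the whole byte string.
joined = b"".join(raw for raw, _ in payloads)
try:
    joined.decode("utf-8")
except UnicodeDecodeError as exc:
    print("cannot decode the concatenation as UTF-8:", exc.reason)
```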

Sep 28 '23 08:09 matheusmoreira

My concern is that it may be worse for the case where the QR code actually has an encoding set. In that case the text can always be converted correctly, provided the library does the conversion itself.
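When a symbol declares its encoding through an ECI designator, no guessing is needed. The Python sketch below shows the idea only. The function `decode_with_eci` is hypothetical, and the small table lists the ECI assignments as I understand them (3 for ISO-8859-1, 20 for Shift_JIS, 26 for UTF-8, 28 for Big5), so please check them against the specification before relying on them. It says nothing about how ZBar exposes ECI information.

```python
from typing import Optional

# Partial ECI designator table; the values are my understanding of the
# assignments and should be verified against the specification.
ECI_TO_CODEC = {
    3: "iso-8859-1",
    20: "shift_jis",
    26: "utf-8",
    28: "big5",
}


def decode_with_eci(raw: bytes, eci: Optional[int]) -> Optional[str]:
    """Decode raw using the declared ECI, or return None if it is absent/unknown.

    Returning None tells the caller to fall back to its own policy
    (default encoding, detection, or handing back the raw bytes).
    """
    codec = ECI_TO_CODEC.get(eci)
    if codec is None:
        return None
    return raw.decode(codec)


print(decode_with_eci("Zürich".encode("utf-8"), 26))
print(decode_with_eci(b"no declared encoding", None))
```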

Sep 28 '23 10:09 martinxyz
