
Encoding issues / Umlaut is not decoded correctly

Open TomRauchenwald38 opened this issue 6 years ago • 9 comments

I have trouble decoding the QR code from this PDF (on page 27). It seems the umlaut in the last line is not decoded correctly. Screenshot from the live demo: (image)

The last line should read ..."für Gartenarbeit und Entsorgung"...

I can decode the QR code just fine in Java using ZXing. If I set the CHARACTER_SET decoding hint to "ISO-8859-1", the decoded result is exactly the same as pictured in the screenshot, so I suspect that ISO-8859-1 is assumed somewhere in InstaScan.
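
A minimal sketch of the suspected failure, assuming the QR payload is UTF-8 and is being read as one ISO-8859-1 character per byte. The variable names are illustrative and nothing here is taken from InstaScan itself.

```javascript
// Encode the expected text as UTF-8 bytes (assumed to be what the QR code contains).
const utf8Bytes = new TextEncoder().encode("für Gartenarbeit");

// Misreading: treat every byte as one ISO-8859-1 character.
// String.fromCharCode maps byte values 0-255 directly to U+0000-U+00FF.
const misread = String.fromCharCode(...utf8Bytes);

// Correct reading: decode the same bytes as UTF-8.
const correct = new TextDecoder("utf-8").decode(utf8Bytes);

console.log(misread); // fÃ¼r Gartenarbeit
console.log(correct); // für Gartenarbeit
```

The two-byte UTF-8 sequence for "ü" (0xC3 0xBC) shows up as two separate characters when each byte is read on its own, which matches the kind of output described above.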

Here's the QR Code I used for easier copy/pasting: qr_sample_1

Is there a way to specify the encoding to use, or is this a bug?

TomRauchenwald38 avatar Jan 02 '18 09:01 TomRauchenwald38

In PHP, use utf8_decode. This converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1.

dieperie avatar Jan 18 '18 12:01 dieperie

In JavaScript, the following does the same:

var decoded_content = self.utf8_decode(content);
self.scans.unshift({ date: +(Date.now()), content: decoded_content });

utf8_decode: function (str_data) {
	// Converts a string with ISO-8859-1 characters encoded with UTF-8 to single-byte ISO-8859-1
	// Declare every byte variable locally (c3 included) to avoid implicit globals
	var string = "", i = 0, c = 0, c2 = 0, c3 = 0;

	while ( i < str_data.length ) {
		c = str_data.charCodeAt(i);
		if (c < 128) {
			string += String.fromCharCode(c);
			i++;
		} else if((c > 191) && (c < 224)) {
			c2 = str_data.charCodeAt(i+1);
			string += String.fromCharCode(((c & 31) << 6) | (c2 & 63));
			i += 2;
		} else {
			c2 = str_data.charCodeAt(i+1);
			c3 = str_data.charCodeAt(i+2);
			string += String.fromCharCode(((c & 15) << 12) | ((c2 & 63) << 6) | (c3 & 63));
			i += 3;
		}
	}
	return string;
}
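
As a hedged alternative to the hand-written loop, the same reinterpretation can be sketched with the standard TextDecoder API. The helper name decodeUtf8FromByteString is hypothetical and not part of InstaScan. It assumes the input string holds one byte value (0-255) per character.

```javascript
// Hypothetical helper (not part of InstaScan): reinterpret a string whose
// characters each hold one byte value (0-255) as UTF-8 text.
function decodeUtf8FromByteString(byteString) {
  // Copy each character code into a byte array, masking to 8 bits.
  const bytes = Uint8Array.from(byteString, (ch) => ch.charCodeAt(0) & 0xff);
  // Let the platform decoder handle 1- to 4-byte UTF-8 sequences.
  return new TextDecoder("utf-8").decode(bytes);
}

// Two-byte sequence 0xC3 0xBC decodes to "ü".
console.log(decodeUtf8FromByteString("f\u00C3\u00BCr"));
```

Unlike the loop above, which treats every non-ASCII byte outside 192-223 as the start of a three-byte sequence, the platform decoder also covers four-byte sequences and replaces malformed input with U+FFFD.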

dieperie avatar Jan 18 '18 20:01 dieperie

Having the same issue. Cyrillics are decoded into gibberish:

Данный купон сгенерирован

yamnikov-oleg avatar Feb 02 '18 11:02 yamnikov-oleg

Having the same issue with Korean.

fariskas avatar Sep 06 '19 06:09 fariskas

Having the same issue. Cyrillics are decoded into gibberish:

�анн�й к�пон �гене�и�ован

The problem is in this piece of code: https://github.com/schmich/instascan/blob/b0f9519f2dd2a6661e67066d6ed678e621dd5ce2/src/scanner.js#L101 but I haven't figured out how to fix it yet.

alekciy avatar May 09 '20 13:05 alekciy

@alekciy Thank you for the tip, I have added utf8 decoder in that line and it worked.

yamnikov-oleg avatar May 09 '20 18:05 yamnikov-oleg

Though this might not get merged. In case somebody needs this fix, you can clone the repo, apply the fix yourself and rebuild the package with:

npm install
./node_modules/.bin/gulp release

The instascan.min.js file will appear in the dist directory.

yamnikov-oleg avatar May 09 '20 18:05 yamnikov-oleg

@alekciy Thank you for the tip, I have added utf8 decoder in that line and it worked.

And what about cp1251? For example, payment slips per GOST R 56042-2014, format ST00011. Ideally, an encoding detector would be added.

alekciy avatar May 10 '20 04:05 alekciy

@alekciy I don't think there is a reliable way to detect text encoding, especially with the CP (code page) encodings. It would probably be better to add an encoding parameter to the Scanner class.
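
A rough sketch of what a caller-supplied encoding could look like, assuming the raw payload bytes are available. The function decodePayload and its encoding parameter are hypothetical and do not exist in InstaScan. In Node.js, labels other than UTF-8 need a build with full ICU.

```javascript
// Hypothetical helper, not an InstaScan API: decode raw QR payload bytes
// using a caller-supplied encoding label, defaulting to UTF-8.
function decodePayload(bytes, encoding = "utf-8") {
  // fatal: true makes the decoder throw on invalid input instead of emitting U+FFFD.
  return new TextDecoder(encoding, { fatal: true }).decode(bytes);
}

// "Привет" in windows-1251: one byte per Cyrillic letter.
const cp1251Bytes = new Uint8Array([0xcf, 0xf0, 0xe8, 0xe2, 0xe5, 0xf2]);
console.log(decodePayload(cp1251Bytes, "windows-1251")); // Привет
```

With fatal decoding, passing the same bytes with the default UTF-8 label throws instead of silently producing replacement characters, so a wrong encoding choice is at least visible to the caller.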

yamnikov-oleg avatar May 10 '20 07:05 yamnikov-oleg