calcurse
UTF8 Decoding Buffer Overflow Issues
Currently the Unicode decoding routines are only well behaved for well-formed code points. Ill-formed code points can cause functions like utf8_decode in utf8.c to access bytes beyond the end of the string, and routines like status_ask_choice in utils.c can also read past the end of the string, since they step through the string based on the leading code unit of each code point. In addition, the Unicode routines assume that char is unsigned, which is not true on many platforms without the explicit use of -funsigned-char. My suggestion is to change the decoding routine in four ways:
- The decoding routine should not assume the code points are well formed.
- The decoding routine should progress through the string for the caller.
- The decoding routine should return the replacement character (0xFFFD) instead of -1.
- The decoding routine should not assume char is unsigned.
Here is an example implementation of a decoding routine that meets these requirements (not tested):
static const char lentbl[32] = {
	/* 0XXXXXXX */ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
	/* 10XXXXXX */ 0, 0, 0, 0, 0, 0, 0, 0,
	/* 110XXXXX */ 2, 2, 2, 2,
	/* 1110XXXX */ 3, 3,
	/* 11110XXX */ 4,
	/* 11111XXX */ 0,
};

long decode(const char *s, const char **end)
{
	unsigned char c = *s;
	char n = lentbl[(c >> 3) & 31];
	long v = n > 0 ? c & ((1 << (8 - n)) - 1) : 0xFFFD;

	while (n-- > 1) {
		v <<= 6;
		c = *++s;
		if ((c & 0xC0) != 0x80) {
			/* Missing continuation byte: ill-formed sequence. */
			v = 0xFFFD;
			break;
		}
		v |= c & 0x3F;
	}
	/* Advance past the last byte consumed, unless it is the NUL terminator. */
	if (c != '\0')
		++s;
	if (end)
		*end = s;
	return v;
}