flex
flex copied to clipboard
yy_scan_bytes signed/unsigned conversion causes yyalloc failure on large strings
The generated lexer code converts a size_t to an int in yy_scan_string:
YY_BUFFER_STATE yy_scan_string (const char * yystr , yyscan_t yyscanner)
{
return yy_scan_bytes( yystr, (int) strlen(yystr) , yyscanner);
}
in yy_scan_bytes the int is converted back to a size_t:
YY_BUFFER_STATE yy_scan_bytes (const char * yybytes, int _yybytes_len , yyscan_t yyscanner)
{
YY_BUFFER_STATE b;
char *buf;
yy_size_t n;
int i;
/* Get memory for full buffer, including space for trailing EOB's. */
n = (yy_size_t) (_yybytes_len + 2);
buf = (char *) yyalloc( n , yyscanner );
if ( ! buf )
YY_FATAL_ERROR( "out of dynamic memory in yy_scan_bytes()" );
Here is a test:
size_t s = 2702888907;
cout << "original string size = " << s << endl;
int yybytes_len = (int)s;
cout << "original string size convert to int = " << yybytes_len << endl;
size_t n = (size_t) (yybytes_len + 2);
cout << "+2, and back to size_ t= " << n << endl;
and corresponding output:
original string size = 2702888907
original string size convert to int = -1592078389
+2, and back to size_ t= 18446744072117473229
It seems a 2^31 byte buffer is the largest buffer yy_scan_bytes can read.
The real issue isn't about signed/unsigned conversion. The issue is about the lack of overflow checking in yy_scan_bytes. yy_scan_bytes should reject the length that is (INT_MAX - 2) or greater. Note the YY_BUFFER_STATE object can store the buffer size of up to INT_MAX (by API design), and the conversion between int
and yy_size_t
shouldn't be a problem for that reason.
Perhaps we should deprecate yy_scan_bytes - the function is inherently unsafe against large inputs. It does not have bound checking. If a user app uses it to process untrusted inputs then it would draw security issues.
Correction: The deprecation is not necessary for yy_scan_bytes - but note that the function doesn't work well with large strings. And use it on trusted inputs only.
the YY_BUFFER_STATE object can store the buffer size of up to INT_MAX (by API design), and the conversion between int and yy_size_t shouldn't be a problem for that reason.
I understand your point.
Indeed conversion from unsigned, to signed and then back to unsigned should NOT be a problem.
However, what you have is a conversion from unsigned to signed, then +2, then back to unsigned. As a result, for an input of 2702888907, you have and output of 18446744072117473229. To fix this bug you need to convert signed to unsigned BEFORE adding 2.
@asokolsky You can't have an input of 2702888907 when the input data type is limited to int
. INT_MAX is 2147483647 for most Unix systems.
You have a point. Any chance you can replace yy_scan_bytes to use size_t for _yybytes_len?
@asokolsky That would break the API, and that's the point I was talking about.
Sure it could be implemented as an optional feature to change the size
data type of YY_BUFFER_STATE
to yy_size_t
, but that would break the API documented in the manual, and there is no practical use case that needs a buffer that's gigabytes large yet.