flex icon indicating copy to clipboard operation
flex copied to clipboard

yy_scan_bytes signed/unsigned conversion causes yyalloc failure on large strings

Open asokolsky opened this issue 4 years ago • 6 comments

The generated lexer code converts a size_t to an int in yy_scan_string:

YY_BUFFER_STATE yy_scan_string (const char * yystr , yyscan_t yyscanner)
{
   return yy_scan_bytes( yystr, (int) strlen(yystr) , yyscanner);
}

in yy_scan_bytes the int is converted back to a size_t:

YY_BUFFER_STATE yy_scan_bytes  (const char * yybytes, int  _yybytes_len , yyscan_t yyscanner)
{
        YY_BUFFER_STATE b;
        char *buf;
        yy_size_t n;
        int i;

        /* Get memory for full buffer, including space for trailing EOB's. */
        n = (yy_size_t) (_yybytes_len + 2);
        buf = (char *) yyalloc( n , yyscanner );
        if ( ! buf )
                YY_FATAL_ERROR( "out of dynamic memory in yy_scan_bytes()" );

Here is a test:

    size_t s = 2702888907;
    cout << "original string size = " << s << endl;
    int yybytes_len = (int)s;
    cout << "original string size convert to int = " << yybytes_len << endl;
    size_t n = (size_t) (yybytes_len + 2);
    cout << "+2, and back to size_ t= " << n << endl;

and corresponding output:

original string size = 2702888907
original string size convert to int = -1592078389
+2, and back to size_ t= 18446744072117473229

It seems a 2^31 byte buffer is the largest buffer yy_scan_bytes can read.

asokolsky avatar May 08 '20 17:05 asokolsky

The real issue isn't about signed/unsigned conversion. The issue is about the lack of overflow checking in yy_scan_bytes. yy_scan_bytes should reject the length that is (INT_MAX - 2) or greater. Note the YY_BUFFER_STATE object can store the buffer size of up to INT_MAX (by API design), and the conversion between int and yy_size_t shouldn't be a problem for that reason.

Perhaps we should deprecate yy_scan_bytes - the function is inherently unsafe against large inputs. It does not have bound checking. If a user app uses it to process untrusted inputs then it would draw security issues.

Explorer09 avatar May 11 '20 10:05 Explorer09

Correction: The deprecation is not necessary for yy_scan_bytes - but note that the function doesn't work well with large strings. And use it on trusted inputs only.

Explorer09 avatar May 11 '20 11:05 Explorer09

the YY_BUFFER_STATE object can store the buffer size of up to INT_MAX (by API design), and the conversion between int and yy_size_t shouldn't be a problem for that reason.

I understand your point.

Indeed conversion from unsigned, to signed and then back to unsigned should NOT be a problem.

However, what you have is a conversion from unsigned to signed, then +2, then back to unsigned. As a result, for an input of 2702888907, you have and output of 18446744072117473229. To fix this bug you need to convert signed to unsigned BEFORE adding 2.

asokolsky avatar May 11 '20 16:05 asokolsky

@asokolsky You can't have an input of 2702888907 when the input data type is limited to int. INT_MAX is 2147483647 for most Unix systems.

Explorer09 avatar May 12 '20 01:05 Explorer09

You have a point. Any chance you can replace yy_scan_bytes to use size_t for _yybytes_len?

asokolsky avatar May 12 '20 20:05 asokolsky

@asokolsky That would break the API, and that's the point I was talking about. Sure it could be implemented as an optional feature to change the size data type of YY_BUFFER_STATE to yy_size_t, but that would break the API documented in the manual, and there is no practical use case that needs a buffer that's gigabytes large yet.

Explorer09 avatar May 13 '20 08:05 Explorer09