flex yy_scan_bytes signed/unsigned conversion causes yyalloc failure on large strings

yy_scan_bytes signed/unsigned conversion causes yyalloc failure on large strings

Open asokolsky opened this issue 4 years ago • 6 comments

The generated lexer code converts a size_t to an int in yy_scan_string:

YY_BUFFER_STATE yy_scan_string (const char * yystr , yyscan_t yyscanner)
{
   return yy_scan_bytes( yystr, (int) strlen(yystr) , yyscanner);
}

in yy_scan_bytes the int is converted back to a size_t:

YY_BUFFER_STATE yy_scan_bytes  (const char * yybytes, int  _yybytes_len , yyscan_t yyscanner)
{
        YY_BUFFER_STATE b;
        char *buf;
        yy_size_t n;
        int i;

        /* Get memory for full buffer, including space for trailing EOB's. */
        n = (yy_size_t) (_yybytes_len + 2);
        buf = (char *) yyalloc( n , yyscanner );
        if ( ! buf )
                YY_FATAL_ERROR( "out of dynamic memory in yy_scan_bytes()" );

Here is a test:

    size_t s = 2702888907;
    cout << "original string size = " << s << endl;
    int yybytes_len = (int)s;
    cout << "original string size convert to int = " << yybytes_len << endl;
    size_t n = (size_t) (yybytes_len + 2);
    cout << "+2, and back to size_ t= " << n << endl;

and corresponding output:

original string size = 2702888907
original string size convert to int = -1592078389
+2, and back to size_ t= 18446744072117473229

It seems a 2^31 byte buffer is the largest buffer yy_scan_bytes can read.

May 08 '20 17:05 asokolsky

The real issue isn't about signed/unsigned conversion. The issue is about the lack of overflow checking in yy_scan_bytes. yy_scan_bytes should reject the length that is (INT_MAX - 2) or greater. Note the YY_BUFFER_STATE object can store the buffer size of up to INT_MAX (by API design), and the conversion between int and yy_size_t shouldn't be a problem for that reason.

Perhaps we should deprecate yy_scan_bytes - the function is inherently unsafe against large inputs. It does not have bound checking. If a user app uses it to process untrusted inputs then it would draw security issues.

May 11 '20 10:05 Explorer09

Correction: The deprecation is not necessary for yy_scan_bytes - but note that the function doesn't work well with large strings. And use it on trusted inputs only.

May 11 '20 11:05 Explorer09

the YY_BUFFER_STATE object can store the buffer size of up to INT_MAX (by API design), and the conversion between int and yy_size_t shouldn't be a problem for that reason.

I understand your point.

Indeed conversion from unsigned, to signed and then back to unsigned should NOT be a problem.

However, what you have is a conversion from unsigned to signed, then +2, then back to unsigned. As a result, for an input of 2702888907, you have and output of 18446744072117473229. To fix this bug you need to convert signed to unsigned BEFORE adding 2.

May 11 '20 16:05 asokolsky

@asokolsky You can't have an input of 2702888907 when the input data type is limited to int. INT_MAX is 2147483647 for most Unix systems.

May 12 '20 01:05 Explorer09

You have a point. Any chance you can replace yy_scan_bytes to use size_t for _yybytes_len?

May 12 '20 20:05 asokolsky

@asokolsky That would break the API, and that's the point I was talking about. Sure it could be implemented as an optional feature to change the size data type of YY_BUFFER_STATE to yy_size_t, but that would break the API documented in the manual, and there is no practical use case that needs a buffer that's gigabytes large yet.

May 13 '20 08:05 Explorer09

flex flex copied to clipboard

yy_scan_bytes signed/unsigned conversion causes yyalloc failure on large strings

flex
flex copied to clipboard