flex yyinput is broken in buffer-only scenarios (using yywrap to switch buffers)

yyinput() is broken when using yy_scan_buffer() (i.e. yyin == NULL) in conjunction with yywrap().

The problematic call sequence seems to be as follows: yy_scan_buffer() sets yy_fill_buffer to 0, but this flag is reset to 1 in yy_init_buffer(), which is called via yyrestart() (called in turn via yy_get_next_buffer()). With yy_fill_buffer equal to 1, yy_get_next_buffer() then invokes YY_INPUT (i.e. read(fileno(yyin))), which crashes because yyin is NULL.

One possible fix might be to set yy_fill_buffer to 1 in yy_init_buffer() only if yyin is not NULL. I am not sure, however, whether there are other issues to consider besides setting yy_fill_buffer.
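The suspected sequence and the proposed guard can be mimicked with a toy model. This is only a sketch: `toy_buffer` and the `toy_*` functions are hypothetical stand-ins for the roles that yy_scan_buffer() and yy_init_buffer() play in the description above. They are not the skeleton's actual code, and they say nothing about other side effects the real functions may have.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for the scanner's buffer state; only the two
 * fields discussed in the report are modelled. */
struct toy_buffer {
    FILE *input_file;  /* plays the role of yyin; NULL for memory-only input */
    int   fill_buffer; /* plays the role of yy_fill_buffer */
};

/* Like yy_scan_buffer() as described: memory-only input, refilling disabled. */
static void toy_scan_buffer(struct toy_buffer *b)
{
    assert(b != NULL);
    b->input_file = NULL;
    b->fill_buffer = 0;
}

/* Behaviour described in the report: re-initialisation always turns
 * refilling back on, even when there is no file to read from. */
static void toy_init_buffer_reported(struct toy_buffer *b, FILE *file)
{
    assert(b != NULL);
    b->input_file = file;
    b->fill_buffer = 1;
}

/* Proposed guard (an assumption, not the merged fix): only enable
 * refilling when there is a file to refill from. */
static void toy_init_buffer_guarded(struct toy_buffer *b, FILE *file)
{
    assert(b != NULL);
    b->input_file = file;
    b->fill_buffer = (file != NULL);
}

/* True when the model would attempt a read on a NULL file, which is the
 * situation the report associates with the crash. */
static int toy_would_read_null_file(const struct toy_buffer *b)
{
    assert(b != NULL);
    return b->fill_buffer && b->input_file == NULL;
}
```

In this model, calling `toy_scan_buffer()` followed by `toy_init_buffer_reported(&b, NULL)` leaves the buffer in the state where a read on a NULL file would be attempted, while `toy_init_buffer_guarded(&b, NULL)` does not.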

May 14 '22 19:05 esohns

Do you have a test case or a scanner I can use to build a test case for the regression tests? Should be an easy fix and those would really help make sure it sticks.

May 15 '22 16:05 Mightyjo

Sorry; in the meantime I have adapted my scanner to not use yyinput at all. I was using yyinput to skip over data bytes in the parsed buffer(s) for efficiency reasons; now I parse every single byte instead.

One example use case would be to match some characters and then repeatedly invoke yyinput inside of the rule block until yywrap gets invoked. This should trigger the crash; not 100% sure about this though.

I still hope we can get this fixed so I can revert back to a version with better performance.

May 15 '22 19:05 esohns

I have cobbled together a scanner that should expose the issue. Note that it still contains some pseudo-code. Let me know if you need more details.

bug_yyinput.zip

May 15 '22 19:05 esohns

Can you check that this new test captures the behavior you reported? Just want to be sure I'm not fixing another bug altogether :)

May 16 '22 17:05 Mightyjo

Assuming my example matches your problem, this is more difficult to solve than I thought. I coded up your suggestion in yy_init_buffer but it causes early termination with the yywrap implementation in the test. (I need to fix the spacing. Take a look at src/c99-flex.skl, too.) No SEGV, at least!

You might have to let yylex return to main, set up the next buffer, then call yylex again. Testing that.

May 17 '22 00:05 Mightyjo

Ok, got the test to stress the execution path that was segfaulting several times without returning to main. The link in my previous comment points to it now.

If that's doing what you mean, the fix to the skeleton makes the failure graceful, but you still have to check for the end of the buffer in your scanner somewhere.

May 17 '22 01:05 Mightyjo

Yes, the test you wrote (first comment) exactly captures what I had in mind. As you can see, this use-case seems to be broken in current builds.

Regarding your other comments: I can see that you check the return value of yyinput() for 0 and then call yywrap manually in the rule block. I think that flex should catch the end of the buffer itself and invoke yywrap automatically in that case; I am pretty sure there is already code for this in the skeleton file, I am just unsure how to make it behave 'regularly' for yyinput. yylex should then only return when a rule block returns, when yywrap returns 1, or (I guess) when some other internal error occurs.

I have dug around the code some more and I found that yy_scan_buffer resets yy_is_our_buffer to 0. Perhaps a test for this flag (yy_is_our_buffer == 0) makes more sense than the test for yyin == NULL? For example, I see that yylex sets yyin to stdin by default if it isn't set, so a test for that later (in yy_init_buffer) would probably fail anyway.

Thank you for your effort. While your implementation looks like a feasible workaround, it is not yet exactly what I had hoped for.

May 17 '22 19:05 esohns

It works without calling yywrap in the user code, but when flex makes the call, it's followed by yyrestart and the rule has to be matched again, so you see ECHO called for each buffer. Either way tests the problem code path.

Testing yyin == NULL is the right thing to do. It only happens when user code bypasses the default initialization that would have set up stdin. yy_is_our_buffer is used by the scanner's memory manager and I don't want to overload its meaning.

The main thing I've fixed is the SEGV if you don't check the return value of yyinput. You'll just get a scanner failure now because it hits EOF unexpectedly.

May 17 '22 21:05 Mightyjo

DOH! I'm using %option prefix="test" and pushed code without yywrap renamed. Not really sure why that's working at all.

May 17 '22 21:05 Mightyjo

Okay, updated the test to use the automatic yywrap behavior. You still have to check whether you've gone past your last buffer somehow. I used the do-it-yourself example from the Multiple Input Buffers section of the manual. You could instead prepare all your buffers on the stack and have yywrap pop the stack to move to the next one.
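The buffer-switching idea can be sketched outside of flex with a toy model. Everything named `toy_*` here is hypothetical and only imitates the conventions mentioned in this thread: the wrap function returns 0 when another buffer has been set up and 1 when there is nothing left, and the input function returns 0 once the last buffer is exhausted. It is not how the generated scanner is implemented.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical queue of NUL-terminated in-memory buffers. */
struct toy_scanner {
    const char *const *buffers; /* buffers to scan, in order */
    size_t count;               /* number of buffers */
    size_t current;             /* index of the active buffer */
    size_t pos;                 /* read offset within the active buffer */
};

/* Follows the yywrap convention: return 0 after switching to the next
 * buffer, 1 when the last buffer has already been consumed. */
static int toy_wrap(struct toy_scanner *s)
{
    assert(s != NULL);
    if (s->current + 1 >= s->count)
        return 1;
    s->current++;
    s->pos = 0;
    return 0;
}

/* Returns the next character, crossing buffer boundaries via toy_wrap().
 * Returns 0 once every buffer is exhausted, so the caller still has to
 * check for that value. */
static int toy_input(struct toy_scanner *s)
{
    assert(s != NULL);
    if (s->count == 0)
        return 0;
    while (s->buffers[s->current][s->pos] == '\0') {
        if (toy_wrap(s))
            return 0;
    }
    return (unsigned char)s->buffers[s->current][s->pos++];
}
```

A caller that skips bytes with `toy_input()` in a loop would stop as soon as it sees 0, which corresponds to the "gone past your last buffer" check mentioned above.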

I badly needed coffee earlier. I can use yywrap and testwrap interchangeably inside the module produced from my .l file because, of course, flex inserted the #defines above my code. That would not necessarily be true with testwrap and main implemented in a separate .c file. I updated their definitions to be consistent with my options.

May 18 '22 00:05 Mightyjo
