dmd icon indicating copy to clipboard operation
dmd copied to clipboard

Fix issue 22849: Global buffer overflow on lexer dereferencing

Open ljmf00 opened this issue 2 years ago • 8 comments

Signed-off-by: Luís Ferreira [email protected]


Related to https://github.com/dlang/dmd/pull/13765 . This is already being tested, but not triggered, since the CI test suite is not being run with a sanitiser.

ljmf00 avatar Mar 05 '22 17:03 ljmf00

Thanks for your pull request, @ljmf00!

Bugzilla references

Auto-close Bugzilla Severity Description
22849 major Global buffer overflow on lexer, dereferencing 4 bytes at the same time

Testing this PR locally

If you don't have a local development environment setup, you can use Digger to test this PR:

dub run digger -- build "stable + dmd#13766"

dlang-bot avatar Mar 05 '22 17:03 dlang-bot

ASAN gives me this when running dmd-unittest build rule. It seems the padding is not working correctly. I can investigate how padding is being done.

ljmf00 avatar Mar 06 '22 13:03 ljmf00

What test are you feeding the lexer?

From memory, files read into memory are given a 4 byte padding at the end ("\0\0\0\0") to prevent this from happening.

Ok, I can confirm, this error is when someone instantiates the lexer with crafted buffer, e.g. when using DMD as a library or using the lexer somewhere else, instead of using the FileBuffer. In the current approach, in the worst-case scenario, if the buffer is big and reallocating can't be linear, it will require moving some memory pages with a new allocation. This approach also requires the buffer to have that padding.

Is reallocating the buffer faster than verifying the pointer offset when hitting whitespace? I can't tell with a good level of certainty, but I do believe so, although there is another set of problems regarding passing, e.g., a stack buffer. Reallocating a stack buffer is undefined behaviour and possibly run into other security problems. This will obviously depend on how reallocation is implemented.

We can:

  1. Enforce the use of a buffer with padding (not a good approach, imho)
  2. Enforce the use of a mallocated/mem.malloc() buffer and check for possibility of buffer reallocation on instantiation
  3. Do this check and remove padding (probably not performant, it depends)

ljmf00 avatar Mar 06 '22 17:03 ljmf00

I guess this should be fixed, at least to mitigate the overflow.

I can study a better and faster way of lexing whitespace and possibly include tabs. I guess a good performance comparison will help for us to understand the best approach. I tried to search for this and can't find meaningful information.

ljmf00 avatar Mar 15 '22 16:03 ljmf00

I guess this should be fixed, at least to mitigate the overflow. I can study a better and faster way of lexing whitespace and possibly include tabs. I guess a good performance comparison will help for us to understand the best approach. I tried to search for this and can't find meaningful information.

Before doing any of this wrt performance we need a proper test suite to measure the lexers performance.

Previous work on fast lexing to be copied: Clang, simdjson.

maxhaton avatar Mar 15 '22 17:03 maxhaton

I guess this should be fixed, at least to mitigate the overflow. I can study a better and faster way of lexing whitespace and possibly include tabs. I guess a good performance comparison will help for us to understand the best approach. I tried to search for this and can't find meaningful information.

Before doing any of this wrt performance we need a proper test suite to measure the lexers performance.

Previous work on fast lexing to be copied: Clang, simdjson.

Right. I think this can now be fixed temporarily. If you prefer I can add a comment explaining this.

ljmf00 avatar Mar 15 '22 17:03 ljmf00

FYI, in GCC they allocate a buffer with size + 16 bytes for the final \n and 15 bytes of padding used to quiet valgrind and AddressSanitizer warnings, as the optimized lexer path reads data in 16 byte aligned memory chunks.

ibuclaw avatar Mar 17 '22 21:03 ibuclaw

There's also https://github.com/dlang/dmd/pull/13361 which fixes this issue from the other end: it ensures input buffers end with "\0\0\0\0"

dkorpel avatar Apr 01 '22 09:04 dkorpel

This makes sense to me. Lexing + parsing is extremely fast in dmd and when compared to semantic analysis, the time is negligible. Now I know that "tons of small crabs eventually sink the ship", but having such hacks will probably hunt us down the path.

I guess this should be fixed, at least to mitigate the overflow.

I can study a better and faster way of lexing whitespace and possibly include tabs. I guess a good performance comparison will help for us to understand the best approach. I tried to search for this and can't find meaningful information.

ljmf00 avatar Oct 11 '22 08:10 ljmf00

@ljmf00 still working on this? Open to discuss things to better understand where these unpadded buffers come from, maybe there should be an interface which accepts foreign strings and copies into a padded buffer?

ibuclaw avatar Jan 15 '23 15:01 ibuclaw

@ljmf00 still working on this? Open to discuss things to better understand where these unpadded buffers come from, maybe there should be an interface which accepts foreign strings and copies into a padded buffer?

Maybe that would be better but its fuzzy to afirm it. I think introducing this branching can hurt a bit performance, but what if reallocation costs more? A well predicted branch costs very few cycles but it will depend on how hot this branch is.

ljmf00 avatar Jan 26 '23 23:01 ljmf00

Maybe that would be better but its fuzzy to afirm it. I think introducing this branching can hurt a bit performance, but what if reallocation costs more? A well predicted branch costs very few cycles but it will depend on how hot this branch is.

I think I've already pointed it out here, but GCC considers skipping whitespace hot enough that there's inline asm implementations that do this. GCC lexer requires input buffers end with final \n and 15 bytes of padding - they align the source pointer and skip multiple forms of whitespace (not just spaces) 16 bytes at a time.

ibuclaw avatar Jan 26 '23 23:01 ibuclaw

I think I've already pointed it out here, but GCC considers skipping whitespace hot enough that there's inline asm implementations that do this. GCC lexer requires input buffers end with final \n and 15 bytes of padding - they align the source pointer and skip multiple forms of whitespace (not just spaces) 16 bytes at a time.

Ditto hashing of identifiers. We use murmur2 to hash identifiers - which is somewhere down the middle for speed vs collisions.

GCC goes even simpler than that: https://gcc.gnu.org/legacy-ml/gcc-patches/2001-08/msg01021.html

ibuclaw avatar Jan 27 '23 00:01 ibuclaw