dmd
dmd copied to clipboard
Fix issue 22849: Global buffer overflow on lexer dereferencing
Signed-off-by: Luís Ferreira [email protected]
Related to https://github.com/dlang/dmd/pull/13765 . This is already being tested, but not triggered, since the CI test suite is not being run with a sanitiser.
Thanks for your pull request, @ljmf00!
Bugzilla references
| Auto-close | Bugzilla | Severity | Description |
|---|---|---|---|
| ✓ | 22849 | major | Global buffer overflow on lexer, dereferencing 4 bytes at the same time |
Testing this PR locally
If you don't have a local development environment setup, you can use Digger to test this PR:
dub run digger -- build "stable + dmd#13766"
ASAN gives me this when running dmd-unittest build rule. It seems the padding is not working correctly. I can investigate how padding is being done.
What test are you feeding the lexer?
From memory, files read into memory are given a 4 byte padding at the end ("\0\0\0\0") to prevent this from happening.
Ok, I can confirm, this error is when someone instantiates the lexer with crafted buffer, e.g. when using DMD as a library or using the lexer somewhere else, instead of using the FileBuffer. In the current approach, in the worst-case scenario, if the buffer is big and reallocating can't be linear, it will require moving some memory pages with a new allocation. This approach also requires the buffer to have that padding.
Is reallocating the buffer faster than verifying the pointer offset when hitting whitespace? I can't tell with a good level of certainty, but I do believe so, although there is another set of problems regarding passing, e.g., a stack buffer. Reallocating a stack buffer is undefined behaviour and possibly run into other security problems. This will obviously depend on how reallocation is implemented.
We can:
- Enforce the use of a buffer with padding (not a good approach, imho)
- Enforce the use of a mallocated/mem.malloc() buffer and check for possibility of buffer reallocation on instantiation
- Do this check and remove padding (probably not performant, it depends)
I guess this should be fixed, at least to mitigate the overflow.
I can study a better and faster way of lexing whitespace and possibly include tabs. I guess a good performance comparison will help for us to understand the best approach. I tried to search for this and can't find meaningful information.
I guess this should be fixed, at least to mitigate the overflow. I can study a better and faster way of lexing whitespace and possibly include tabs. I guess a good performance comparison will help for us to understand the best approach. I tried to search for this and can't find meaningful information.
Before doing any of this wrt performance we need a proper test suite to measure the lexers performance.
Previous work on fast lexing to be copied: Clang, simdjson.
I guess this should be fixed, at least to mitigate the overflow. I can study a better and faster way of lexing whitespace and possibly include tabs. I guess a good performance comparison will help for us to understand the best approach. I tried to search for this and can't find meaningful information.
Before doing any of this wrt performance we need a proper test suite to measure the lexers performance.
Previous work on fast lexing to be copied: Clang, simdjson.
Right. I think this can now be fixed temporarily. If you prefer I can add a comment explaining this.
FYI, in GCC they allocate a buffer with size + 16 bytes for the final \n and 15 bytes of padding used to quiet valgrind and AddressSanitizer warnings, as the optimized lexer path reads data in 16 byte aligned memory chunks.
There's also https://github.com/dlang/dmd/pull/13361 which fixes this issue from the other end: it ensures input buffers end with "\0\0\0\0"
This makes sense to me. Lexing + parsing is extremely fast in dmd and when compared to semantic analysis, the time is negligible. Now I know that "tons of small crabs eventually sink the ship", but having such hacks will probably hunt us down the path.
I guess this should be fixed, at least to mitigate the overflow.
I can study a better and faster way of lexing whitespace and possibly include tabs. I guess a good performance comparison will help for us to understand the best approach. I tried to search for this and can't find meaningful information.
@ljmf00 still working on this? Open to discuss things to better understand where these unpadded buffers come from, maybe there should be an interface which accepts foreign strings and copies into a padded buffer?
@ljmf00 still working on this? Open to discuss things to better understand where these unpadded buffers come from, maybe there should be an interface which accepts foreign strings and copies into a padded buffer?
Maybe that would be better but its fuzzy to afirm it. I think introducing this branching can hurt a bit performance, but what if reallocation costs more? A well predicted branch costs very few cycles but it will depend on how hot this branch is.
Maybe that would be better but its fuzzy to afirm it. I think introducing this branching can hurt a bit performance, but what if reallocation costs more? A well predicted branch costs very few cycles but it will depend on how hot this branch is.
I think I've already pointed it out here, but GCC considers skipping whitespace hot enough that there's inline asm implementations that do this. GCC lexer requires input buffers end with final \n and 15 bytes of padding - they align the source pointer and skip multiple forms of whitespace (not just spaces) 16 bytes at a time.
I think I've already pointed it out here, but GCC considers skipping whitespace hot enough that there's inline asm implementations that do this. GCC lexer requires input buffers end with final
\nand 15 bytes of padding - they align the source pointer and skip multiple forms of whitespace (not just spaces) 16 bytes at a time.
Ditto hashing of identifiers. We use murmur2 to hash identifiers - which is somewhere down the middle for speed vs collisions.
GCC goes even simpler than that: https://gcc.gnu.org/legacy-ml/gcc-patches/2001-08/msg01021.html