libpostal
BUG: invalid characters send `tokenize_add_tokens` into an endless loop and OOM
Hi!
I was checking out libpostal and found an endless loop that keeps mallocing memory, which eventually leads to OOM. This can be a serious problem in a production environment (e.g. DoS).
Here's how I'm using libpostal
The source code:
...
unsigned char buf[2] = {0xff, 0x00}; // ====> invalid byte: values in 128-255 are not valid standalone UTF-8
libpostal_normalize_options_t options;
size_t num_expansions;
char **expansions;
options = libpostal_get_default_options();
expansions = libpostal_expand_address((char*)buf, options, &num_expansions); // ====> endless looping here, OOM
...
Here's what I got
Here is the function call trace. The loop in token_array_push never terminates while pushing entries to the vector, which leads to OOM.
#9 0xf23a59 in token_array_push libpostal/BUILD/src/./tokens.h:19:1
#10 0xf23675 in tokenize_add_tokens libpostal/BUILD/src/scanner.re:251:9
#11 0xf23ce2 in tokenize_keep_whitespace libpostal/BUILD/src/scanner.re:260:5
#12 0x56f28d in expand_alternative_phrase_option libpostal/BUILD/src/expand.c:1418:27
#13 0x571bc9 in expand_address_phrase_option libpostal/BUILD/src/expand.c:1590:13
#14 0x57226f in expand_address libpostal/BUILD/src/expand.c:1619:12
#15 0x557c3d in libpostal_expand_address libpostal/BUILD/src/libpostal.c:51:30
Here's what I was expecting
Anything but OOM.
Here's what I think could be improved
The tokenize code here should be improved:
void tokenize_add_tokens(token_array *tokens, const char *input, size_t len, bool keep_whitespace) {
scanner_t scanner = scanner_from_string(input, len);
...
size_t consumed = 0;
// scanner.cursor never steps forward
while (consumed < len && (token_type = scan_token(&scanner)) != END) {
token_start = scanner.start - scanner.src;
token_length = scanner.cursor - scanner.start;
if ((token_type == WHITESPACE && !keep_whitespace) || (token_type == INVALID_CHAR)) {
continue;
}
...
        token_array_push(tokens, token); // ===> keeps pushing tokens
        consumed += token_length;        // ===> 'token_length' is always 0 here, so 'consumed' never grows
}
}
Check whether `scanner.cursor` equals `scanner.start`; if the scanner made no progress, break out of the token scanning loop in `tokenize_add_tokens` instead of looping forever.