libpostal icon indicating copy to clipboard operation
libpostal copied to clipboard

BUG: invalid characters lead `tokenize_add_tokens` to endless loop and OOM

Open SunHao-0 opened this issue 3 years ago • 0 comments

Hi!

I was checking out libpostal, and found a endless looping keeps mallocing memory which leads to OOM finally, and can be a serious problem in production environment (e.g. DOS).


Here's how I'm using libpostal

The source code:

    ...
    unsigned char buf[2] = {0xff, 0x00};     //   ====> invalid character, value between 128-255
    libpostal_normalize_options_t options;
    size_t num_expansions;
    char **expansions;

    options = libpostal_get_default_options();  
    expansions = libpostal_expand_address((char*)buf, options, &num_expansions);  // ====> endless looping here, OOM
    ...

Here's what I got

Here is the function call trace, the loop in token_array_push is endless while pushing entry to vector which leads to OOM.

    #9 0xf23a59 in token_array_push libpostal/BUILD/src/./tokens.h:19:1
    #10 0xf23675 in tokenize_add_tokens libpostal/BUILD/src/scanner.re:251:9
    #11 0xf23ce2 in tokenize_keep_whitespace libpostal/BUILD/src/scanner.re:260:5
    #12 0x56f28d in expand_alternative_phrase_option libpostal/BUILD/src/expand.c:1418:27
    #13 0x571bc9 in expand_address_phrase_option libpostal/BUILD/src/expand.c:1590:13
    #14 0x57226f in expand_address libpostal/BUILD/src/expand.c:1619:12
    #15 0x557c3d in libpostal_expand_address libpostal/BUILD/src/libpostal.c:51:30

Here's what I was expecting

Anything but not OOM


Here's what I think could be improved

Tokenize code here should be imporved:

void tokenize_add_tokens(token_array *tokens, const char *input, size_t len, bool keep_whitespace) {
    scanner_t scanner = scanner_from_string(input, len);
    ...
    size_t consumed = 0;

    // scanner.cursor never steps forward
    while (consumed < len && (token_type = scan_token(&scanner)) != END) {
        token_start = scanner.start - scanner.src;
        token_length = scanner.cursor - scanner.start;   
     
        if ((token_type == WHITESPACE && !keep_whitespace) || (token_type == INVALID_CHAR)) {
            continue;
        }
       ...
        token_array_push(tokens, token);  //  ===> keep pushing
        consumed += token_length;   //  ===>  'token_length' always equals to 0
    }

}

Check if scanner.cursor equals to scanner.start, make sure we can break out of the token scanning loop in tokenize_add_tokens.

SunHao-0 avatar May 29 '21 15:05 SunHao-0