packcc

Draft: Support unicode characters in character classes

Open dolik-rce opened this issue 4 years ago • 2 comments

This is my attempt to add correct handling of Unicode characters in character classes (see #8). The code assumes UTF-8 encoding, which I believe is a reasonable assumption and one the code already makes elsewhere (e.g. in unescape_string()).

I have tested the changes with ctags and they seem to work fine for my use case. Unfortunately, there are no tests in this repository, so I can't really prove that it works. I could add some testing infrastructure and basic tests, if desired.

dolik-rce, Jan 03 '21 07:01

I found some more corner cases where it doesn't work correctly. It'll need a little more work.

dolik-rce, Jan 03 '21 08:01

It should be fully functional now, and the generated code is slightly more human-readable than before.

The example grammar from #8 now generates this code:

static pcc_thunk_chunk_t *pcc_evaluate_rule_TEST(pcc_context_t *ctx) {
    pcc_thunk_chunk_t *chunk = pcc_thunk_chunk__create(ctx->auxil);
    chunk->pos = ctx->pos;
    {
        char* c;
        if (pcc_refill_buffer(ctx, 1) < 1) goto L0000;
        c = ctx->buffer.buf + ctx->pos;
        int w = pcc_char_utf8_width(*c);
        if (w > 1 && pcc_refill_buffer(ctx, w - 1) < (w - 1)) goto L0000;
        if (!(
            (PCC_UTF8_CODEPOINT(w, c) == 0xe188b4)
        )) goto L0000;
        ctx->pos += w;
    }
    return chunk;
L0000:;
    pcc_thunk_chunk__destroy(ctx->auxil, chunk);
    return NULL;
}

All other cases (multiple characters, negation, ranges, etc.) should also be handled correctly.
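
For readers following the generated code: the constant 0xe188b4 is consistent with the three UTF-8 bytes E1 88 B4, which encode U+1234, packed into one integer. The sketch below illustrates how a width lookup and such a packing could work. The names sketch_utf8_width, sketch_utf8_pack and sketch_in_range are hypothetical and are not the actual pcc_char_utf8_width or PCC_UTF8_CODEPOINT from this change; the sketch also assumes well-formed UTF-8 input and does no validation beyond the lead byte.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: byte width of a UTF-8 sequence, judged from its
 * lead byte only. Returns 0 for bytes that cannot start a sequence
 * (continuation bytes and 0xF8..0xFF). */
static int sketch_utf8_width(unsigned char lead) {
    if (lead < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;
}

/* Hypothetical helper: pack the w bytes at s, most significant first,
 * into one integer. This yields the raw UTF-8 bytes, not the decoded
 * code point. */
static uint32_t sketch_utf8_pack(int w, const char *s) {
    uint32_t v = 0;
    assert(w >= 1 && w <= 4);
    for (int i = 0; i < w; i++) {
        v = (v << 8) | (unsigned char)s[i];
    }
    return v;
}

/* Hypothetical range test: for well-formed UTF-8, packed values sort in
 * the same order as code points, so a class range can be checked by
 * comparing packed values (lo and hi are packed the same way). */
static int sketch_in_range(const char *s, uint32_t lo, uint32_t hi) {
    int w = sketch_utf8_width((unsigned char)s[0]);
    if (w == 0) return 0;
    uint32_t v = sketch_utf8_pack(w, s);
    return lo <= v && v <= hi;
}
```

With these definitions, sketch_utf8_pack(3, "\xE1\x88\xB4") gives 0xe188b4, the value compared against in the generated code above.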

dolik-rce, Jan 03 '21 13:01