packcc
Draft: Support unicode characters in character classes
This is my attempt to add correct handling of Unicode characters in character classes (see #8). The code assumes UTF-8 encoding, which I believe is a reasonable assumption and one that is already made in other places in the code (e.g. in unescape_string()).
I have tested the changes with ctags and they seem to work fine for my use case. Unfortunately, there are no tests in this repository, so I can't really prove it works. I could add some testing infrastructure and basic tests, if desired.
I found some more corner cases where it doesn't work correctly. It'll need a little more work.
It should be fully functional now, and the generated code is slightly more human-readable than before.
The example grammar from #8 now generates this code:
static pcc_thunk_chunk_t *pcc_evaluate_rule_TEST(pcc_context_t *ctx) {
    pcc_thunk_chunk_t *chunk = pcc_thunk_chunk__create(ctx->auxil);
    chunk->pos = ctx->pos;
    {
        char* c;
        if (pcc_refill_buffer(ctx, 1) < 1) goto L0000;
        c = ctx->buffer.buf + ctx->pos;
        int w = pcc_char_utf8_width(*c);
        if (w > 1 && pcc_refill_buffer(ctx, w - 1) < (w - 1)) goto L0000;
        if (!(
            (PCC_UTF8_CODEPOINT(w, c) == 0xe188b4)
        )) goto L0000;
        ctx->pos += w;
    }
    return chunk;
L0000:;
    pcc_thunk_chunk__destroy(ctx->auxil, chunk);
    return NULL;
}
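The generated code relies on pcc_char_utf8_width() and PCC_UTF8_CODEPOINT(), whose definitions are not shown here. The constant 0xe188b4 is the three UTF-8 bytes of U+1234 (E1 88 B4) packed into one integer, which suggests the comparison is done on packed bytes rather than on a decoded code point. The following is only a minimal sketch of what such helpers might look like under that assumption. The names utf8_width, utf8_pack and utf8_demo are hypothetical stand-ins, not the PR's actual code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-in for pcc_char_utf8_width():
 * sequence length implied by the UTF-8 lead byte. */
static int utf8_width(char ch) {
    unsigned char b = (unsigned char)ch;
    if (b < 0x80) return 1;            /* 0xxxxxxx: ASCII */
    if ((b & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((b & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((b & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 1; /* assumption: a stray continuation or invalid byte is consumed alone */
}

/* Hypothetical stand-in for PCC_UTF8_CODEPOINT():
 * pack the w raw bytes at c into one integer, first byte most significant. */
static uint32_t utf8_pack(int w, const char *c) {
    uint32_t v = 0;
    for (int i = 0; i < w; i++) {
        v = (v << 8) | (unsigned char)c[i];
    }
    return v;
}

/* Usage: the character from the example grammar, U+1234. */
static void utf8_demo(void) {
    const char *s = "\xe1\x88\xb4";
    int w = utf8_width(*s);
    assert(w == 3);
    assert(utf8_pack(w, s) == 0xe188b4u);
}
```

Packing the raw bytes avoids a full decode step and still gives every valid UTF-8 sequence a distinct integer value.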
All other cases (multiple characters, negation, ranges, etc.) should also be handled correctly.
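As an illustration of how ranges and negation could work, here is a rough sketch. It assumes each character is represented by its raw UTF-8 bytes packed into an integer, as the constant 0xe188b4 (the UTF-8 bytes of U+1234) in the generated code suggests. Integer comparison of such packed values preserves code point order for valid UTF-8, so a range reduces to two comparisons. The names in_range, match_class, match_negated_class and class_demo are hypothetical, and the code the PR actually generates may look different.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Packed UTF-8 byte values, not decoded code points. */
#define PACKED_ALPHA 0xCEB1u   /* U+03B1, bytes CE B1 */
#define PACKED_OMEGA 0xCF89u   /* U+03C9, bytes CF 89 */
#define PACKED_U1234 0xE188B4u /* U+1234, bytes E1 88 B4 */

/* Inclusive range test on packed values. */
static bool in_range(uint32_t v, uint32_t lo, uint32_t hi) {
    return lo <= v && v <= hi;
}

/* Class with one range and one single character: U+03B1..U+03C9 or U+1234. */
static bool match_class(uint32_t v) {
    return in_range(v, PACKED_ALPHA, PACKED_OMEGA) || v == PACKED_U1234;
}

/* Negated form of the same class. */
static bool match_negated_class(uint32_t v) {
    return !match_class(v);
}

/* Usage: a few spot checks. */
static void class_demo(void) {
    assert(match_class(0xCEB2u));         /* U+03B2 lies inside the range */
    assert(match_class(PACKED_U1234));    /* the single listed character */
    assert(!match_class(0x61u));          /* ASCII 'a' is outside the class */
    assert(match_negated_class(0x61u));   /* so the negated class accepts it */
}
```

The ordering property holds because longer UTF-8 sequences start with larger lead bytes, so a packed multi-byte value is always greater than any packed shorter one.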