grace Question: how to use Unicode-based ranges?

Hello @johnyob, thanks for your work!

I'm trying to use grace in a compiler. However, our language needs to support Unicode characters as first-class citizens. What is the easiest way to use the library with ranges that correspond to Unicode characters rather than bytes? Maybe by allowing custom source readers?

May 19 '25 17:05 EmileRolley

Hey 👋

Grace should be able to support Unicode characters out of the box with UTF-8 encoding.

May 20 '25 17:05 johnyob

Hey,

To be more precise, I want to be able to use Unicode-based ranges instead of byte ranges.

For example, to get the expected output of this test:

https://github.com/johnyob/grace/blob/3519e4b4884d9f83951489fce5075fb477bd762c/test/ansi_renderer/test_ansi_renderer.ml#L446-L469

I would like to specify the range as ~range:(range ~source 7 12).

May 21 '25 10:05 EmileRolley

What is the underlying data structure you're trying to index into with these ranges?

Grace is built around OCaml's approach of encoding textual data as a sequence of bytes (as such, our ranges are byte index ranges). Is this not sufficient for your use case?

May 21 '25 13:05 johnyob

I'm parsing UTF-8 text files via yaml and sedlex, which both provide Unicode-based positions. Since the parsed text is mainly in French, it makes sense to express positions in Unicode characters rather than bytes, to allow better integration with code editors.

For example, in this source:

élaboré: ma variable

I expect the beginning of ma variable to be at line 1 column 10 instead of column 12 as it would be with bytes.
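
To make the two numbers concrete, here is a minimal sketch using only the OCaml standard library (it assumes OCaml 4.14 or later for String.get_utf_8_uchar). The helper name scalar_count is hypothetical and is not part of Grace, yaml, or sedlex.

```ocaml
(* Count the Unicode scalar values encoded in the first [byte_len] bytes
   of the UTF-8 string [s]. Assumes [byte_len] is on a character boundary. *)
let scalar_count (s : string) (byte_len : int) : int =
  let rec go i n =
    if i >= byte_len then n
    else
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) (n + 1)
  in
  go 0 0

(* "élaboré: ma variable", with each "é" written as an escape so the
   example does not depend on the encoding of this file. *)
let src = "\u{E9}labor\u{E9}: ma variable"

(* 0-based byte offset of "ma variable": each "é" takes two bytes. *)
let byte_offset = 11

let () =
  (* 1-based columns *)
  Printf.printf "byte column: %d\n" (byte_offset + 1);
  Printf.printf "scalar column: %d\n" (scalar_count src byte_offset + 1)
```

This prints a byte column of 12 and a scalar column of 10, which is the difference described above.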

May 21 '25 14:05 EmileRolley

> I'm parsing UTF-8 text files via yaml and sedlex, which both provide Unicode-based positions. Since the parsed text is mainly in French, it makes sense to express positions in Unicode characters rather than bytes, to allow better integration with code editors.

Both yaml and sedlex give you byte index positions.

For sedlex, you may use Sedlexing.byte_loc lexbuf to get the start and end positions for the range. For yaml, the Mark.t type for positions contains a field called index which records the byte index.

For code editors, if you can list the language protocols you need and point to their relevant docs, I can look at adding dedicated support for them.

May 21 '25 14:05 johnyob

> Both yaml and sedlex give you byte index positions.
>
> For sedlex, you may use Sedlexing.byte_loc lexbuf to get the start and end positions for the range. For yaml, the Mark.t type for positions contains a field called index which records the byte index.

Yes, but in the end I would like to output Unicode positions, not byte positions.
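
One way to get there while still keeping byte offsets internally is to convert only at output time. The following is a rough sketch under stated assumptions: line_col_of_byte is a hypothetical helper (not a Grace function), it needs OCaml 4.14 or later, it treats only "\n" as a line break, and it counts columns in Unicode scalar values.

```ocaml
(* Convert a 0-based byte offset into a 1-based (line, column) pair,
   where the column counts Unicode scalar values rather than bytes. *)
let line_col_of_byte (s : string) (byte_offset : int) : int * int =
  let stop = min byte_offset (String.length s) in
  let rec go i line col =
    if i >= stop then (line, col)
    else if s.[i] = '\n' then go (i + 1) (line + 1) 1
    else
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) line (col + 1)
  in
  go 0 1 1

(* Two lines: "clé: a" and "élaboré: ma variable". *)
let text = "cl\u{E9}: a\n\u{E9}labor\u{E9}: ma variable"

let () =
  (* Byte 19 is the "m" of "ma variable" on the second line. *)
  let line, col = line_col_of_byte text 19 in
  Printf.printf "line %d, column %d\n" line col
```

The conversion is linear in the offset, so for many lookups in a large file it would probably be worth precomputing line start offsets first.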

May 21 '25 14:05 EmileRolley

Positions and ranges will continue to be represented as byte offsets -- this is a deliberate design choice in Grace. If you have a strong need for Unicode scalar positions, contributions are welcome.

Why bytes?

- Byte positions are idiomatic in OCaml
- Most OCaml parsing libraries (e.g., Sedlex, Menhir) operate on byte offsets
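
If a front end only has positions counted in Unicode scalar values, they can be translated to byte offsets before a byte-based range is built. This is a minimal sketch, assuming OCaml 4.14 or later and valid UTF-8 input; byte_of_scalar_index and byte_range_of_scalar_range are hypothetical names, not part of Grace.

```ocaml
(* Byte offset of the [n]-th Unicode scalar value (0-based) in [s].
   Returns [String.length s] when [n] is past the end of the string. *)
let byte_of_scalar_index (s : string) (n : int) : int =
  let len = String.length s in
  let rec go i k =
    if k <= 0 || i >= len then i
    else
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) (k - 1)
  in
  go 0 n

(* Translate a half-open range of scalar indices into a half-open byte range. *)
let byte_range_of_scalar_range (s : string) ((start, stop) : int * int) : int * int =
  (byte_of_scalar_index s start, byte_of_scalar_index s stop)

(* "élaboré: ma variable" *)
let src = "\u{E9}labor\u{E9}: ma variable"

let () =
  (* "ma variable" spans scalar indices 9..20, which is bytes 11..22. *)
  let b_start, b_stop = byte_range_of_scalar_range src (9, 20) in
  Printf.printf "bytes %d..%d\n" b_start b_stop
```

The resulting pair can then be handed to whatever constructor the byte-based library expects.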

Jun 05 '25 18:06 johnyob