Question: how to use Unicode-based ranges?

Open EmileRolley opened this issue 7 months ago • 6 comments

Hello @johnyob, thanks for your work!

I'm trying to use Grace in a compiler. However, our language needs to support Unicode characters as first-class citizens. What is the easiest way to use the lib with ranges that correspond to Unicode characters rather than bytes? Maybe by allowing custom source readers to be created?

EmileRolley avatar May 19 '25 17:05 EmileRolley

Hey 👋

Grace should be able to support Unicode characters out of the box with UTF-8 encoding.

johnyob avatar May 20 '25 17:05 johnyob

Hey,

To be more precise, I want to be able to use Unicode-based ranges instead of byte ranges.

For example, to have the expected output:

https://github.com/johnyob/grace/blob/3519e4b4884d9f83951489fce5075fb477bd762c/test/ansi_renderer/test_ansi_renderer.ml#L446-L469

I would like to specify the range as ~range:(range ~source 7 12).

EmileRolley avatar May 21 '25 10:05 EmileRolley

What is the underlying data structure you're trying to index into with these ranges?

Grace is built around OCaml's approach of encoding textual data as a sequence of bytes (as such, our ranges are byte index ranges). Is this not sufficient for your use case?

johnyob avatar May 21 '25 13:05 johnyob

I'm parsing UTF-8 text files via yaml and sedlex, which both provide Unicode-based positions. And as the parsed text is mainly in French, it makes sense to have positions in terms of Unicode characters instead of bytes, to allow better integration with code editors.

For example, in this source:

élaboré: ma variable

I expect the beginning of ma variable to be at line 1 column 10 instead of column 12 as it would be with bytes.
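To make the difference concrete, here is a minimal sketch of turning a byte offset into a code point count using only the standard library (OCaml 4.14 or later). The helper name code_points_before is hypothetical and not part of Grace, yaml, or sedlex; it assumes the byte offset lies on a character boundary and within the string.

```ocaml
(* Hypothetical helper, not part of Grace: count the Unicode scalar values
   encoded in the first [byte_offset] bytes of the UTF-8 string [s].
   Assumes [byte_offset] is on a character boundary and
   [byte_offset <= String.length s]. Requires OCaml >= 4.14. *)
let code_points_before (s : string) (byte_offset : int) : int =
  let rec go i n =
    if i >= byte_offset then n
    else
      (* Decode one UTF-8 sequence and skip over the bytes it consumed. *)
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) (n + 1)
  in
  go 0 0

let () =
  let src = "élaboré: ma variable" in
  (* "ma" starts at 0-based byte offset 11, i.e. 1-based byte column 12. *)
  let byte_offset = 11 in
  let column = code_points_before src byte_offset + 1 in
  (* Prints: byte column 12, code point column 10 *)
  Printf.printf "byte column %d, code point column %d\n"
    (byte_offset + 1) column
```

Each é takes two bytes in UTF-8, which accounts for the gap of two between the byte column and the code point column.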

EmileRolley avatar May 21 '25 14:05 EmileRolley

I'm parsing UTF-8 text files via yaml and sedlex, which both provide Unicode-based positions. And as the parsed text is mainly in French, it makes sense to have positions in terms of Unicode characters instead of bytes, to allow better integration with code editors.

Both yaml and sedlex give you byte index positions.

For sedlex, you may use Sedlexing.bytes_loc lexbuf to get the start and end positions for the range. For yaml, the Mark.t type for positions contains a field called index which records the byte index.

For code editors, if you provide a list of the language protocols you need to target, along with their relevant docs, I can look at adding dedicated support for them.

johnyob avatar May 21 '25 14:05 johnyob

Both yaml and sedlex give you byte index positions.

For sedlex, you may use Sedlexing.bytes_loc lexbuf to get the start and end positions for the range. For yaml, the Mark.t type for positions contains a field called index which records the byte index.

Yes, but ultimately I would like to output Unicode positions, not byte positions.

EmileRolley avatar May 21 '25 14:05 EmileRolley

Positions and ranges will continue to be represented as byte offsets -- this is a deliberate design choice in Grace. If you have a strong need for Unicode scalar positions, contributions are welcome.

Why bytes?

  • Byte positions are idiomatic in OCaml
  • Most OCaml parsing libraries (e.g., Sedlex, Menhir) operate on byte offsets
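For anyone who needs Unicode positions in their own output while Grace keeps byte offsets, one possible workaround is to convert at the boundary. The sketch below is only an illustration under stated assumptions: line_col_of_byte_offset is a hypothetical helper, not a Grace API; it assumes valid UTF-8, "\n" line endings, and an offset on a character boundary, and it needs OCaml 4.14 or later. It counts columns in Unicode scalar values, so editors that expect another unit (the Language Server Protocol defaults to UTF-16 code units, for instance) would need a different counting rule.

```ocaml
(* Hypothetical helper, not part of Grace: map a 0-based byte offset in the
   UTF-8 string [s] to a 1-based (line, column) pair, with the column counted
   in Unicode scalar values. Assumes "\n" line endings and an offset that
   lies on a character boundary. Requires OCaml >= 4.14. *)
let line_col_of_byte_offset (s : string) (byte_offset : int) : int * int =
  let rec go i line col =
    if i >= byte_offset then (line, col)
    else if s.[i] = '\n' then go (i + 1) (line + 1) 1
    else
      (* Advance by one whole UTF-8 sequence, but by only one column. *)
      let d = String.get_utf_8_uchar s i in
      go (i + Uchar.utf_decode_length d) line (col + 1)
  in
  go 0 1 1

let () =
  let src = "titre: été\nélaboré: ma variable" in
  (* Byte range of "ma variable" on the second line: [24, 35). *)
  let l1, c1 = line_col_of_byte_offset src 24 in
  let l2, c2 = line_col_of_byte_offset src 35 in
  Printf.printf "%d:%d-%d:%d\n" l1 c1 l2 c2
```

The byte range stays the source of truth for the diagnostic itself; only the positions shown to the user or sent to an editor go through the conversion.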

johnyob avatar Jun 05 '25 18:06 johnyob