sedlex
Matching a unicode character without codepoint
Currently,
match%sedlex lexbuf with | "ρ" -> ..
does not match, although
match%sedlex lexbuf with | math -> if Sedlexing.Utf8.lexeme lexbuf = "ρ" then ..
does.
Is there any way of making the first variant work, without having to replace "ρ" with its underlying codepoint, so that I can use it as part of a more complex regexp?
I just ran into this as well. I tried two different ways:
open Printf

let next_tok buf =
  let open Sedlexing.Utf8 in
  let fn = [%sedlex.regexp? Chars "-+×÷"] in
  match%sedlex buf with
  | Chars "+-×÷" -> sprintf "with Chars: %s" (lexeme buf)
  | "+"|"-"|"×"|"÷" -> sprintf "with Bars: %s" (lexeme buf)
  | _ -> failwith (sprintf "Unexpected character: %s" (lexeme buf))

let test_tok =
  Sedlexing.Utf8.from_string "+" |> next_tok |> print_string; print_newline ();
  Sedlexing.Utf8.from_string "÷" |> next_tok |> print_string; print_newline ();
This prints with Chars: + for the first line and then errors out with exception Failure("Unexpected character: ") for the second.
It's not clear to me whether the problem is with the match cases, the character iteration, or both. The unexpected character is not printed, which seems to indicate that the lexeme being processed doesn't include the whole Unicode codepoint.
Is there a better way to do this?
It looks like I can use the raw codepoints instead, like this:
| 0x00D7 | 0x00F7 -> sprintf "with Codepoints: %s" (lexeme buf)
This is a bit of a hassle to generate the codepoints for all the characters I need, but I think is a reasonable workaround.
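The codepoint list for the workaround above can be generated rather than looked up by hand. The following is a minimal sketch, assuming OCaml 4.14 or later (for `String.get_utf_8_uchar`); the helper names `codepoints_of_utf8` and `sedlex_alternation` are made up for illustration and are not part of sedlex. The input is written with byte escapes so that the example does not depend on how the source file is encoded.

```ocaml
(* Decode a UTF-8 string into its Unicode code points.
   Invalid bytes decode to U+FFFD, one byte at a time. *)
let codepoints_of_utf8 (s : string) : int list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      let cp = Uchar.to_int (Uchar.utf_decode_uchar d) in
      go (i + Uchar.utf_decode_length d) (cp :: acc)
  in
  go 0 []

(* Render the code points as an alternation that can be pasted
   into a match%sedlex case. *)
let sedlex_alternation (s : string) : string =
  codepoints_of_utf8 s
  |> List.map (Printf.sprintf "0x%04X")
  |> String.concat " | "

let () =
  (* "\xC3\x97" and "\xC3\xB7" are the UTF-8 bytes of U+00D7 and U+00F7. *)
  print_endline (sedlex_alternation "+-\xC3\x97\xC3\xB7")
  (* prints: 0x002B | 0x002D | 0x00D7 | 0x00F7 *)
```

The printed line can then be copied into the lexer as a single match case.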
~It's also still puzzling to me why the "unexpected character" case isn't printing the correct character.~
Edit: it looks like lexeme buf is the empty string here, which makes sense for cases where nothing matched.
OCaml does not currently provide a guarantee that Unicode can be embedded without trouble in source files. It would be nice if everyone agreed that source files were encoded as UTF-8, but that is not yet the case.
Ah, I see. Yes, that's unfortunate.
Could sedlex just make that assumption and document that any strings that show up in the match%sedlex clause are assumed to be in UTF-8? Alternatively, could it use the system's locale to decide?
How does ocamlc interpret source files? If you have a file encoded in UTF-16, would it work?
Could sedlex just make that assumption
No, it can't, unfortunately. Among other things, as things stand, using js_of_ocaml puts you into a situation where strings are interpreted as UTF-16. The situation is messy.
Alternatively could it use the system's locale to decide?
Not really.
How does ocamlc interpret source files?
It used to assume that things were in ISO Latin-1. Then that got very partially obsoleted but without any definitive move to setting an actual reasonable permanent standard. Most programming languages have now adopted UTF-8 as an encoding for source files, but at this point OCaml hasn't.
A lot of people will claim that if you just use UTF-8 in strings this will work most of the time and should be fine. In fact, it can break in subtle ways. It's important that OCaml adopt an actual policy on what the encoding is, but it hasn't.
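One way to see the kind of ambiguity described here is to look at the bytes directly. The sketch below assumes OCaml 4.14 or later (for `String.is_valid_utf_8`) and uses only the standard library; the names `utf8_div` and `latin1_div` are illustrative. It shows that the same character, U+00F7, is a different byte string in each encoding, and that a consumer reading the UTF-8 bytes one byte per character sees two unrelated characters. Whether that is exactly what happens inside sedlex is an assumption here, not something the thread confirms.

```ocaml
let utf8_div = "\xC3\xB7"   (* U+00F7 encoded as UTF-8: two bytes *)
let latin1_div = "\xF7"     (* U+00F7 encoded as ISO Latin-1: one byte *)

let () =
  Printf.printf "UTF-8 bytes: %d, valid UTF-8: %b\n"
    (String.length utf8_div) (String.is_valid_utf_8 utf8_div);
  Printf.printf "Latin-1 bytes: %d, valid UTF-8: %b\n"
    (String.length latin1_div) (String.is_valid_utf_8 latin1_div);
  (* Reading the UTF-8 bytes one byte per character, as a Latin-1
     consumer would, yields U+00C3 and U+00B7 instead of U+00F7. *)
  String.iter (fun c -> Printf.printf "U+%04X " (Char.code c)) utf8_div;
  print_newline ()
```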
For now, just use the codepoint for Sedlex and you'll be much happier.
I see. Thanks for taking the time to explain. I'll use the codepoints.
No, it can't, unfortunately. Among other things, as things stand, using js_of_ocaml puts you into a situation where strings are interpreted as UTF-16. The situation is messy.
I don't think this is correct. JavaScript encodes its strings at runtime as UTF-16, but JavaScript source files are usually UTF-8. Js_of_ocaml treats OCaml strings as sequences of bytes, and even assumes they are UTF-8 encoded when converting them to JavaScript UTF-16 strings.
What about providing a new constructor Utf8 and treating OCaml strings inside it as UTF-8 encoded?
let next_tok buf =
let open Sedlexing.Utf8 in
let fn = [%sedlex.regexp? Chars "-+×÷"] in
match%sedlex buf with
| Utf8 (Chars "+-×÷") -> sprintf "with Chars: %s" (lexeme buf)
| Utf8 ("+"|"-"|"×"|"÷") -> sprintf "with Bars: %s" (lexeme buf)
| _ -> failwith (sprintf "Unexpected character: %s" (lexeme buf))
I've implemented a PoC in https://github.com/hhugo/sedlex/tree/utf8
You cannot alter the OCaml lexer through the use of a constructor written at the level of the language. The fact that something "usually" works isn't a guarantee that it will work consistently.
I don't think there is a need to change the OCaml lexer. All special characters that are not interpreted verbatim in string literals are ASCII, and any occurrence of those bytes in a UTF-8 encoded string literal really does correspond to those ASCII characters.
I don't think there is a need to change the OCaml lexer, all special characters not interpreted verbatim in strings literal are ASCII,
The lexer itself doesn't need to be changed much to adopt a policy of UTF-8 encoding throughout, but it does currently happily accept Latin-1, including in identifiers and strings. If you want to hand Unicode code points to code dealing with strings, you want tools that can safely presume they will be given valid UTF-8, which means something has to validate that (for example) input strings are valid UTF-8.
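As an illustration of what such validation could look like at the library level (not in the compiler, and not the patches mentioned in this thread), here is a minimal sketch assuming OCaml 4.14 or later. The function name `decode_utf8` is hypothetical. It either returns the decoded code points or reports the byte offset of the first invalid sequence, so that later code never sees malformed input.

```ocaml
(* Decode a string to code points, rejecting anything that is not
   valid UTF-8. On failure, return the byte offset of the first
   invalid sequence. *)
let decode_utf8 (s : string) : (Uchar.t list, int) result =
  let rec go i acc =
    if i >= String.length s then Ok (List.rev acc)
    else
      let d = String.get_utf_8_uchar s i in
      if Uchar.utf_decode_is_valid d then
        go (i + Uchar.utf_decode_length d) (Uchar.utf_decode_uchar d :: acc)
      else Error i
  in
  go 0 []

let () =
  (* "a" followed by the UTF-8 bytes of U+00F7. *)
  match decode_utf8 "a\xC3\xB7" with
  | Ok cps ->
    List.iter (fun u -> Printf.printf "U+%04X\n" (Uchar.to_int u)) cps
  | Error i -> Printf.printf "invalid UTF-8 at byte %d\n" i
```

A Latin-1 encoded input such as "a\xF7" takes the Error branch instead, which is the case a tool would need to handle before treating the string as Unicode.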
(Note that I proposed patches on this a couple of years ago and they got a bunch of pushback. If there's a desire to do this, I'm happy to support it and to get my patches to apply to the current compiler.)