sedlex Matching a unicode character without codepoint

Currently, match%sedlex lexbuf with | "ρ" -> .. does not match, although match%sedlex lexbuf with | math -> if Sedlexing.Utf8.lexeme lexbuf = "ρ" then .. does.

Is there any way of making the first variant work, without having to replace "ρ" with its underlying codepoint, so that I can use it as part of a more complex regexp?

Feb 07 '20 07:02 amblafont

just ran into this also. I tried two different ways:

open Printf

let next_tok buf =
  let open Sedlexing.Utf8 in
  let fn = [%sedlex.regexp? Chars "-+×÷"] in
  match%sedlex buf with
  | Chars "+-×÷" -> sprintf "with Chars: %s" (lexeme buf)
  | "+"|"-"|"×"|"÷" -> sprintf "with Bars: %s" (lexeme buf)
  | _ -> failwith (sprintf "Unexpected character: %s" (lexeme buf))

let test_tok =
  Sedlexing.Utf8.from_string "+" |> next_tok |> print_string; print_newline ();
  Sedlexing.Utf8.from_string "÷" |> next_tok |> print_string; print_newline ();

This prints with Chars: + for the first line and then errors out with exception Failure("Unexpected character: ") for the second.

It's not clear to me whether the problem is with the match cases or the character iteration (or both). The unexpected character isn't printed, which seems to indicate that the lexeme being processed doesn't include the whole Unicode codepoint.
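For anyone trying to narrow this down, here is a small stdlib-only sketch that shows how the "÷" literal looks at the byte level when the source is UTF-8. The helper name `bytes_of` is made up for illustration. If sedlex reads each byte of a string literal as its own code point (an assumption on my part, not something confirmed here), then the pattern "÷" would mean U+00C3 followed by U+00B7 rather than the single code point U+00F7 that the lexer buffer produces.

```ocaml
(* Hypothetical helper: list the byte values of an OCaml string. *)
let bytes_of (s : string) : int list =
  List.init (String.length s) (fun i -> Char.code s.[i])

let () =
  (* "\xC3\xB7" is the UTF-8 encoding of U+00F7 (the division sign),
     written with escapes so the result does not depend on how this
     file is saved. *)
  let s = "\xC3\xB7" in
  Printf.printf "%d bytes:" (String.length s);
  List.iter (Printf.printf " 0x%02X") (bytes_of s);
  print_newline ()
(* prints: 2 bytes: 0xC3 0xB7 *)
```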

Is there a better way to do this?

Jan 12 '22 21:01 ssfrr

looks like I can put the raw codepoints like

  | 0x00D7 | 0x00F7 -> sprintf "with Codepoints: %s" (lexeme buf)

It's a bit of a hassle to generate the codepoints for all the characters I need, but I think it's a reasonable workaround.
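To take some of the hassle out of generating the codepoints, something like the following sketch could print them in a form that can be pasted into a match%sedlex clause. It only uses the standard library and assumes OCaml 4.14 or later (for `String.get_utf_8_uchar`) and that this helper file itself is saved as UTF-8. The function names `codepoints_of_utf8` and `sedlex_pattern_of_utf8` are made up for illustration.

```ocaml
(* Decode a UTF-8 string into its list of Unicode code points.
   Raises Invalid_argument on malformed input. *)
let codepoints_of_utf8 (s : string) : int list =
  let rec go i acc =
    if i >= String.length s then List.rev acc
    else
      let d = String.get_utf_8_uchar s i in
      if not (Uchar.utf_decode_is_valid d) then
        invalid_arg "codepoints_of_utf8: malformed UTF-8"
      else
        go (i + Uchar.utf_decode_length d)
          (Uchar.to_int (Uchar.utf_decode_uchar d) :: acc)
  in
  go 0 []

(* Render the code points as an alternation of hex literals. *)
let sedlex_pattern_of_utf8 (s : string) : string =
  codepoints_of_utf8 s
  |> List.map (Printf.sprintf "0x%04X")
  |> String.concat " | "

let () = print_endline (sedlex_pattern_of_utf8 "+-×÷")
(* prints: 0x002B | 0x002D | 0x00D7 | 0x00F7 *)
```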

~It's also still puzzling to me why the "unexpected character" case isn't printing the correct character.~ edit: it looks like lexeme buf is the empty string here, which makes sense for cases where nothing matched.

Jan 12 '22 21:01 ssfrr

OCaml does not currently provide a guarantee that Unicode can be embedded without trouble in source files. It would be nice if everyone agreed that source files were encoded as UTF-8, but that is not yet the case.

Jan 12 '22 21:01 pmetzger

ah, I see. Yes that's unfortunate.

Could sedlex just make that assumption and document that any strings that show up in the match%sedlex clause are assumed to be in UTF8? Alternatively could it use the system's locale to decide?

How does ocamlc interpret source files? If you have a file encoded in UTF16 would it work?

Jan 12 '22 22:01 ssfrr

> Could sedlex just make that assumption

No, it can't, unfortunately. Among other things, as things stand, using js_of_ocaml puts you into a situation where strings are interpreted as UTF-16. The situation is messy.

> Alternatively could it use the system's locale to decide?

Not really.

> How does ocamlc interpret source files?

It used to assume that things were in ISO Latin-1. Then that got very partially obsoleted but without any definitive move to setting an actual reasonable permanent standard. Most programming languages have now adopted UTF-8 as an encoding for source files, but at this point OCaml hasn't.

A lot of people will claim that if you just use UTF-8 in strings this will work most of the time and should be fine. In fact, it can break in subtle ways. It's important that OCaml adopt an actual policy on what the encoding is, but it hasn't.

For now, just use the codepoint for Sedlex and you'll be much happier.

Jan 13 '22 01:01 pmetzger

I see. Thanks for taking the time to explain. I'll use the codepoints.

Jan 13 '22 03:01 ssfrr

> No, it can't, unfortunately. Among other things, as things stand, using js_of_ocaml puts you into a situation where strings are interpreted as UTF-16. The situation is messy.

I don't think this is correct. JavaScript encodes its strings at runtime as UTF-16, but JavaScript source files are usually UTF-8. Js_of_ocaml treats OCaml strings as sequences of bytes, and even assumes they are UTF-8 encoded when converting them to JavaScript UTF-16 strings.

Jan 13 '22 07:01 hhugo

What about providing a new constructor Utf8 and treating OCaml strings inside it as UTF-8 encoded?

let next_tok buf =
  let open Sedlexing.Utf8 in
  let fn = [%sedlex.regexp? Chars "-+×÷"] in
  match%sedlex buf with
  | Utf8 (Chars "+-×÷") -> sprintf "with Chars: %s" (lexeme buf)
  | Utf8 ("+"|"-"|"×"|"÷") -> sprintf "with Bars: %s" (lexeme buf)
  | _ -> failwith (sprintf "Unexpected character: %s" (lexeme buf))

Jan 13 '22 07:01 hhugo

I've implemented a PoC in https://github.com/hhugo/sedlex/tree/utf8

Jan 13 '22 08:01 hhugo

You cannot alter the OCaml lexer through the use of a constructor written at the level of the language. The fact that something "usually" works isn't a guarantee that it will work consistently.

Jan 13 '22 13:01 pmetzger

I don't think there is a need to change the OCaml lexer: all special characters not interpreted verbatim in string literals are ASCII, and any occurrence of those bytes in a UTF-8 encoded string literal really corresponds to those ASCII characters.
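The property being relied on can be checked with a short stdlib-only sketch (OCaml 4.14 or later assumed; the function name `no_ascii_byte_inside_multibyte` is made up for illustration). In valid UTF-8, every byte of a multi-byte sequence has its high bit set, so a byte such as `"` or `\` can only ever be the ASCII character itself.

```ocaml
(* For a valid UTF-8 string, check that no multi-byte sequence
   contains a byte below 0x80 (i.e. an ASCII byte). *)
let no_ascii_byte_inside_multibyte (s : string) : bool =
  let n = String.length s in
  let rec go i =
    if i >= n then true
    else
      let len = Uchar.utf_decode_length (String.get_utf_8_uchar s i) in
      let high j = Char.code s.[j] >= 0x80 in
      let ok = len = 1 || List.for_all high (List.init len (fun k -> i + k)) in
      ok && go (i + len)
  in
  go 0

let () =
  (* A quote, a backslash, and the UTF-8 encodings of U+00D7, U+00F7
     and U+03C1, written with escapes so the bytes are explicit. *)
  let sample = "\"\\\xC3\x97\xC3\xB7\xCF\x81" in
  Printf.printf "%b\n" (no_ascii_byte_inside_multibyte sample)
(* prints: true *)
```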

Jan 13 '22 13:01 alainfrisch

> I don't think there is a need to change the OCaml lexer: all special characters not interpreted verbatim in string literals are ASCII,

The lexer itself doesn't need to be changed much to adopt a policy of UTF-8 encoding throughout, but it does currently happily accept Latin-1, including in identifiers and strings. If you want to hand Unicode code points to code dealing with strings, you want tools that can safely presume they will be given valid UTF-8, which means validating that (for example) input strings are valid UTF-8.
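As a rough illustration of the kind of validation meant here, the standard library (OCaml 4.14 or later) has `String.is_valid_utf_8`. The wrapper `check_utf8` below is a made-up name and only a sketch of how a tool might reject Latin-1 bytes before treating a string as UTF-8.

```ocaml
(* Accept a string only if it is well-formed UTF-8. *)
let check_utf8 (s : string) : (string, string) result =
  if String.is_valid_utf_8 s then Ok s
  else Error "input is not valid UTF-8"

let () =
  let show = function Ok _ -> "valid" | Error e -> e in
  (* U+00F7 encoded as UTF-8: two bytes. *)
  print_endline (show (check_utf8 "\xC3\xB7"));
  (* The same character as a single Latin-1 byte: not valid UTF-8. *)
  print_endline (show (check_utf8 "\xF7"))
(* prints:
   valid
   input is not valid UTF-8 *)
```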

Jan 13 '22 15:01 pmetzger

(Note that I proposed patches on this a couple of years ago and they got a bunch of pushback. If there's a desire to do this, I'm happy to support it and to get my patches to apply to the current compiler.)

Jan 13 '22 15:01 pmetzger
