dune icon indicating copy to clipboard operation
dune copied to clipboard

Synopsis & description characters are escaped

Open toastal opened this issue 1 year ago • 9 comments

Expected Behavior

Dune format does nothing:

(package
 (name cafe_menu_mgr)
 (synopsis "Menu manager for a café named, กาแฟ")
 (description "Add & remove items from กาแฟ’s menu, manage prices, etc."))

Actual Behavior

Dune formats does:

(package
 (name cafe_menu_mgr)
 (synopsis
  "Menu manager for a caf\195\169 named, \224\184\129\224\184\178\224\185\129\224\184\159")
 (description
  "Add & remove items from \224\184\129\224\184\178\224\185\129\224\184\159\226\128\153s menu, manage prices, etc."))

Reproduction

  • PR with a reproducing test:
  1. Init a project
  2. Add non-ASCII characters to synopsis / description
  3. Dune Format

Specifications

  • Version of dune (output of dune --version): 3.12.1
  • Version of ocaml (output of ocamlc --version): 4.14.1
  • Operating system (distribution and version): Fedora Linux 39

Additional information

Discovered here: https://discuss.ocaml.org/t/why-can-t-i-create-a-project-with-non-ascii-characters/13865/12

@yawaramin says this shouldn’t happen for certain fields so they can still be human-readable.

toastal avatar Jan 14 '24 02:01 toastal

I'd certainly expect dune to not touch any values inside string fields. opam doesn't.

yawaramin avatar Jan 14 '24 02:01 yawaramin

I had a look at this a while ago, so let me comment on this (actually, even some of the Dune authors have non-ascii characters in their names!).

In short, this is a CST vs AST issue: assuming a UTF-8 encoded dune file, Dune will parse "é" and "\195\169" the same. It will pass the same string down the program, and in particular it should produce the same opam file (probably one with escapes).

To support that properly in dune fmt, the formatter needs to be able to tell the difference between these concrete syntaxes. For example, its String constructor would contain the unescaped contents of strings.

We already have 2 parsers (for AST and CST), but instead of doing AST -> CST (which would be a lossy operation), we instead materialize CSTs from ASTs (where we "invent" which concrete syntax was used) for performance reasons.

The situation is described here:

https://github.com/ocaml/dune/blob/3.12.2/src/dune_sexp/parser.ml#L5-L15

It describes several options: do the conversion the other way around (probably too slow), write another parser, GADT tricks.

emillon avatar Jan 15 '24 15:01 emillon

Can we make dune not parse anything inside double-quotes ie treat it as opaque binary data?

yawaramin avatar Jan 15 '24 17:01 yawaramin

Just to clarify one thing: Dune is not mangling user data here. If you put "é" or "\195\169" in (synopsis) in dune-project, it will generate é in the synopsis: field of the generated opam file. In turn, opam should be able to display that, ocaml.org should be able to render it, etc.

The thing you're seeing is an issue with dune format-dune-file. We can't treat the inside of quoted strings as binary data because we have to find the closing quote, look for escaped quotes, etc. We could have a version that does not interpret the escapes. That's the CST thing I described in my previous message together with some solutions.

To make an analogy with ocaml code, print_endline "a b" or print_endline "a\x20b" are different programs but will do the same thing at runtime. The ocaml compiler does not care about the difference between these programs, ocamlformat does. That's why it has a dedicated parser.

emillon avatar Jan 15 '24 18:01 emillon

Here's a high-level overview of how I think it could be solved:

First, we need tests for this. For example, add the following to format-dune-file.t (or any of the strings in this issue)

  $ dune format-dune-file << EOF
  > ("Étienne")
  > EOF

Template payloads should be covered too, %{bin:é}, but probably not atoms.

Then, to have a working roundtrip through format-dune-file, we need 2 things:

  1. a CST parser that actually parses concrete syntax
  2. a CST formatter that actually formats concrete syntax

The first point is false at the moment: in dune_sexp/lexer.mll, when \t is found (for example), only the single character \t is added to the buffer. To make Encoded.t a CST, we need to store \ and t (two characters). Then, this requires running some conversion in Encoded.to_ast, which is currently the identity function.

(for performance, it is probably critical that to_ast remains the identity function. in a second implementation step, this can be done by passing a boolean argument to the lexer to encode if we're requiring escaped data or not)

This conversion (string -> string) can be implemented as a standalone function (it will require some tests) in Escape. For example, it would map "hello" to "hello" and {|x\ny|} (4 bytes) to "x\ny" (3 bytes). Some error handling needs to be done in this function, for example detecting short escapes like \x2. Actually, if we're doing it that way, it means that while running in CST mode, we'll now accept more programs. We'll need to determine if that's a problem or not (this means that ("\x2") could be formatted.

Some interesting tests we can write at that stage is something to ensure that (= "É" "\195\137") continues to hold. We also need to test the error paths in the lexer, such as "unterminated hexadecimal escape sequence" and other cases in the lexer.

So, at that stage we have step 1 implemented: a CST parser that hold unescaped strings in Template and Quoted_string, as well as a converter to an AST with these details removed. This is not completely enough because the formatter assumes that the contents of strings in Dune_sexp.T are quoted. In particular, Dune_sexp.T.pp calls Escape.quoted. One thing that's a bit difficult with this new code path is that it's not clear if Dune_sexp.T.t values are meant to be quoted or not. The only code path where it's going to be unescaped is this in Dune_lang.Format:

let pp_simple t = Cst.abstract t |> Option.value_exn |> Ast.remove_locs |> Dune_sexp.pp

And it doesn't feel right to go back to Ast as we're trying to print a CST. So maybe this path could be replaced by a new pp function that operates on Cst directly, like the rest of the module does.

emillon avatar Jan 29 '24 16:01 emillon

There's also an "easy" alternative to that, which is to only change Escape.quoted to recognize utf8 and emit utf8 codepoints as is (without backslash escapes). This is attractive but it doesn't preserve the input file, and also it mandates an encoding on the output file, so that's probably not a good solution.

What do you think @rgrinberg ?

emillon avatar Jan 30 '24 09:01 emillon

it mandates an encoding on the output file, so that's probably not a good solution.

Personally, I don't see a problem with this; what would be the downside with enforcing UTF-8 encoding of Dune files?

nojb avatar Jan 30 '24 11:01 nojb

I was thinking of Windows and especially #9396. But reading that conversation, it seems that we're ready to commit to just utf8, which is a great step forward. Leaving the BOM issue (which is about relaxing the compatibility constraints we've made in the past), I think this means that we can just tweak the formatter to allow it to emit utf8 encoded strings when the contents are utf8-encoded, which is a lot simpler than the CST solution proposed. This probably requires some team discussion though so I'll add that to the next meeting.

emillon avatar Jan 30 '24 11:01 emillon

Sticking to utf8 for encoding seems fine to me. We can always retroactively add an explicit toggle if it doesn't work for someone.

rgrinberg avatar Feb 01 '24 00:02 rgrinberg

I believe https://github.com/ocaml/dune/pull/10113 fixed this.

anmonteiro avatar Mar 27 '24 04:03 anmonteiro