ocaml icon indicating copy to clipboard operation
ocaml copied to clipboard

Use '$' instead of '.' to separate module from identifiers in symbols

Open tmcgilchrist opened this issue 1 year ago • 30 comments

Currently, OCaml's native code compiler (ocamlopt) mangles names using the following format camlModule.Submodule.itentifier_stamp. This scheme causes issues with LLDB on macOS(#12933) breaking the ability to set breakpoints and needs to be worked around in the MSVC port (#12640). Changing to $ everywhere provides a consistent naming scheme across platforms, supports all targeted assemblers and debuggers (GDB/LLDB). See https://github.com/ocaml/ocaml/issues/12933#issuecomment-1986260857

There are five places that are impacted by this change:

  1. Generated OCaml names become camlModule$Submodule$identifier_stamp
  2. OCaml built in C identifiers _caml_function do not change eg _caml_alloc
  3. Backend specific names like caml_hot$code_end and caml_system$frametable change to use $. This impacts any users of function sections (#8526) who will need to update linker scripts.
  4. caml_system__code_begin and caml_system__code_end still use double underscores because they are referenced from C.
  5. Linux perf's OCaml demangling support needs updating ( See background discussion)

This change fixes the issue with setting breakpoints in LLDB on macOS (#12933), keeps the original fixes (#8998, #11321) from #11430.

tmcgilchrist avatar Mar 26 '24 03:03 tmcgilchrist

Note that this will break existing demangling implementations, producing non-ascii characters (due to the use of $xx encoding).

At least on Linux, I'd suggest keeping the existing . separator (in my testing, setting breakpoints in symbols containing . works in both lldb and gdb on Linux).

copy avatar Apr 02 '24 09:04 copy

Note that this will break existing demangling implementations, producing non-ascii characters (due to the use of $xx encoding).

Could you provide more detail about which demangling implementations you mean? I am aware that Linux perf needs updating.

The intention with this change is to have a common demangling scheme across operating systems, $ was chosen because it supports Windows MSVC, macOS LLDB and Linux (gdb and lldb).

tmcgilchrist avatar Apr 03 '24 05:04 tmcgilchrist

Could you provide more detail about which demangling implementations you mean? I am aware that Linux perf needs updating.

Here is another one: https://github.com/mstange/samply/blob/bb0dbdf13f10dc22036ef2dc4ead35b9647717d6/samply-symbols/src/demangle_ocaml.rs I'd assume all demangling implementations are broken.

The intention with this change is to have a common demangling scheme across operating systems, $ was chosen because it supports Windows MSVC, macOS LLDB and Linux (gdb and lldb).

While uniformity is nice, this causes a fair amount of breakage and makes demangling more complicated. And . seems to work fine on Linux (and maybe future version of macOS too).

copy avatar Apr 03 '24 06:04 copy

I see the hot potato was assigned to me...

With @let-def's advice and Wikipedia's wisdom (https://en.m.wikipedia.org/wiki/Unambiguous_finite_automaton), I coded an automaton-based check for ambiguities in name mangling schemas: https://gist.github.com/xavierleroy/ace2f70d664d738970aa3034e0751d3d . The verdict is as follows:

Encoding with . separator : unambiguous
Encoding with $ separator : ambiguous, example: Z$ff$ff
Encoding with $ separator and ASCII chars only : unambiguous
Encoding with $ separator and decimal escapes: unambiguous

So, yes, the change proposed in this PR re-introduces an ambiguity in name mangling that the switch to . as a separator managed to avoid. However, the ambiguity happens only if bytes above 0x80 are used, e.g. in the past with Latin-1 characters in identifiers, and in the future with UTF8-encoded Unicode characters in identifiers (#12664, https://github.com/whitequark/ocaml-m17n/ ). To be future-proof, a change of encoding of special characters should be considered, from $xx (two hex digits) to e.g. $ddd (three decimal digits).

xavierleroy avatar Apr 03 '24 17:04 xavierleroy

To be future-proof, a change of encoding of special characters should be considered, from $xx (two hex digits) to e.g. $ddd (three decimal digits).

@xavierleroy would you want to see that escaping change made in this PR? It's not clear how large a change that is, I am currently looking at the code.

@copy I wasn't aware of that project, thank you for the link. Not having used it did that project break with the change from __ to . separators? Would it be sufficient to contribute a fix to the demangler there if this change is accepted? If we get this change right (including accomodating Unicode identifiers) then it shouldn't need to change again.

tmcgilchrist avatar Apr 04 '24 05:04 tmcgilchrist

Not having used it did that project break with the change from __ to . separators?

No, the . was already printed as-is in the original implementation.

Would it be sufficient to contribute a fix to the demangler there if this change is accepted?

Yes, that would be appreciated.

FWIW, I tested perf report with this change. Non-ascii characters are printed as <XX> (e.g. Stdlib.Map<AD>d_436), so at least the terminal isn't messed up.

To be future-proof, a change of encoding of special characters should be considered, from $xx (two hex digits) to e.g. $ddd (three decimal digits).

One thing to keep in mind is that demanglers may want to keep support for older versions of OCaml. I wonder if overloading $ this way unnecessarily makes the demangling algorithm more complicated that necessary.

copy avatar Apr 04 '24 09:04 copy

73c979d introduces octal escaping. For this small example program:

(* fib.ml *)
let rec fib n =
  if n = 0 then 0
  else if n = 1 then 1
  else fib (n-1) + fib (n-2)

(*
   With octal escaping this becomes "_camlFib$$136$052_272"
   With hexidecimal escaping this was "_camlFib$$5e$2a_272"
*)

let (^*) a r = Printf.printf "fib(%d) = %d" a r

let main () =
  let a = 10 in
  let r = fib a in
  a ^* r

let () = main ()

Not sure if there are other places that need updating?

tmcgilchrist avatar Apr 04 '24 20:04 tmcgilchrist

73c979d introduces octal escaping.

Are you using a 36-bit computer? Because otherwise there is no reason whatsoever to use octal. OCaml uses decimal and hexadecimal for character escapes.

xavierleroy avatar Apr 05 '24 07:04 xavierleroy

a36d62399a actually does decimal escaping (leaving the 36bit computer era behind 😀 ). For the same example program, with hexidecimal escaping this was _camlFib$$5e$2a_272 this now becomes _camlFib$$094$042_272.

tmcgilchrist avatar Apr 08 '24 00:04 tmcgilchrist

a36d62399a actually does decimal escaping (leaving the 36bit computer era behind 😀 ).

I'm reassured :-)

xavierleroy avatar Apr 08 '24 18:04 xavierleroy

Tim McGilchrist (2024/04/07 17:13 -0700):

a36d62399a actually does decimal escaping (leaving the 36bit computer era behind 😀 ). For the same example program, with hexidecimal escaping this was _camlFib$$5e$2a_272 this now becomes _camlFib$$094$042_272.

Can you please squash the two commits?

shindere avatar Apr 22 '24 18:04 shindere

Fabian (2024/04/02 02:59 -0700):

Note that this will break existing demangling implementations, producing non-ascii characters (due to the use of $xx encoding).

Apologies for the naïve question: are there that many demanglers around?

I also do understand that none of the approaches will be ideal and still we will have to choose one, or to make the mangling configurable but I am not sure that's something we want to do.

shindere avatar Apr 22 '24 18:04 shindere

Fabian (2024/04/02 23:17 -0700):

While uniformity is nice, this causes a fair amount of breakage and makes demangling more complicated. And . seems to work fine on Linux (and maybe future version of macOS too).

If a demangler knows that it is demangling an OCaml program, am I correct that it can also adapt its demangling algorithm to the version of the executable? Can we perhapssubmit patches that will be able to cope with different separators we used?

shindere avatar Apr 22 '24 18:04 shindere

Thinking about this further: am I correct that it should be possible to patch the demanglers so that they can (roughly) accomodate the two mangling schemes? Based on that a name can't contain both a . and a $, I'd expect it to be possible to figure out which mangling scheme has been used and thus know how to demangle?

shindere avatar Apr 22 '24 20:04 shindere

Apologies for the naïve question: are there that many demanglers around?

I'm not aware of any other demanglers except the ones already mentioned in this thread.

I also do understand that none of the approaches will be ideal and still we will have to choose one, or to make the mangling configurable but I am not sure that's something we want to do.

My suggestion would be to keep using . on Linux, report the llvm issue upstream (note that . is already handled correctly in lldb on Linux), and eventually switch back to . on macOS.

Thinking about this further: am I correct that it should be possible to patch the demanglers so that they can (roughly) accomodate the two mangling schemes? Based on that a name can't contain both a . and a $, I'd expect it to be possible to figure out which mangling scheme has been used and thus know how to demangle?

4.14 and 5.0 symbols don't contain ., but use the old mangling scheme. The following may be sufficient:

  • If symbol contains .: Hex escapes (5.1)
  • If symbol doesn't contain $ or character after first $ matches [0-9]: Hex escapes (4.14/5.0)
  • Otherwise: Decimal escapes, except for the first $ (this PR)

But I prefer my suggestion above, for the following reasons:

  • It doesn't break existing demanglers, which might take a while to get updated
  • It keeps the demangling algorithm simple, which is implemented in C in perf
  • It doesn't overload $xxx in symbols between versions of OCaml
  • . is a little more readable in mangled symbols
  • . can later be used to separate submodules

Anyway, it's not a hill I will die on.

copy avatar Apr 23 '24 07:04 copy

The Rust mangling system, I think wisely, includes a mangling convention version number in every mangled identifier. If we don't do that now, we should consider it the next time we look at name mangling. Maybe one day we will end up with something like _O0$Mod$Submod$fun$line$uniq (most languages seem to go for _<L>... where L identifies the language). In any case, a demangler should be careful about false positives, as it may be mistaken or misinformed about the source language of any given symbol, or out-of-date with respect to mangling conventions in any given source language. As @shindere says, I think it's fairly easy to accommodate both mangling conventions in a single demangler (which is particularly good as this change brings all platforms into line: any demangler which can't handle $ would currently be broken on MSVC).

NickBarnes avatar May 14 '24 12:05 NickBarnes

My suggestion would be to keep using . on Linux, report the llvm issue upstream (note that . is already handled correctly in lldb on Linux), and eventually switch back to . on macOS.

In my experience, reporting this kind of bug to Apple is just a waste of time...

damiendoligez avatar May 15 '24 13:05 damiendoligez

@xavierleroy Thank you for the ambiguity code and suggestions for additional improvements. I have taken the opportunity to add you as a contributor to this fix (if you prefer otherwise please let me know).

In my experience, reporting this kind of bug to Apple is just a waste of time...

My perspective on this is, I need to work with the functionality available from lldb and Apple. LLDB doesn't currently support breakpoints with . in them and even when/if a fix is available that won't necessarily flow down to older macOS versions. I've also had reports of GDB not working with setting breakpoints with . in them.

tmcgilchrist avatar May 24 '24 01:05 tmcgilchrist

I wonder if the Changes entry shouldn't start with a * to mark the PR as a breaking change, rather than with the currnet - which marks it as non-breaking?

shindere avatar Jun 18 '24 13:06 shindere

@shindere good catch, I have changed it to the breaking change version.

tmcgilchrist avatar Jun 19 '24 10:06 tmcgilchrist

Tim McGilchrist (2024/06/19 03:27 -0700):

@shindere good catch, I have changed it to the breaking change version.

Cool! Thanks!

shindere avatar Jun 19 '24 10:06 shindere

@xavierleroy there are two approvals on this but nothing from a core maintainer. Are you comfortable approving this change so it can be merged?

tmcgilchrist avatar Jun 20 '24 10:06 tmcgilchrist

@gasche made a suggestion (perhaps privately) to use $$ as separator and $xx (two hex digits) as escapes, one of the reasons being that it requires fewer changes in the existing perf symbol demangler than what's currently in this PR. (The escape decoding logic remains the same; it just has to map $$ to ..) I think this should be considered before merging anything.

xavierleroy avatar Jun 20 '24 12:06 xavierleroy

(The suggestion was in https://github.com/ocaml/ocaml/issues/12933#issuecomment-2133633119, and @mshinwell proposed a variant in https://github.com/ocaml/ocaml/issues/12933#issuecomment-2151581072 where $ is used as the modular separator and $$ as the escaping symbol, to optimize the common case where we don't need any character escapes. I'm fine with either suggestions! I believe that @tmcgilchrist prototyped at least one of them.)

gasche avatar Jun 20 '24 12:06 gasche

I have an implementation of each proposal:

  1. Use '$' instead of '.' to separate module from identifiers in symbols with $ plus 3 digit decimal escaping. ~~This PR.~~ Now available on https://github.com/tmcgilchrist/ocaml/tree/name_mangling_dollar_decimal_escape
  2. Use '$' instead of '.' with $$ hex escaping https://github.com/tmcgilchrist/ocaml/tree/name_mangling_mod_dollar_dollar
  3. Use '$$' instead of '.' with no changes to escaping ($ hex escaping) https://github.com/tmcgilchrist/ocaml/tree/name_mangling_dollar_dollar

With option 2 being the suggestion from @xavierleroy with @mshinwell's supporting comment. In offline discussion with Mark, Nick and myself we are happy with that solution. If there is agreement on option 2 I will update this PR to reflect that, please advise @gasche @xavierleroy

tmcgilchrist avatar Jun 21 '24 12:06 tmcgilchrist

I have a preference for (2) and (3) over (1), but I don't want to be blocking this issue, so if there is overwhelming support for (1) I can look the other way.

gasche avatar Jun 21 '24 13:06 gasche

Option (2) seems satisfactory to all expressed reviews as far as we could judge in last Wednesday’s developer meeting. In light of that, would it be relevant to modify this PR’s patch to be the patch for option (2)?

OlivierNicole avatar Jul 01 '24 08:07 OlivierNicole

As I observed in my comment https://github.com/ocaml/ocaml/pull/13050#issuecomment-2180587728, option (3) requires fewer changes in the existing perf symbol demangler than the other options.

xavierleroy avatar Jul 04 '24 16:07 xavierleroy

I have updated this PR inline with option (2) and updated https://github.com/ocaml/ocaml/pull/13050#issuecomment-2182651461 with pointers to all the possible implementations.

@xavierleroy regarding Linux perf name mangling:

  1. 4.14/5.0 is handled with the existing perf code ✅
  2. 5.1/5.2 also works due to perf replacing __ with . in it's implementation demangle-ocaml.c
  3. Either option 2 or 3 will need changes (I'm testing them out now), they both seem minimal enough that I don't see choosing 3 over 2 as a compelling argument.

Additionally I would like to improve the name mangling further so that demangled names can be produced from the mangled name with no extra information see which isn't currently possible.

tmcgilchrist avatar Jul 05 '24 00:07 tmcgilchrist