Use '$' instead of '.' to separate module from identifiers in symbols
Currently, OCaml's native code compiler (ocamlopt) mangles names using the following format camlModule.Submodule.itentifier_stamp. This scheme causes issues with LLDB on macOS(#12933) breaking the ability to set breakpoints and needs to be worked around in the MSVC port (#12640). Changing to $ everywhere provides a consistent naming scheme across platforms, supports all targeted assemblers and debuggers (GDB/LLDB). See https://github.com/ocaml/ocaml/issues/12933#issuecomment-1986260857
There are five places that are impacted by this change:
- Generated OCaml names become
camlModule$Submodule$identifier_stamp - OCaml built in C identifiers
_caml_functiondo not change eg _caml_alloc - Backend specific names like caml_hot$code_end and caml_system$frametable change to use
$. This impacts any users of function sections (#8526) who will need to update linker scripts. caml_system__code_beginandcaml_system__code_endstill use double underscores because they are referenced from C.- Linux perf's OCaml demangling support needs updating ( See background discussion)
This change fixes the issue with setting breakpoints in LLDB on macOS (#12933), keeps the original fixes (#8998, #11321) from #11430.
Note that this will break existing demangling implementations, producing non-ascii characters (due to the use of $xx encoding).
At least on Linux, I'd suggest keeping the existing . separator (in my testing, setting breakpoints in symbols containing . works in both lldb and gdb on Linux).
Note that this will break existing demangling implementations, producing non-ascii characters (due to the use of $xx encoding).
Could you provide more detail about which demangling implementations you mean? I am aware that Linux perf needs updating.
The intention with this change is to have a common demangling scheme across operating systems, $ was chosen because it supports Windows MSVC, macOS LLDB and Linux (gdb and lldb).
Could you provide more detail about which demangling implementations you mean? I am aware that Linux perf needs updating.
Here is another one: https://github.com/mstange/samply/blob/bb0dbdf13f10dc22036ef2dc4ead35b9647717d6/samply-symbols/src/demangle_ocaml.rs I'd assume all demangling implementations are broken.
The intention with this change is to have a common demangling scheme across operating systems, $ was chosen because it supports Windows MSVC, macOS LLDB and Linux (gdb and lldb).
While uniformity is nice, this causes a fair amount of breakage and makes demangling more complicated. And . seems to work fine on Linux (and maybe future version of macOS too).
I see the hot potato was assigned to me...
With @let-def's advice and Wikipedia's wisdom (https://en.m.wikipedia.org/wiki/Unambiguous_finite_automaton), I coded an automaton-based check for ambiguities in name mangling schemas: https://gist.github.com/xavierleroy/ace2f70d664d738970aa3034e0751d3d . The verdict is as follows:
Encoding with . separator : unambiguous
Encoding with $ separator : ambiguous, example: Z$ff$ff
Encoding with $ separator and ASCII chars only : unambiguous
Encoding with $ separator and decimal escapes: unambiguous
So, yes, the change proposed in this PR re-introduces an ambiguity in name mangling that the switch to . as a separator managed to avoid. However, the ambiguity happens only if bytes above 0x80 are used, e.g. in the past with Latin-1 characters in identifiers, and in the future with UTF8-encoded Unicode characters in identifiers (#12664, https://github.com/whitequark/ocaml-m17n/ ). To be future-proof, a change of encoding of special characters should be considered, from $xx (two hex digits) to e.g. $ddd (three decimal digits).
To be future-proof, a change of encoding of special characters should be considered, from $xx (two hex digits) to e.g. $ddd (three decimal digits).
@xavierleroy would you want to see that escaping change made in this PR? It's not clear how large a change that is, I am currently looking at the code.
@copy I wasn't aware of that project, thank you for the link. Not having used it did that project break with the change from __ to . separators? Would it be sufficient to contribute a fix to the demangler there if this change is accepted? If we get this change right (including accomodating Unicode identifiers) then it shouldn't need to change again.
Not having used it did that project break with the change from
__to.separators?
No, the . was already printed as-is in the original implementation.
Would it be sufficient to contribute a fix to the demangler there if this change is accepted?
Yes, that would be appreciated.
FWIW, I tested perf report with this change. Non-ascii characters are printed as <XX> (e.g. Stdlib.Map<AD>d_436), so at least the terminal isn't messed up.
To be future-proof, a change of encoding of special characters should be considered, from $xx (two hex digits) to e.g. $ddd (three decimal digits).
One thing to keep in mind is that demanglers may want to keep support for older versions of OCaml. I wonder if overloading $ this way unnecessarily makes the demangling algorithm more complicated that necessary.
73c979d introduces octal escaping. For this small example program:
(* fib.ml *)
let rec fib n =
if n = 0 then 0
else if n = 1 then 1
else fib (n-1) + fib (n-2)
(*
With octal escaping this becomes "_camlFib$$136$052_272"
With hexidecimal escaping this was "_camlFib$$5e$2a_272"
*)
let (^*) a r = Printf.printf "fib(%d) = %d" a r
let main () =
let a = 10 in
let r = fib a in
a ^* r
let () = main ()
Not sure if there are other places that need updating?
73c979d introduces octal escaping.
Are you using a 36-bit computer? Because otherwise there is no reason whatsoever to use octal. OCaml uses decimal and hexadecimal for character escapes.
a36d62399a actually does decimal escaping (leaving the 36bit computer era behind 😀 ). For the same example program, with hexidecimal escaping this was _camlFib$$5e$2a_272 this now becomes _camlFib$$094$042_272.
a36d62399a actually does decimal escaping (leaving the 36bit computer era behind 😀 ).
I'm reassured :-)
Tim McGilchrist (2024/04/07 17:13 -0700):
a36d62399a actually does decimal escaping (leaving the 36bit computer era behind 😀 ). For the same example program, with hexidecimal escaping this was
_camlFib$$5e$2a_272this now becomes_camlFib$$094$042_272.
Can you please squash the two commits?
Fabian (2024/04/02 02:59 -0700):
Note that this will break existing demangling implementations, producing non-ascii characters (due to the use of
$xxencoding).
Apologies for the naïve question: are there that many demanglers around?
I also do understand that none of the approaches will be ideal and still we will have to choose one, or to make the mangling configurable but I am not sure that's something we want to do.
Fabian (2024/04/02 23:17 -0700):
While uniformity is nice, this causes a fair amount of breakage and makes demangling more complicated. And
.seems to work fine on Linux (and maybe future version of macOS too).
If a demangler knows that it is demangling an OCaml program, am I correct that it can also adapt its demangling algorithm to the version of the executable? Can we perhapssubmit patches that will be able to cope with different separators we used?
Thinking about this further: am I correct that it should be possible to patch the demanglers so that they can (roughly) accomodate the two mangling schemes? Based on that a name can't contain both a . and a $, I'd expect it to be possible to figure out which mangling scheme has been used and thus know how to demangle?
Apologies for the naïve question: are there that many demanglers around?
I'm not aware of any other demanglers except the ones already mentioned in this thread.
I also do understand that none of the approaches will be ideal and still we will have to choose one, or to make the mangling configurable but I am not sure that's something we want to do.
My suggestion would be to keep using . on Linux, report the llvm issue upstream (note that . is already handled correctly in lldb on Linux), and eventually switch back to . on macOS.
Thinking about this further: am I correct that it should be possible to patch the demanglers so that they can (roughly) accomodate the two mangling schemes? Based on that a name can't contain both a . and a $, I'd expect it to be possible to figure out which mangling scheme has been used and thus know how to demangle?
4.14 and 5.0 symbols don't contain ., but use the old mangling scheme. The following may be sufficient:
- If symbol contains
.: Hex escapes (5.1) - If symbol doesn't contain
$or character after first$matches[0-9]: Hex escapes (4.14/5.0) - Otherwise: Decimal escapes, except for the first
$(this PR)
But I prefer my suggestion above, for the following reasons:
- It doesn't break existing demanglers, which might take a while to get updated
- It keeps the demangling algorithm simple, which is implemented in C in
perf - It doesn't overload
$xxxin symbols between versions of OCaml .is a little more readable in mangled symbols.can later be used to separate submodules
Anyway, it's not a hill I will die on.
The Rust mangling system, I think wisely, includes a mangling convention version number in every mangled identifier. If we don't do that now, we should consider it the next time we look at name mangling. Maybe one day we will end up with something like _O0$Mod$Submod$fun$line$uniq (most languages seem to go for _<L>... where L identifies the language). In any case, a demangler should be careful about false positives, as it may be mistaken or misinformed about the source language of any given symbol, or out-of-date with respect to mangling conventions in any given source language. As @shindere says, I think it's fairly easy to accommodate both mangling conventions in a single demangler (which is particularly good as this change brings all platforms into line: any demangler which can't handle $ would currently be broken on MSVC).
My suggestion would be to keep using
.on Linux, report the llvm issue upstream (note that.is already handled correctly in lldb on Linux), and eventually switch back to.on macOS.
In my experience, reporting this kind of bug to Apple is just a waste of time...
@xavierleroy Thank you for the ambiguity code and suggestions for additional improvements. I have taken the opportunity to add you as a contributor to this fix (if you prefer otherwise please let me know).
In my experience, reporting this kind of bug to Apple is just a waste of time...
My perspective on this is, I need to work with the functionality available from lldb and Apple. LLDB doesn't currently support breakpoints with . in them and even when/if a fix is available that won't necessarily flow down to older macOS versions. I've also had reports of GDB not working with setting breakpoints with . in them.
I wonder if the Changes entry shouldn't start with a * to mark the PR as a breaking change, rather than with the currnet - which marks it as non-breaking?
@shindere good catch, I have changed it to the breaking change version.
Tim McGilchrist (2024/06/19 03:27 -0700):
@shindere good catch, I have changed it to the breaking change version.
Cool! Thanks!
@xavierleroy there are two approvals on this but nothing from a core maintainer. Are you comfortable approving this change so it can be merged?
@gasche made a suggestion (perhaps privately) to use $$ as separator and $xx (two hex digits) as escapes, one of the reasons being that it requires fewer changes in the existing perf symbol demangler than what's currently in this PR. (The escape decoding logic remains the same; it just has to map $$ to ..) I think this should be considered before merging anything.
(The suggestion was in https://github.com/ocaml/ocaml/issues/12933#issuecomment-2133633119, and @mshinwell proposed a variant in https://github.com/ocaml/ocaml/issues/12933#issuecomment-2151581072 where $ is used as the modular separator and $$ as the escaping symbol, to optimize the common case where we don't need any character escapes. I'm fine with either suggestions! I believe that @tmcgilchrist prototyped at least one of them.)
I have an implementation of each proposal:
- Use '$' instead of '.' to separate module from identifiers in symbols with $ plus 3 digit decimal escaping. ~~This PR.~~ Now available on https://github.com/tmcgilchrist/ocaml/tree/name_mangling_dollar_decimal_escape
- Use '$' instead of '.' with
$$hex escaping https://github.com/tmcgilchrist/ocaml/tree/name_mangling_mod_dollar_dollar - Use '$$' instead of '.' with no changes to escaping ($ hex escaping) https://github.com/tmcgilchrist/ocaml/tree/name_mangling_dollar_dollar
With option 2 being the suggestion from @xavierleroy with @mshinwell's supporting comment. In offline discussion with Mark, Nick and myself we are happy with that solution. If there is agreement on option 2 I will update this PR to reflect that, please advise @gasche @xavierleroy
I have a preference for (2) and (3) over (1), but I don't want to be blocking this issue, so if there is overwhelming support for (1) I can look the other way.
Option (2) seems satisfactory to all expressed reviews as far as we could judge in last Wednesday’s developer meeting. In light of that, would it be relevant to modify this PR’s patch to be the patch for option (2)?
As I observed in my comment https://github.com/ocaml/ocaml/pull/13050#issuecomment-2180587728, option (3) requires fewer changes in the existing perf symbol demangler than the other options.
I have updated this PR inline with option (2) and updated https://github.com/ocaml/ocaml/pull/13050#issuecomment-2182651461 with pointers to all the possible implementations.
@xavierleroy regarding Linux perf name mangling:
- 4.14/5.0 is handled with the existing perf code ✅
- 5.1/5.2 also works due to perf replacing
__with.in it's implementation demangle-ocaml.c ✅ - Either option 2 or 3 will need changes (I'm testing them out now), they both seem minimal enough that I don't see choosing 3 over 2 as a compelling argument.
Additionally I would like to improve the name mangling further so that demangled names can be produced from the mangled name with no extra information see which isn't currently possible.