magic-trace icon indicating copy to clipboard operation
magic-trace copied to clipboard

Broken stack (with "1 instruction trace errors" warning) on non-tailrecursive code

Open copy opened this issue 3 years ago • 5 comments

For example, running magic-trace run -full-execution on this code:

let rec go dir =
  if (Unix.lstat dir).st_kind = Unix.S_DIR then
    let rec handle_dir d =
      match Unix.readdir d with
      | exception End_of_file -> Unix.closedir d; ()
      | "." | ".." -> handle_dir d
      | subdir -> go (Filename.concat dir subdir); handle_dir d
    in
    handle_dir (Unix.opendir dir)
  else
    ()

(* a large directory structure, for example the ocaml source code checked out via git *)
let dir = "repositories/ocaml"
let () = ignore (go dir)

Prints Warning: 1 instruction trace errors and shows a cut-off stack as shown below (if a bigger directory is chosen, it happens more than once):

Screenshot_2022-07-13_21-06-00

The issue happens in similar code that doesn't use exceptions, so I don't think irregular control flow is the issue. Just for reference, this is in a 4.14 flambda switch and the executable is built with dune --profile=release.

copy avatar Jul 13 '22 12:07 copy

First of all, thanks for reporting this. A broken stack given only one trace error certainly sounds like it could be tickling a bug. I'll dig into this properly this weekend.

In case you just want a usable trace, there are two remedies for trace errors that tend to work for most people:

  • Try -timer-resolution Low. Your timestamps will be less accurate (there is only a timer update every 1us-ish). But, in exchange, you will be far less likely to see trace errors.
  • Switch to a later generation Intel CPU. Intel has been actively improving IPT, and each generation tends to be much better than the previous one about this.

Out of curiosity, what CPU are you using?

cgaebel avatar Jul 13 '22 14:07 cgaebel

Thanks! -timer-resolution Low doesn't seem to make much of a difference. It looks like the first broken stack happens later, but after that a similar breakage happens. If you can't reproduce it with my example above, try pointing it at a larger directory. The number of trace errors goes up the deeper the stack.

Out of curiosity, what CPU are you using?

Intel i9-9900KS

copy avatar Jul 13 '22 14:07 copy

I find that on my machine (Intel i5-1145G7) running magic-trace with the environment variable MAGIC_TRACE_NO_DLFILTER=1 seems to help. There are still lots of decode errors and exceptional control flow (not in your code, but in core startup code when I adapted your example to use Core_unix), but those are known problems.

For fellow magic-trace devs:

It looks like the dlfilter is eating trace errors. Here's a relevant excerpt when running without dlfilter:

1539481/1539481 2418006.848063682:          1 branches:uH:   call                       556ff0af70b8 unix_inet_addr_of_string+0x38 (/tmp/issue_246) =>     7fc9ca8563e0 inet_pton+0x0 (/usr/lib/libc.so.6)
1539481/1539481 2418006.848063890:          1 branches:uH:   tr end                     7fc9ca8563e0 inet_pton+0x0 (/usr/lib/libc.so.6) =>                0 [unknown] ([unknown])
1539481/1539481 2418006.848065765:          1 branches:uH:   tr strt                               0 [unknown] ([unknown]) =>     7fc9ca8563e0 inet_pton+0x0 (/usr/lib/libc.so.6)
1539481/1539481 2418006.848065765:          1 branches:uH:   call                       7fc9ca8563f3 inet_pton+0x13 (/usr/lib/libc.so.6) =>     7fc9ca74c2b0 [unknown] (/usr/lib/libc.so.6)
1539481/1539481 2418006.848065765:          1 branches:uH:   jmp                        7fc9ca74c2b4 [unknown] (/usr/lib/libc.so.6) =>     7fc9ca8964c0 [unknown] (/usr/lib/libc.so.6)
1539481/1539481 2418006.848065765:          1 branches:uH:   return                     7fc9ca8964ef [unknown] (/usr/lib/libc.so.6) =>     7fc9ca8563f8 inet_pton+0x18 (/usr/lib/libc.so.6)
1539481/1539481 2418006.848065765:          1 branches:uH:   jcc                        7fc9ca8563fb inet_pton+0x1b (/usr/lib/libc.so.6) =>     7fc9ca856420 inet_pton+0x40 (/usr/lib/libc.so.6)
1539481/1539481 2418006.848065765:          1 branches:uH:   jmp                        7fc9ca85642e inet_pton+0x4e (/usr/lib/libc.so.6) =>     7fc9ca856090 [unknown] (/usr/lib/libc.so.6)
1539481/1539481 2418006.848065973:          1 branches:uH:   jmp                        7fc9ca8560c3 [unknown] (/usr/lib/libc.so.6) =>     7fc9ca8560f8 [unknown] (/usr/lib/libc.so.6)
1539481/1539481 2418006.848065973:          1 branches:uH:   jcc                        7fc9ca856105 [unknown] (/usr/lib/libc.so.6) =>     7fc9ca8560c8 [unknown] (/usr/lib/libc.so.6)
1539481/1539481 2418006.848065973:          1 branches:uH:   jcc                        7fc9ca856126 [unknown] (/usr/lib/libc.so.6) =>     7fc9ca8560f8 [unknown] (/usr/lib/libc.so.6)
1539481/1539481 2418006.848065973:          1 branches:uH:   jcc                        7fc9ca856105 [unknown] (/usr/lib/libc.so.6) =>     7fc9ca8560c8 [unknown] (/usr/lib/libc.so.6)
1539481/1539481 2418006.848065973:          1 branches:uH:   jcc                        7fc9ca856126 [unknown] (/usr/lib/libc.so.6) =>     7fc9ca8560f8 [unknown] (/usr/lib/libc.so.6)
1539481/1539481 2418006.848065973:          1 branches:uH:   jcc                        7fc9ca856105 [unknown] (/usr/lib/libc.so.6) =>     7fc9ca8560c8 [unknown] (/usr/lib/libc.so.6)
1539481/1539481 2418006.848065973:          1 branches:uH:   jcc                        7fc9ca856126 [unknown] (/usr/lib/libc.so.6) =>     7fc9ca8560f8 [unknown] (/usr/lib/libc.so.6)
1539481/1539481 2418006.848065973:          1 branches:uH:   jcc                        7fc9ca856105 [unknown] (/usr/lib/libc.so.6) =>     7fc9ca8560c8 [unknown] (/usr/lib/libc.so.6)
1539481/1539481 2418006.848065973:          1 branches:uH:   jcc                        7fc9ca8560f6 [unknown] (/usr/lib/libc.so.6) =>     7fc9ca856128 [unknown] (/usr/lib/libc.so.6)
1539481/1539481 2418006.848065973:          1 branches:uH:   jmp                        7fc9ca856139 [unknown] (/usr/lib/libc.so.6) =>     7fc9ca85615a [unknown] (/usr/lib/libc.so.6)
1539481/1539481 2418006.848065973:          1 branches:uH:   return                     7fc9ca85616e [unknown] (/usr/lib/libc.so.6) =>     556ff0af70be unix_inet_addr_of_string+0x3e (/tmp/issue_246)
1539481/1539481 2418006.848065973:          1 branches:uH:   jcc                        556ff0af70c0 unix_inet_addr_of_string+0x40 (/tmp/issue_246) =>     556ff0af7100 unix_inet_addr_of_string+0x80 (/tmp/issue_246)
1539481/1539481 2418006.848065973:          1 branches:uH:   call                       556ff0af7103 unix_inet_addr_of_string+0x83 (/tmp/issue_246) =>     556ff0afc0b0 alloc_inet_addr+0x0 (/tmp/issue_246)

And here it is with the dlfilter:

1539481/1539481 2418006.848063682:          1 branches:uH:   call                       556ff0af70b8 unix_inet_addr_of_string+0x38 (/tmp/issue_246) =>     7fc9ca8563e0 inet_pton+0x0 (/usr/lib/libc.so.6)
1539481/1539481 2418006.848065973:          1 branches:uH:   call                       556ff0af7103 unix_inet_addr_of_string+0x83 (/tmp/issue_246) =>     556ff0afc0b0 alloc_inet_addr+0x0 (/tmp/issue_246)

Both these excerpts are from the same spot in the same trace file.

cgaebel avatar Aug 06 '22 18:08 cgaebel

Hmm we only added dlfilter recently, so it's just in master. Were you experiencing this issue using the latest release of magic-trace?

cgaebel avatar Aug 06 '22 18:08 cgaebel

To be clear, the trace I'm seeing without dlfilter doesn't look very good. But all the brokenness I see is due to decode errors, and only Intel can do anything about those. When a big tower of stackframes ends abruptly, see if there's a little triangle at the bottom. If you click that and it says [Decode error: Overflow error], that's a decode error.

cgaebel avatar Aug 06 '22 18:08 cgaebel