tracy icon indicating copy to clipboard operation
tracy copied to clipboard

call stacks: dladdr() doesn't work with -fvisibility=hidden symbols

Open tycho opened this issue 3 years ago • 11 comments

The call stack functionality on Linux tries to use dladdr() to resolve symbol names. For shared libraries built with -fvisibility=hidden, dladdr() cannot resolve the symbol names that aren't explicitly marked with the default visibility, even though debugging information may be present in the library and make it possible to resolve the symbol names (e.g. DWARF4 info).

It would be really nice if Tracy could optionally fall back to using libbfd/libdwarf (or something better/faster) to resolve hidden symbols like these, so that profiling would still give meaningful call stacks.

Additionally, when dladdr() cannot find a symbol name, Tracy still records symoff as pc - dlinfo.dli_saddr. But if the symbol name couldn't be resolved, then dli_saddr will be 0, so it will basically just record the pc address instead -- which isn't really that useful after the program has exited. A better option, if dlinfo.dli_sname is NULL, would be to capture pc - dlinfo.dli_fbase in hex (i.e. offset from the base address of the shared object). At least with that information, using something like addr2line <libname> <addr> from binutils could be used to figure out what the unidentified symbols are.

tycho avatar Jun 27 '22 19:06 tycho

Which specific code path are you seeing problems with?

wolfpld avatar Jun 27 '22 20:06 wolfpld

This is the patch I've currently got locally to handle showing the more useful hex offsets I mentioned. The dladdr() on the first line of context for it can't resolve the hidden symbol names:

diff --git a/client/TracyCallstack.cpp b/client/TracyCallstack.cpp
index c59c59d6..cead5449 100644
--- a/client/TracyCallstack.cpp
+++ b/client/TracyCallstack.cpp
@@ -878,8 +878,12 @@ static int CallstackDataCb( void* /*data*/, uintptr_t pc, uintptr_t lowaddr, con
         if( dladdr( vptr, &dlinfo ) )
         {
             symname = dlinfo.dli_sname;
-            symoff = (char*)pc - (char*)dlinfo.dli_saddr;
-            if( ___tracy_demangle( symname, demangled, DemangleBufLen ) ) symname = demangled;
+            if (symname)
+                symoff = (char*)pc - (char*)dlinfo.dli_saddr;
+            else
+                symoff = (char*)pc - (char*)dlinfo.dli_fbase;
+            if( ___tracy_demangle( symname, demangled, DemangleBufLen ) )
+                symname = demangled;
         }
 
         if( !symname ) symname = "[unknown]";
@@ -891,7 +895,7 @@ static int CallstackDataCb( void* /*data*/, uintptr_t pc, uintptr_t lowaddr, con
         else
         {
             char buf[32];
-            const auto offlen = sprintf( buf, " + %td", symoff );
+            const auto offlen = sprintf( buf, " + 0x%tx", symoff );
             const auto namelen = strlen( symname );
             auto name = (char*)tracy_malloc_fast( namelen + offlen + 1 );
             memcpy( name, symname, namelen );

I tried doing a bunch of tricks to try and get dladdr() to give me more than the base address and name of the module, but the only way I found to get it to resolve symbols was by rebuilding the dependency libraries without -fvisibility=hidden (which naturally had the side effect of making them bigger, which is probably fine for just profiling locally but it is a bit of a hassle to do each time I want to run Tracy).

tycho avatar Jun 27 '22 20:06 tycho

The dladdr call here is already a fallback, executed only if libbacktrace fails to find both the function and file name.

wolfpld avatar Jun 27 '22 20:06 wolfpld

Interesting. The state it ends up in, it has the module name and module's address, but no symbol name or symbol address. So if the module name came back as NULL it would do something more useful?

tycho avatar Jun 27 '22 20:06 tycho

Symbol handling is meant to be performed by libbacktrace, which theoretically is maintaned within the gcc tree, to support things such as new DWARF versions, debug data compression, symbol data download from debuginfo server, etc. If this fails, the dladdr path is there only to try to display something more informative than a raw PC.

wolfpld avatar Jun 27 '22 20:06 wolfpld

Oh, so dladdr() is the fallback path? I guess I need to dig further back to figure out how libbacktrace is getting upset.

(Oh, I misread earlier "already has a fallback" rather than "is already a fallback", hence my confusion)

tycho avatar Jun 27 '22 20:06 tycho

It looks like the way it's getting to this point is this:

https://github.com/wolfpld/tracy/blob/1f43cfd2b95feb46125637f26219ffc2c29f6ed3/libbacktrace/dwarf.cpp#L4281-L4283

So I am guessing it is only able to get symbols from libraries that are listed as needed in the main program's ELF binary?

tycho avatar Jun 27 '22 23:06 tycho

It looks like Tracy would need to scan /proc/$pid/maps to find new libraries opened via dlopen(), and rebuild the DWARF index from that.

One ugly part of that is: when do you re-scan? Maybe it's actually cheaper than I am imagining and just re-scanning /proc/$pid/maps can happen every time you fetch a call stack. You'd probably only actually need to re-index DWARF information when the list of file-mapped executable ranges change. But even scanning the maps file seems like it'll be costly especially given the relatively high sampling rate of Tracy. One option would be to just re-scan the maps file on demand, e.g. whenever you unexpectedly hit dwarf.cpp line 4283, but then you have the problem of stale mappings. What if the library we saw mapped there was unloaded and another was loaded in its place? If we operate using the old (wrong) mappings, we'd just give back garbage symbols?

Looking around, it seems like most libraries that generate backtraces are operating on post-mortem crash dumps, and don't have to deal with changed memory mappings or deal with performance considerations.

tycho avatar Jun 28 '22 00:06 tycho

Looks like libbacktrace already is using dl_iterate_phdr to walk the mapped libraries. I found that if I LD_PRELOAD the libraries I will eventually dlopen(), Tracy can find all the symbols. So I wonder if there's some first-time initialization stuff that only happens on the first symbol lookup that needs to be invalidated in order for it to find them.

tycho avatar Jun 28 '22 01:06 tycho

OK, so it only does the full fileline_initialize() once, and if not all the app's libraries have loaded by the time that happens, it misses them and never refreshes. Ideally there would be a backtrace_free_state() or something to allow re-creating the state from scratch but the libbacktrace seems to think the contents of the state should just stick around forever and never change:

   Calling this function allocates resources that cannot be freed.                           
   There is no backtrace_free_state function.  The state is used to                          
   cache information that is expensive to recompute.  Programs are                           
   expected to call this function at most once and to save the return                        
   value for all later calls to backtrace functions.  */                                     
                                                                                             
extern struct backtrace_state *backtrace_create_state (           
...

So yeah, we need to probably walk dl_iterate_phdr periodically, check for differences, have a way to blow away the backtrace_state and re-initialize.

tycho avatar Jun 28 '22 02:06 tycho

This proof-of-concept seems to work, with some big caveats:

https://gist.github.com/tycho/83293355f0b3c5dd6e797d36e7a2ba28

It doesn't consider Mach-O binaries and It leaks the old backtrace_state. It also kind of sucks to expose an additional API instead of making it more transparent (like the initialize functions are). But it does functionally work, and I'm able to get symbols for late-loading libraries consistently. So that's a plus. It just needs a lot of polish.

tycho avatar Jun 28 '22 04:06 tycho

Probably fixed by #674.

wolfpld avatar Mar 23 '24 11:03 wolfpld