Add source location to task and tasktrace object
Adds source location of next resume point to task objects. When constructing tasktrace object, source location of resume points will be added to it. When printing, the source location of next resume point will be printed as well, hopefully improving debug experience.
Example:
0x2b3899 /home/y/work/scylladb-test/build/Debug/seastar/libseastar.so+0x454c364 ...
--------
test_app/main.cpp:15:15
seastar::continuation<seastar::internal::promise_base_with_type<void>, ...
Partially fixes https://github.com/scylladb/seastar/issues/2381
Aren't those addresses derivable from the stack trace? Or are those points lost?
How large is std::source_location?
Aren't those addresses derivable from the stack trace? Or are those points lost?
Probably with a little bit of magic. You'd still have to store a pointer somewhere, because stacktrace is valid for an actually running task, while my patch adds "next instruction" to tasks, which are suspended.
How large is std::source_location?
8 bytes.
Aren't those addresses derivable from the stack trace? Or are those points lost?
Probably with a little bit of magic. You'd still have to store a pointer somewhere, because stacktrace is valid for an actually running task, while my patch adds "next instruction" to tasks, which are suspended.
How large is std::source_location?
8 bytes.
Maybe we can tolerate it. @tgrabiec what's your opinion?
Please run scylladb's perf-simple-query and report insn-per-op before and after (although we won't learn of the impact of the size change).
my patch
instructions_per_op: mean=48465.87 standard-deviation=33.03 median=48459.42 median-absolute-deviation=22.63 maximum=48520.81 minimum=48437.81
instructions_per_op: mean=48482.90 standard-deviation=21.16 median=48487.16 median-absolute-deviation=6.97 maximum=48504.13 minimum=48447.34
instructions_per_op: mean=48469.42 standard-deviation=18.34 median=48472.45 median-absolute-deviation=17.26 maximum=48488.53 minimum=48448.81
master
instructions_per_op: mean=48337.41 standard-deviation=21.29 median=48334.08 median-absolute-deviation=19.67 maximum=48364.48 minimum=48317.02
instructions_per_op: mean=48352.36 standard-deviation=27.87 median=48347.35 median-absolute-deviation=16.28 maximum=48395.26 minimum=48322.29
instructions_per_op: mean=48339.83 standard-deviation=26.48 median=48323.12 median-absolute-deviation=20.80 maximum=48372.64 minimum=48319.03
My patch will run more instructions, as there are additional stores to the task object (std::source_location needs to be written).
Why do we measure instructions per operation? We are not billed by instruction count, rather by time, we should measure total running time instead? Does it work as a sanity check or am i missing something?
Why do we measure instructions per operation? We are not billed by instruction count, rather by time, we should measure total running time instead? Does it work as a sanity check or am i missing something?
Time is very unstable (just try it) since it depends on temperature and cooling.
Instructions is a more stable proxy for time (but inaccurate since instructions-per-clock can change).
So we add 0.25% overhead. Hard to judge if it's worthwhile.
Time is very unstable (just try it) since it depends on temperature and cooling.
Agree.
Instructions is a more stable proxy for time (but inaccurate since instructions-per-clock can change).
It's absolutely stable, but we're counting steps, which - in my opinion - sort-of blinds us. It's like two different walks, the same step count, but step sizes differ, so is distance. Here for example if we're running at max IPC, then we've 0.25% overhead. If otherwise we're waiting for memory, then the write is pretty much free (we've already waited for this cache line, because other stuff is read / written there) and proc is waiting for other memory pieces, so it can do our write.
Time is very unstable (just try it) since it depends on temperature and cooling.
Agree.
Instructions is a more stable proxy for time (but inaccurate since instructions-per-clock can change).
It's absolutely stable, but we're counting steps, which - in my opinion - sort-of blinds us. It's like two different walks, the same step count, but step sizes differ, so is distance. Here for example if we're running at max IPC, then we've 0.25% overhead. If otherwise we're waiting for memory, then the write is pretty much free (we've already waited for this cache line, because other stuff is read / written there) and proc is waiting for other memory pieces, so it can do our write.
Writes are more-or-less free, except for the increase in instruction footprint (which are extra reads).
We aren't close to running at max IPC. Typical for mini-benchmarks like perf-simple-query is 2, and for production ~1. The main killer is waiting for instruction fetch (although maybe in production mispredicts would also contribute since the the data is less regular). I'm sure an M4 would give much higher IPC.
if the performance impact could be a concern, can we make it an optional feature which is enabled only in the debug builds?
This has been discussed here (https://github.com/scylladb/seastar/issues/2381) and consensus is that the release is where it's useful the most.
Aren't those addresses derivable from the stack trace? Or are those points lost?
Probably with a little bit of magic. You'd still have to store a pointer somewhere, because stacktrace is valid for an actually running task, while my patch adds "next instruction" to tasks, which are suspended.
How large is std::source_location?
8 bytes.
Maybe we can tolerate it. @tgrabiec what's your opinion?
I think it's probably ok.
@scylladb/seastar-maint please merge.
So we add 0.25% overhead. Hard to judge if it's worthwhile.
@avikivity I think the gain is much higher than 0.25% here. We need to reduce the bar of entry into debugging seastar applications.
@denesb can you tell me, how did you figure out performance differences for my patch (or any patch)? Sub 1% precision looks amazing to me, how did you do that?
@denesb can you tell me, how did you figure out performance differences for my patch (or any patch)? Sub 1% precision looks amazing to me, how did you do that?
I didn't, I just reacted to Avi's comment. I didn't claim that the perf regression is something else, just that 0.25% that Avi claimed is worth it for the gain of seeing line numbers.
@denesb can you tell me, how did you figure out performance differences for my patch (or any patch)? Sub 1% precision looks amazing to me, how did you do that?
We're talking about the perf_simple_query results posted earlier in this thread.
@scylladb/seastar-maint shameless ping - please merge
@scylladb/seastar-maint shameless ping - please merge.
@scylladb/seastar-maint shameless ping - please merge.
@avikivity and @tgrabiec are both maintainers and were involved in this review, I think they should get the honor of deciding to merge it.
Since this is the first time I'm seeing this issue, I have a beginner's question - where are all those fairly big "source location" strings saved in the executable (I'm not talking about the 8-byte pointers, which are also a problem - I'm talking about the actual strings). Can you estimate how much the stripped Scylla executable will grow due to these strings? I don't have any ideas if there are hundreds such strings, or hundreds of thousands.
Since this is the first time I'm seeing this issue, I have a beginner's question - where are all those fairly big "source location" strings saved in the executable (I'm not talking about the 8-byte pointers, which are also a problem - I'm talking about the actual strings). Can you estimate how much the stripped Scylla executable will grow due to these strings? I don't have any ideas if there are hundreds such strings, or hundreds of thousands.
We will get one std::source_location data per suspention point. Assuming we have 1k suspention points in the code base and each in different file (we don't have that many files) and each path is 1k long (my paths are around 100 - 150 bytes) we would get max of 1mb of pure text. This text will be mapped to memory, but not paged in until someone uses, so it should be no issue for anyone but storage guys.
In reality i'd expect below 100kb (100 different files with 200 bytes per path give - 20 kb of text for example).
There's a gratuitous merge commit at the end.
@travisdowns please opine on whether this would be terrible for performance for you, or great for debugging.
There's a gratuitous merge commit at the end.
@travisdowns please opine on whether this would be terrible for performance for you, or great for debugging.
@avikivity If we don't want to pessimize existing applications, we can always hide the stuff added in this patch behind the preprocessor. But please don't just let this patch rot. Let me have it at least in dev mode for starters. I've wasted so much time over the years on tests where the main debugging difficulty was finding out where a continuation came from.
We will get one
std::source_locationdata per suspention point. Assuming we have 1k suspention points in the code base and each in different file (we don't have that many files) and each path is 1k long (my paths are around 100 - 150 bytes) we would get max of 1mb of pure text.
In addition to file names, there are function names, and those (with namespaces, argument type names, etc.) can be even longer and more numerous.
Could you do me a favor and compile Scylla (for example) with and without your patch and see how much bigger the stripped Scylla executable became? Maybe the difference is indeed tiny, I am not suggesting that it isn't - I just don't know.
@nyh sure any particular version? Otherwise i will build dev.
@nyh sure any particular version? Otherwise i will build dev.
No, whatever you want. The idea is to take some "big" Seastar project and see the impact on this change on it, it doesn't matter exactly which version of which build mode (but I wouldn't take "debug" build mode because that already inflates the executable in strange ways).
There's a gratuitous merge commit at the end. @travisdowns please opine on whether this would be terrible for performance for you, or great for debugging.
@avikivity If we don't want to pessimize existing applications, we can always hide the stuff added in this patch behind the preprocessor.
That's the easy way out, but I don't want to take it.
But please don't just let this patch rot. Let me have it at least in dev mode for starters. I've wasted so much time over the years on tests where the main debugging difficulty was finding out where a continuation came from.
That's reasonable. Let's see if we can push it to release too.
@nyh wrote:
where are all those fairly big "source location" strings saved in the executable (I'm not talking about the 8-byte pointers, which are also a problem - I'm talking about the actual strings).
They are stored in .rodata, and the std::source_location is effectively just a single pointer to this read-only data. So it is pretty efficient in the sense that if you don't access the string/line data itself, you will not even bring it into cache. In some way it's more efficient than old-school __FILE__ and __LINE__ passing around, which usually needs an int and a const char * (though you could concat them in the preprocessor).
I stand corrected. scylla binary in dev mode has grown by 1.7 MB (from 125.41 to 127.13), approximately 1.72%.
EDIT: std::source_location contains also function names. Which, given their templated nature, can grown way beyond 200 bytes i assumed.
Your patch is very impressive, and I don't have an opinion whether the 0.3% slowdown and the 1.7% executable size increase is a legitimate price to pay or not, I think other reviewers have had stronger opinions on this so I"ll let them decide.
I just want to note that in the past, the tradition wasn't to include the function names, file names, and line numbers in the executable itself - instead, backtraces used only an instruction pointer and nothing more, and an external tool (addr2line) was used to convert such instruction addresses to source code's filename/function-name/line-number - using "debug information" stored in an unstripped version of the executable. I think your patch isn't the first one that undoes this old tradition - we already know how to print backtraces with symbolic names and not just addresses - it just completes the process. Maybe it's a welcome change, because using addr2line was never very convenient.
Weren't the tradition dictated by a need to be more obscure due to a partially closed source nature of past scyllas?
Weren't the tradition dictated by a need to be more obscure due to a partially closed source nature of past scyllas?
No, I wasn't talking about Scylla's tradition, but in general - addr2line isn't specific to Scylla or closed source. In any case, addr2line does not need the source code - just an executable with extra (or separate) "debugging information" (https://en.wikipedia.org/wiki/DWARF) which helps convert instruction-pointer (a.k.a. PC) addresses to file name, line number and function name. As I said, I agree that storing all this stuff in the executable and allowing the executable to introspect it is certainly more convenient than addr2line, but in the old days it wasn't done because it has a cost, as you noticed. But we can definitely decide that the convenience is worth the small cost.