Improving thread creation performance

Open rocallahan opened this issue 1 year ago • 6 comments

I'm looking at a large application which creates absurd numbers of threads during startup. I'd like to reduce the overhead of this code under rr. Measuring the thread_stress test suggests that creating a thread that does practically nothing but exit takes about 5ms under rr on my machine.
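For anyone who wants a rough number on their own machine, a measurement along these lines should be enough. This is a minimal sketch, not rr's thread_stress test; the function name `mean_thread_cost_us` is made up for illustration. It creates and joins threads that do nothing, one after another, and reports the mean wall-clock cost per thread.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <time.h>

/* Thread body that does practically nothing but exit. */
static void *noop_thread(void *arg) {
    (void)arg;
    return NULL;
}

/* Create and join `count` threads one after another.
 * Returns the mean wall-clock cost per thread in microseconds,
 * or -1.0 if a pthread call fails. (Hypothetical helper.) */
double mean_thread_cost_us(int count) {
    struct timespec start, end;
    assert(count > 0);
    clock_gettime(CLOCK_MONOTONIC, &start);
    for (int i = 0; i < count; i++) {
        pthread_t t;
        if (pthread_create(&t, NULL, noop_thread, NULL) != 0) return -1.0;
        if (pthread_join(t, NULL) != 0) return -1.0;
    }
    clock_gettime(CLOCK_MONOTONIC, &end);
    double elapsed_us = (double)(end.tv_sec - start.tv_sec) * 1e6 +
                        (double)(end.tv_nsec - start.tv_nsec) / 1e3;
    return elapsed_us / count;
}
```

Calling `mean_thread_cost_us(1000)` once natively and once under `rr record` gives a per-thread figure to compare. Older glibc versions may need `-pthread` at link time.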

The key problem seems to be the work we have to do to get syscallbuf set up for each thread. This is basically:

  • 2 round trips to rr for gettid enter/exit (syscallbuf)
  • 2 round trips to rr for perf_event_open enter/exit (syscallbuf)
  • 2 round trips to rr for fcntl dup enter/exit (syscallbuf)
  • 2 round trips to rr for rrcall_init_buffers (syscallbuf)
    • 3 round trips to tracee (recvmsg, mmap, close) for Session::create_shared_mmap
    • 4 round trips to tracee (sendmsg desched fd, recvmsg/dup3/close clonedata fd) for RecordTask::init_buffers_arch

That's 15 round trips, i.e. 30 context switches, in code we control.

rocallahan avatar Jan 02 '23 05:01 rocallahan

I think ideally we'd run a lot of code in the tracee that does non-traced syscalls to set everything up, and then does one traced syscall similar to rrcall_init_buffers that passes all required data to rr. I.e.:

  • Untraced gettid/perf_event_open/fcntl-dup syscalls to set up desched fd correctly
  • Untraced syscalls to create and map the shared syscallbuf mem fd
  • Untraced syscalls to create the clonedata fd
  • Traced rrcall_init_buffers2 providing all 3 fds to rr
    • rr uses pidfd_getfd to acquire its copies of the fds, instead of doing roundtrips to the tracee

That would collapse those 15 roundtrips to just 2.

pidfd_getfd was added in Linux 5.6. On older kernels we can fall back to the "puppet sendmsg" path, which would require 3 extra roundtrips, but that would still be quite good.
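As a rough illustration of the pidfd_getfd idea (a sketch only, not code from rr): the helper name `steal_fd` is hypothetical, and the raw syscall numbers are defined as a fallback on the assumption that the installed headers may predate these syscalls. The caller needs ptrace-attach permission over the target, which a tracer like rr already has. Note that plain `pidfd_open` takes a process ID, so using it for non-leader threads would need further handling that is not shown here.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <unistd.h>

/* Define the syscall numbers only if the headers do not provide them. */
#ifndef SYS_pidfd_open
#define SYS_pidfd_open 434
#endif
#ifndef SYS_pidfd_getfd
#define SYS_pidfd_getfd 438
#endif

/* Duplicate `remote_fd` from process `pid` into the calling process.
 * Returns the new local fd, or -1 with errno set. An errno of ENOSYS
 * means the kernel lacks the syscall, so the caller should fall back
 * to another transfer mechanism (for rr, the sendmsg-based path).
 * (Hypothetical helper for illustration.) */
int steal_fd(pid_t pid, int remote_fd) {
    assert(remote_fd >= 0);
    int pidfd = (int)syscall(SYS_pidfd_open, pid, 0u);
    if (pidfd < 0) return -1;
    int local_fd = (int)syscall(SYS_pidfd_getfd, pidfd, remote_fd, 0u);
    int saved_errno = errno; /* close() may clobber errno */
    close(pidfd);
    errno = saved_errno;
    return local_fd;
}
```

The point of the design is that the tracer obtains its copy of the fd with two syscalls of its own and no context switch into the tracee.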

rocallahan avatar Jan 02 '23 05:01 rocallahan

Of course to get that to work we'd have to arrange for the replay code to match the recording code "closely enough". I need to think about that some more. We might want to use a replay-only syscall at the start of the init sequence to get the values to use as the results of those untraced syscalls.

rocallahan avatar Jan 02 '23 05:01 rocallahan

Another issue is that we'd have to maintain compatibility with existing syscallbuf recordings. Not sure how much extra work that would be.

rocallahan avatar Jan 02 '23 05:01 rocallahan

The approach I'm planning to try: issue a series of privileged, untraced syscalls to set up the desched fd, the clonedata fd, and a MAP_ANONYMOUS | MAP_SHARED mmap for the syscallbuf, then issue a traced rrcall_init_buffers2 with a record containing the fds and the address/size of the syscall buffer. We'll store the desired syscallbuf size and clonedata-enabled flag in the preload_globals. rr can open the syscallbuf and map it using /proc/<tid>/map_files/<start>-<end>, so the tracee never needs an fd for it. rrcall_init_buffers2 will update the parameters buffer, and the tracee will re-fetch the desched fd, clonedata fd and syscallbuf address from that buffer.
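To make the map_files part concrete, here is a small sketch of the two pieces involved: the tracee-side shared anonymous mapping and the path the tracer would open. Both helper names are hypothetical. The path format (lowercase hex addresses, no 0x prefix) follows proc(5). Opening entries under map_files can require extra capabilities depending on kernel version, so this sketch only builds the path and does not open it.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/types.h>

/* Map `size` bytes of shared anonymous memory.
 * Returns NULL on failure. (Hypothetical helper.) */
void *map_shared_buffer(size_t size) {
    void *p = mmap(NULL, size, PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : p;
}

/* Write "/proc/<tid>/map_files/<start>-<end>" into `out`.
 * `start` and `size` must describe the whole mapping, since the entry
 * is named after the mapping's exact bounds.
 * Returns 0 on success, -1 if `out` is too small. (Hypothetical helper.) */
int map_files_path(char *out, size_t out_len, pid_t tid,
                   const void *start, size_t size) {
    assert(out != NULL);
    unsigned long lo = (unsigned long)(uintptr_t)start;
    unsigned long hi = lo + (unsigned long)size;
    int n = snprintf(out, out_len, "/proc/%d/map_files/%lx-%lx",
                     (int)tid, lo, hi);
    return (n < 0 || (size_t)n >= out_len) ? -1 : 0;
}
```

With this, the tracee only has to report the address and size in the rrcall_init_buffers2 record, and the tracer can derive the path itself.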

During replay all the untraced syscalls will be no-ops, but we won't do any conditional branches on their results. The rrcall_init_buffers2 handling in rr will initialize the output buffer appropriately so the correct values are obtained when the tracee refetches fds etc.

This means during replay control flow will match recording, but various data values will not until we reach rrcall_init_buffers2. I'll try to make sure those data values don't leak past that point.
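The "no branches, then refetch" idea can be modelled in a few lines. This is a toy sketch under stated assumptions, not rr's code: the struct layout, the function names, and the use of a callback to stand in for the traced rrcall_init_buffers2 are all invented for illustration. The tracee stores whatever the untraced syscalls returned without inspecting it, makes the traced call, and only uses the values it reads back afterwards.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical parameter record shared between tracee and tracer. */
struct init_buffers_params {
    int desched_fd;
    int cloned_file_data_fd;
    void *syscallbuf_ptr;
    size_t syscallbuf_size;
};

/* Stand-in for the traced call: the tracer may rewrite the record. */
typedef void (*traced_init_fn)(struct init_buffers_params *params);

/* Tracee side: store the raw results without branching on them, make
 * the traced call, then return the refetched values. */
struct init_buffers_params init_thread_buffers(int raw_desched_fd,
                                               int raw_clonedata_fd,
                                               void *raw_buf, size_t buf_size,
                                               traced_init_fn traced_init) {
    assert(traced_init != NULL);
    struct init_buffers_params params;
    params.desched_fd = raw_desched_fd;
    params.cloned_file_data_fd = raw_clonedata_fd;
    params.syscallbuf_ptr = raw_buf;
    params.syscallbuf_size = buf_size;
    traced_init(&params); /* same control flow in record and replay */
    return params;        /* only these refetched values are used */
}

/* Recording stand-in: accept the tracee's values unchanged. */
void record_init(struct init_buffers_params *params) { (void)params; }

/* Replay stand-in: overwrite with made-up "recorded" values. */
static char recorded_buffer[64];
void replay_init(struct init_buffers_params *params) {
    params->desched_fd = 100;
    params->cloned_file_data_fd = 101;
    params->syscallbuf_ptr = recorded_buffer;
    params->syscallbuf_size = sizeof recorded_buffer;
}
```

In the replay case the raw inputs can be garbage (for example -1 and NULL from no-op'd syscalls) and the caller still ends up with the recorded values, which is what keeps the divergent data from leaking past the traced call.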

rocallahan avatar Jan 16 '23 07:01 rocallahan

https://github.com/rr-debugger/rr/tree/threads has code that works --- passes tests, avoids the roundtrips. However on a test with 2000 threads that start and don't exit, we're still pretty slow: the patch improves things from 37.1s to 26.4s. (No-rr baseline is 0.1s.)

Profiling seems to show that we spend a lot of time in find_free_file_descriptor and AddressSpace::find_free_memory. Guess I'll see if I can fix those. If I allow the threads to exit everything is much faster (4.5s with the threads patch) so I think we're just not scaling as the number of syscall buffers and open file descriptors supporting syscallbuf increases.

rocallahan avatar Feb 07 '23 06:02 rocallahan

With those optimized plus another couple of optimizations, the time to create 2000 threads is down to 8s. Basically, when there are thousands of threads, doing anything that is O(number of threads) is a problem. The profile is quite flat now. We call Scheduler::is_task_runnable a lot but it would be a lot of work to make that much better.
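To illustrate why an O(number of threads) step hurts at this scale, here is a toy model of free-descriptor search. It is not rr's find_free_file_descriptor and not necessarily the fix that was applied; the table type and function names are invented. It contrasts scanning from slot 0 on every allocation with remembering where the previous search ended, and counts the slots examined.

```c
#include <assert.h>
#include <stdbool.h>

#define MAX_FDS 4096

/* Toy descriptor table (hypothetical, for illustration only). */
struct fd_table {
    bool used[MAX_FDS];
    int next_hint; /* lowest index that might still be free */
    long probes;   /* number of slots examined so far */
};

/* Naive search: restart from slot 0 on every allocation. */
int alloc_fd_naive(struct fd_table *t) {
    for (int fd = 0; fd < MAX_FDS; fd++) {
        t->probes++;
        if (!t->used[fd]) {
            t->used[fd] = true;
            return fd;
        }
    }
    return -1;
}

/* Hinted search: resume from the lowest slot that might be free. */
int alloc_fd_hinted(struct fd_table *t) {
    for (int fd = t->next_hint; fd < MAX_FDS; fd++) {
        t->probes++;
        if (!t->used[fd]) {
            t->used[fd] = true;
            t->next_hint = fd + 1;
            return fd;
        }
    }
    return -1;
}

/* Release a slot and pull the hint back if the slot lies below it. */
void free_fd(struct fd_table *t, int fd) {
    assert(fd >= 0 && fd < MAX_FDS);
    t->used[fd] = false;
    if (fd < t->next_hint) t->next_hint = fd;
}
```

Allocating 2000 descriptors with nothing freed examines 2000 * 2001 / 2 = 2,001,000 slots with the naive search and 2000 with the hinted one. The same quadratic growth applies to any per-thread setup step that walks all existing threads, buffers or descriptors.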

rocallahan avatar Feb 08 '23 21:02 rocallahan