
Implement persistent checkpoints

Open rocallahan opened this issue 6 years ago • 40 comments

We could speed up the debugging process by allowing the trace to contain persistent checkpoints. Here are some features we could support:

  • A rr create-checkpoints command that does a full replay of a trace and periodically saves checkpoints of the state into the trace directory.
  • Make rr replay -g <N> start replay from the most recent persisted checkpoint.
  • Make rr replay load persisted checkpoints during reverse execution.
  • Make rr record optionally save checkpoints periodically into the trace directory to speed up the above operations.

The minimal implementation that would be useful would be the first two features. Other features could be added on top of that.

The main issues are the mechanics of saving and restoring checkpoints and the format of the persisted checkpoint data.

rocallahan avatar Apr 21 '18 20:04 rocallahan

Thanks -- what ideas do you have for saving all the application state (registers, mmapped files, file descriptors, etc.) in one go, rather than as a succession of deltas like it is currently? How would that work conceptually with reference to the current codebase? (Just some broad outlines -- I will dig deeper once you suggest a possible approach.)

sidkshatriya avatar Apr 22 '18 03:04 sidkshatriya

rr already has a notion of checkpoints. At the Session API level, ReplaySession::clone returns a ReplaySession that serves as a checkpoint. The returned session is "partially initialized", i.e. the thread states are stored in Session::clone_completion. We should implement persistent checkpoints by creating an API on ReplaySession that clones the current session and then writes the contents of CloneCompletion to a file. Serializing Task::CapturedState should be easy.

The tricky bit is serializing the contents of the address spaces held by AddressSpaceClone::clone_leader. A naive approach would be to walk the mappings of each clone_leader's AddressSpace and simply dump the contents into the checkpoint file along with the metadata for each mapping. That would probably be OK, but we could easily do a lot better by avoiding writing to the checkpoint file any pages that we know are equal to the corresponding pages of the mmap_ file in the trace they were mapped from.

The main part of the checkpoint should be written using a new Capnproto type. The Capnproto data can be followed by the block data itself. Everything should be compressed using brotli and I think we can get all the data into a single file.
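Roughly, the writing side could look like the sketch below. Only ReplaySession::clone, CloneCompletion, AddressSpaceClone::clone_leader and Task::CapturedState are names from this discussion; everything else (CheckpointWriter, persist_checkpoint, the accessors and serialize_* helpers) is a hypothetical stand-in, not rr's actual API.

  // Sketch only -- hypothetical names except for the types mentioned above.
  struct CheckpointWriter {
    explicit CheckpointWriter(const std::string& path);  // capnp message + brotli stream
    void finish();
  };
  void serialize_captured_state(CheckpointWriter&, const Task::CapturedState&);
  void serialize_address_space(CheckpointWriter&, Task* clone_leader);

  void persist_checkpoint(ReplaySession& session, const std::string& path) {
    // Clone the session; the clone is "partially initialized", with the
    // per-task state held in its CloneCompletion.
    ReplaySession::shr_ptr checkpoint = session.clone();
    CheckpointWriter out(path);
    for (auto& as : checkpoint->clone_completion()->address_spaces) {
      serialize_address_space(out, as.clone_leader);      // mappings + page contents
      for (auto& state : as.member_states) {
        serialize_captured_state(out, state);             // registers, fd table, etc.
      }
    }
    out.finish();                                         // everything in one compressed file
  }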

rocallahan avatar Apr 23 '18 02:04 rocallahan

For restoring checkpoints I think we need a static function in ReplaySession that's like create but takes two parameters: a trace_dir and a file descriptor for the opened checkpoint file, returning a new ReplaySession in the "partially initialized" state. It would deserialize the data, initialize the clone_leader tasks and create their memory mappings. It's a bit tricky but it would resemble how process_exec works in replay_syscalls.cc: exec an rr_exec_stub(_32), remove its existing mappings, and create new ones.

rocallahan avatar Apr 23 '18 02:04 rocallahan

Let's call that new function load_checkpoint.

I think it would be best if load_checkpoint actually calls ReplaySession::create to create a new ReplaySession and then clones that initial task to create each clone_leader. That would avoid having to modify or duplicate the (complicated) session startup logic.
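As a rough sketch, load_checkpoint could take the following shape. The signature and every helper name here (CheckpointReader, clone_initial_task_for, recreate_mappings, restore_captured_states) are guesses for illustration, not the real API.

  // Rough sketch of the restore path; all helper names are assumptions.
  struct CheckpointReader {
    explicit CheckpointReader(ScopedFd& fd);              // capnp + brotli, hypothetical
    std::vector<SerializedAddressSpace>& address_spaces();// one entry per AddressSpaceClone
  };

  // Would be a static member of ReplaySession.
  ReplaySession::shr_ptr load_checkpoint(const std::string& trace_dir,
                                         ScopedFd& checkpoint_fd) {
    // Reuse the normal startup logic rather than duplicating it.
    ReplaySession::shr_ptr session = ReplaySession::create(trace_dir);
    CheckpointReader in(checkpoint_fd);
    for (auto& as : in.address_spaces()) {
      // Clone the initial task to get each clone_leader, then rebuild its memory
      // much like process_exec in replay_syscalls.cc: exec an rr_exec_stub(_32),
      // drop its existing mappings, and recreate the saved ones.
      Task* leader = clone_initial_task_for(session.get(), as);
      recreate_mappings(leader, as);
      restore_captured_states(session.get(), as);
    }
    // Leave the session "partially initialized", like ReplaySession::clone does.
    return session;
  }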

rocallahan avatar Apr 23 '18 02:04 rocallahan

General update: I have started looking at the code you pointed out above. I'll definitely ask you questions as they come up. Right now I'm trying to achieve a basic understanding of the rr codebase and doing some stepping through of rr source (in gdb) to get a feel of things.

sidkshatriya avatar Apr 28 '18 06:04 sidkshatriya

Question:

  • What exactly is a leader (clone leader, task leader, group leader)?
  • A session tracks multiple tasks. Is a session basically tracking everything in a process (pid)?

In general the nomenclature (session, task, task group, leaders, group leaders, tid/pid, etc.) is a bit confusing, as these terms are quite overloaded in systems programming lingo. Is there a glossary or list of definitions I can check out somewhere that would explain what these mean in the context of the rr code?

sidkshatriya avatar Apr 28 '18 08:04 sidkshatriya

When cloning a ReplaySession, for each address space, we select one task (group_leader) that's using that address space and cause it to fork(), making a copy of that address space. Those cloned tasks are the clone_leaders.

task_leader isn't really a thing. The comment mentioning it needs to be updated.
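As a hedged illustration of that selection (not the real code; fork_task_into and the exact shape of session.tasks() are assumptions):

  std::set<AddressSpace*> seen;
  std::vector<Task*> clone_leaders;
  for (auto& entry : session.tasks()) {            // map of tid -> Task*
    Task* t = entry.second;
    if (seen.insert(t->vm().get()).second) {
      // First task found using this address space becomes its group leader;
      // forking it yields a copy-on-write snapshot of the whole address space.
      clone_leaders.push_back(fork_task_into(t, checkpoint_session));
    }
  }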

Is a session basically tracking everything in a process (pid)?

No, a session tracks an entire set of processes and their threads. When you record, everything in one recording is a RecordSession. Replaying with rr replay -a uses exactly one ReplaySession.

Generally we try pretty hard to make our terms match exactly the way they're used in the kernel. Hence Task, ThreadGroup (not 'task group', which is something obscure that rr doesn't care about), tid. We don't use `pid' much because it's actually a quite ambiguous concept in Linux; 'thread group id' is more specific. The 'leader' idea is rr-specific but hopefully I've explained it adequately above.

rocallahan avatar Apr 28 '18 08:04 rocallahan

So:

  • tid == Task == task_struct in Linux, and tgid == ThreadGroup.
  • group_leader is any one tid out of all the tids under a specific tgid (since all the tids under a tgid share an address space). If we have n different tgids then we will have one group leader per tgid.

Hope I've got this correct!

sidkshatriya avatar Apr 28 '18 10:04 sidkshatriya

Not quite. All threads in a thread-group share the same address space, but it is also possible for different thread-groups to share the same address space thanks to the magic of CLONE_VM.
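A tiny standalone illustration of that kernel behaviour (nothing rr-specific; build with g++ on Linux):

  // Two thread groups sharing one address space: without CLONE_THREAD the
  // child gets its own tgid, but CLONE_VM makes it run in the parent's memory.
  #include <sched.h>
  #include <stdio.h>
  #include <sys/wait.h>
  #include <unistd.h>

  static int shared_value = 0;

  static int child_fn(void*) {
    shared_value = 42;   // visible to the parent: same address space
    return 0;
  }

  int main() {
    static char stack[1024 * 1024];
    pid_t child = clone(child_fn, stack + sizeof(stack), CLONE_VM | SIGCHLD, nullptr);
    if (child < 0) { perror("clone"); return 1; }
    waitpid(child, nullptr, 0);
    // Different tgid, same memory: the write above is visible here.
    printf("child tgid=%d wrote shared_value=%d\n", child, shared_value);
    return 0;
  }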

rocallahan avatar Apr 28 '18 10:04 rocallahan

Thanks!

Question: Looking at the code, when you are doing checkpointing you are essentially doing a remote fork() call (in the tracee). This way you preserve the system state at that point, since fork() is copy-on-write, correct? What impact does this have on "ticks", the number of RCBs? Does the checkpointed session (or process??) get the exact same ticks when it does a fork?

sidkshatriya avatar Apr 28 '18 11:04 sidkshatriya

Yes.

We reset the hardware tick count every time we resume execution. The only ongoing tick count is the one we keep in Task, and we copy that when we clone each Task.

rocallahan avatar Apr 28 '18 12:04 rocallahan

We reset the hardware tick count every time we resume execution.

When you say resume execution, what exactly does that mean in this context? Resume execution after a syscall, resume execution from a checkpoint, or some other definition of resuming...

Also, practically speaking, the gdb checkpoint command should trigger a ReplaySession::clone, correct? So in that sense, when the gdb client asks for a checkpoint, the rr backend creates it, correct?

Now I also understand that rr uses checkpointing to home in on a (reverse-continue) breakpoint. rr needs to keep finding a series of runs, starting at more and more advanced points of program execution, in which a user-initiated breakpoint is hit, is hit, is hit, and then (finally) is not hit; it then takes the run just before the one where the breakpoint is not hit. So it will use the various checkpoints created during this process as points to start execution from, correct?

So essentially my question is: do the gdb concept of checkpointing and rr's internal use of checkpointing (to find the appropriate moment to trigger the breakpoint in the debugger) use the same underlying ReplaySession::clone feature?

sidkshatriya avatar Apr 28 '18 12:04 sidkshatriya

When you say resume execution what does that exactly mean in this context

Task::resume_execution. That runs every time we run any code in the tracee process via PTRACE_CONT or one of its variants --- user code, syscalls, whatever.

So it that sense when the gdb client asks for a checkpoint, the rr backend creates it, correct?

Yes, we override the checkpoint command to clone ReplaySession.

So it will use various checkpoints created during this process to start running the execution from, correct?

Yes. That's all controlled by ReplayTimeline.

is the gdb concept of checkpointing and the internal use of checkpointing in rr to find the appropriate moment to trigger the breakpoint in the debugger uses the same underlying rr ReplaySession::clone feature?

Yes.

rocallahan avatar Apr 28 '18 12:04 rocallahan

That would probably be OK, but we could easily do a lot better by avoiding writing to the checkpoint file any pages that we know are equal to the corresponding pages of the mmap_ file in the trace they were mapped from.

I understand this statement conceptually: we want to avoid writing redundant stuff. Can you explain the mmap_ file steps a bit? I understand that rr saves the mappings upon recording and then stores successive deltas, but an explanation of the whole mmap_ flow would be useful for greater understanding/context.

sidkshatriya avatar Apr 28 '18 14:04 sidkshatriya

The logic for deciding what to do for a mmap is all in TraceWriter::write_mapped_region. For certain files (like system libraries) we try to be efficient and clone (if the file system supports it) or hardlink the file into the trace. Neither of these operations uses any (appreciable amount of) disk space compared to actually copying the file's contents into the trace.

Some pages of these files will later be changed in memory by the program (e.g. shared libraries after relocations are processed by ld.so). To support efficient checkpointing we could keep track of which pages are ever PROT_WRITE and avoid storing them in the checkpoint (because if they were never PROT_WRITE they can't have changed).
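A sketch of that page-skipping idea, with hypothetical names (Mapping, ever_writable and the write_page_* helpers are stand-ins, not rr's checkpoint code):

  #include <cstdint>

  struct Mapping {
    uintptr_t start, end;
    bool backed_by_trace_file;   // cloned/hardlinked mmap_ file already in the trace
  };

  bool ever_writable(const Mapping&, uintptr_t page);            // was the page ever PROT_WRITE?
  void write_page_ref(int fd, const Mapping&, uintptr_t page);   // "identical to the mmap_ file"
  void write_page_bytes(int fd, const Mapping&, uintptr_t page); // raw page contents

  void dump_mapping(const Mapping& m, int checkpoint_fd, uintptr_t page_size) {
    for (uintptr_t page = m.start; page < m.end; page += page_size) {
      if (m.backed_by_trace_file && !ever_writable(m, page)) {
        // Never writable => cannot have diverged from the file stored in the
        // trace, so record just a reference instead of the bytes.
        write_page_ref(checkpoint_fd, m, page);
      } else {
        write_page_bytes(checkpoint_fd, m, page);
      }
    }
  }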

khuey avatar Apr 28 '18 15:04 khuey

I've been reading the extended rr technical report from ArXiv. Thank you for such a great technical summary of how rr works.

As part of my familiarization with rr internals I have a question from that paper.

3.3 Detecting Blocked System Calls

[...] Meanwhile the read system call in T is restarted and treated much like a regular non-intercepted system call. Its parameters are rewritten to use scratch buffers. T is resumed using PTRACE_SYSCALL so that the thread will stop when the system call exits. After the system call exits, T will eventually be rescheduled. At that point the system-call exit will be recorded.

The read() system call is restarted with PTRACE_SYSCALL (because we don't want in-process recording here, as the call was found to be blocking). Now as mentioned earlier, this read() could block (maybe waiting for a write() on a pipe), so it should result in the thread being descheduled from the processor right away. I don't understand the comment "that the thread will stop when the system call exits". The thread has already been stopped, i.e. descheduled, waiting for the read() call to complete (as it's waiting on a write()), hasn't it?

sidkshatriya avatar Apr 29 '18 06:04 sidkshatriya

Descheduling the thread from the CPU does not count as a ptrace stop. From ptrace's point of view, the thread is still "running". In ptrace terms the thread doesn't stop until the syscall exits.

rocallahan avatar Apr 29 '18 11:04 rocallahan

Thanks -- so the next ptrace-stop happens when the read() call exits. De-scheduling of a process does not cause a ptrace-stop (unless there is a perf event).

So the next question is: how do you prevent a deadlock in the pipe example we have been discussing? The read() call is blocking, but ptrace still thinks the thread is running. rr would need to start running another thread containing the corresponding call to write() (to unblock the thread containing the read system call), while maintaining the invariant that only one thread is running at any given time. How can rr schedule another thread to run while the thread containing read() is "running", i.e. blocked (to ptrace the two look the same)?

sidkshatriya avatar Apr 29 '18 17:04 sidkshatriya

That's what the desched event is for. We set up a desched event fd to send a signal to the tracee when it is descheduled. That signal delivery causes a ptrace trap that rr can see.
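For reference, the underlying kernel facility looks roughly like this sketch (rr's actual desched-event setup in its syscall buffer code differs in detail, and the choice of signal here is arbitrary):

  // A software context-switch counter with sample_period=1 queues a signal on
  // the thread each time it is descheduled; that signal delivery is a ptrace
  // stop the tracer can observe even while the syscall itself is still blocked.
  #include <fcntl.h>
  #include <linux/perf_event.h>
  #include <string.h>
  #include <sys/ioctl.h>
  #include <sys/syscall.h>
  #include <unistd.h>

  int open_desched_counter(int signo) {
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.type = PERF_TYPE_SOFTWARE;
    attr.config = PERF_COUNT_SW_CONTEXT_SWITCHES;
    attr.sample_period = 1;                      // fire on every deschedule
    attr.disabled = 1;                           // enable later with PERF_EVENT_IOC_ENABLE
    int fd = syscall(SYS_perf_event_open, &attr, 0 /* this thread */,
                     -1 /* any cpu */, -1 /* no group */, 0);
    if (fd < 0) return -1;
    // Route counter overflow to a queued signal directed at this thread.
    fcntl(fd, F_SETFL, O_ASYNC);
    fcntl(fd, F_SETSIG, signo);                  // which signal to use is a policy choice
    struct f_owner_ex owner = { F_OWNER_TID, (pid_t)syscall(SYS_gettid) };
    fcntl(fd, F_SETOWN_EX, &owner);
    return fd;
  }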

khuey avatar Apr 29 '18 17:04 khuey

I thought the desched event was only to prevent deadlocks during in-process syscall recording. In fact the paper describes how the desched event prevents such a deadlock when the read() call is not being ptraced. It then describes how the read() system call is restarted using PTRACE_SYSCALL. The question is: how do you prevent it from blocking this time around?

sidkshatriya avatar Apr 29 '18 17:04 sidkshatriya

Oh, sorry, I misunderstood your question. On the normal ptrace path blocking syscalls return ALLOW_SWITCH from rec_prepare_syscall_arch. That triggers a different path in the scheduler (grep it for PREVENT_SWITCH) which ends up checking is_task_runnable on the current task. That'll do a nonblocking waitpid and if that doesn't return a new ptrace event we'll switch tasks.
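In pseudocode, that path is roughly the following (only ALLOW_SWITCH/PREVENT_SWITCH, rec_prepare_syscall and is_task_runnable are names from this thread; the rest is made up for illustration):

  #include <sys/wait.h>

  enum Switchable { PREVENT_SWITCH, ALLOW_SWITCH };

  struct TraceeTask { pid_t tid; };
  Switchable rec_prepare_syscall(TraceeTask* t);   // blocking syscalls return ALLOW_SWITCH
  TraceeTask* choose_next_runnable_task();         // hypothetical scheduler hook

  TraceeTask* maybe_switch(TraceeTask* current) {
    if (rec_prepare_syscall(current) == ALLOW_SWITCH) {
      int status;
      // "is_task_runnable": a nonblocking waitpid. If no new ptrace event has
      // arrived, the task is still stuck in the kernel (e.g. read() on an empty
      // pipe), so run someone else -- perhaps the thread that will eventually
      // write() to that pipe and unblock it.
      if (waitpid(current->tid, &status, WNOHANG) == 0) {
        return choose_next_runnable_task();
      }
    }
    return current;
  }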

khuey avatar Apr 29 '18 17:04 khuey

I have been studying the ReplaySession::clone call in detail. While stepping through the debugger, one can see a remote clone syscall being made. The remote registers for the syscall are set up by doing a flush_regs() call. Then a PTRACE_SINGLESTEP is issued. What is interesting is that ip() is at 0x7000000F when the ptrace is issued. This is the syscall instruction in the rr_page_64_replay page, so it all works out and makes sense. But what is surprising to me is... where/how does ip() get set to 0x7000000F? I can understand that it needs to be at that address, but we could be anywhere in the tracee codebase. How do we manage to end up at the correct address for the syscall instruction to run?

sidkshatriya avatar Apr 30 '18 18:04 sidkshatriya

Look at AutoRemoteSyscalls::setup_path.
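Conceptually the remote-syscall machinery does something like the outline below (the real logic is in AutoRemoteSyscalls and its setup_path; every name here is illustrative, not rr's API):

  struct RegisterSet { uintptr_t ip; long syscallno, arg1, arg2, syscall_result; };
  struct TaskHandle;
  void push_regs_to_tracee(TaskHandle*, const RegisterSet&);   // the "flush_regs()" step
  void single_step(TaskHandle*);                               // PTRACE_SINGLESTEP
  RegisterSet read_regs_from_tracee(TaskHandle*);

  struct RemoteSyscallSketch {
    TaskHandle* t;
    RegisterSet saved_regs;                  // captured when the helper is created
    uintptr_t known_syscall_instruction;     // e.g. the syscall insn in the rr page

    long do_syscall(int no, long arg1, long arg2) {
      RegisterSet r = saved_regs;
      r.ip = known_syscall_instruction;      // this is how ip() ends up at that address
      r.syscallno = no;
      r.arg1 = arg1;
      r.arg2 = arg2;
      push_regs_to_tracee(t, r);
      single_step(t);                        // executes exactly that one instruction
      return read_regs_from_tracee(t).syscall_result;
    }
    // On destruction the saved registers are restored, so the tracee continues
    // exactly where it was before the remote syscall.
  };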

khuey avatar Apr 30 '18 18:04 khuey

I'm trying to understand ReplayTimeline a bit. However the code is pretty dense in there and I'm a bit confused by Mark, ProtoMark and InternalMark. Can someone give me a short (or long :-) ) explanation of the terminology and basic concepts here?

sidkshatriya avatar May 03 '18 20:05 sidkshatriya

Read all the comments in ReplayTimeline.h and for whatever is still not clear, we should expand those comments.

rocallahan avatar May 03 '18 22:05 rocallahan

  /**
   * All the information we'll need to construct a mark lazily.
   */
  struct ProtoMark {

A couple of sentences about the motivation for creating marks lazily would be useful. Also, Marks contain InternalMarks which contain ProtoMarks -- it's not clear from this encapsulation how a ProtoMark would be a "lazy" version of a Mark.

  /**
   * MarkKey + Registers are assumed to identify a unique program state.
   * We can't order these states directly based on this data, so we have to
   * record the ordering in the ReplayTimeline.
   */
  struct InternalMark {

This comment could be elaborated a bit more -- it's a bit cryptic/minimalist at the moment.

Essentially ReplayTimeline is complex because of the various kinds of marks, the various ways of measuring progress and the various "keys". Your explanation of clone in https://github.com/mozilla/rr/issues/2184#issuecomment-383438010 was short but extremely useful. Using that as a jumping-off point I was able to explore the code in a debugger and I now understand session and task cloning quite well. A similar overview explanation would be lovely for ReplayTimeline too!

sidkshatriya avatar May 04 '18 15:05 sidkshatriya

Also, conceptually, what does this map "mean":

  std::map<MarkKey, std::vector<std::shared_ptr<InternalMark>>> marks;

MarkKey represents frame time + ticks. InternalMark contains a ProtoMark, which itself stores a MarkKey and registers. So I guess what I want to know is how one MarkKey can correspond to many InternalMarks.

sidkshatriya avatar May 04 '18 20:05 sidkshatriya

Hopefully this helps: https://github.com/mozilla/rr/commit/0beec8c8d566e0bfb3f05abb86f5459a9a5fe366

rocallahan avatar May 05 '18 02:05 rocallahan

Thank you so much for the comments. It makes sense now!

sidkshatriya avatar May 05 '18 18:05 sidkshatriya

I have a question about Section 3 - In-process system call interception in the extended technical report

(Section 3) [...] Therefore we inject into the recorded process a library that intercepts common system calls, performs the system call without triggering a ptrace trap [...]

So I understand this interception library to be librrpreload.so that is inserted via LD_PRELOAD.

In Section 3.1

[...] In practice, we have found that method to be insufficient, due to applications making direct system calls, and fragile due to variations in C libraries, and applications that require their own preloading. Instead, when the tracee makes a system call, RR is notified via a ptrace trap and it tries to rewrite the system-call instruction to call into our interception library.

(Emphasis mine)

The statements in Section 3 and Section 3.1 seem a bit confusing when put together. What do you mean by "common" system calls in Section 3? Are they all the system calls except the ones that "escape" LD_PRELOAD because of (a) direct system calls, (b) variations in C libraries, (c) applications that require their own preloading? Or are there some explicit, common calls simply enumerated and then overridden in librrpreload.so?

The reason this is important is that you're paying a penalty of two context switches for these calls if they raise a ptrace event to rr.

Here is the algorithm as I understand it. A system call happens:

  • Case A: the system call was intercepted by LD_PRELOAD. Allow/deny of the system call happens via seccomp-bpf. If "allow", then there is no ptrace event. If "deny", then a ptrace event happens.
  • Case B: the system call was not intercepted by LD_PRELOAD. A ptrace event is raised. The system call is rewritten via ptrace to go through the interception library, i.e. Case A.

In other words, is there an explicit enumeration of system calls intercepted in Case A, or is it just the ones that "escape" due to (a) direct system calls, (b) variations in C libraries, (c) applications that require their own preloading?
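For context on Case A's allow/deny step, here is a deliberately simplified sketch of an instruction-pointer-based seccomp-bpf filter. It illustrates the mechanism only and is not rr's actual filter, which checks more than this (including the architecture and the full 64-bit instruction pointer):

  // Syscalls issued from the interception library's dedicated "untraced"
  // syscall instruction are allowed outright; everything else returns
  // SECCOMP_RET_TRACE, which shows up as a ptrace event in the tracer.
  #include <linux/filter.h>
  #include <linux/seccomp.h>
  #include <stddef.h>
  #include <stdint.h>
  #include <sys/prctl.h>

  // `untraced_ip_lo` is the low 32 bits of the address of that dedicated
  // syscall instruction; a real filter must also check the high word.
  int install_ip_filter(uint32_t untraced_ip_lo) {
    struct sock_filter prog[] = {
      // Load the low word of the instruction pointer of the syscall site.
      BPF_STMT(BPF_LD | BPF_W | BPF_ABS,
               offsetof(struct seccomp_data, instruction_pointer)),
      // If it matches the untraced syscall instruction, allow without a trap.
      BPF_JUMP(BPF_JMP | BPF_JEQ | BPF_K, untraced_ip_lo, 0, 1),
      BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_ALLOW),
      // Otherwise hand the syscall to the tracer (ptrace event).
      BPF_STMT(BPF_RET | BPF_K, SECCOMP_RET_TRACE),
    };
    struct sock_fprog fprog = { sizeof(prog) / sizeof(prog[0]), prog };
    if (prctl(PR_SET_NO_NEW_PRIVS, 1, 0, 0, 0)) return -1;
    return prctl(PR_SET_SECCOMP, SECCOMP_MODE_FILTER, &fprog);
  }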

sidkshatriya avatar May 07 '18 07:05 sidkshatriya