Potential undocumented mmap() limitation: the underlying file must not be changed by other processes

Open dr-m opened this issue 3 years ago • 5 comments

While https://rr-project.org/rr.html mentions the mmap(2) system call (hint: in Firefox, hit Ctrl-Alt-R to exit the “slide show” mode), it does not mention one limitation that I suspect must exist, based on my understanding of how rr works. A quick search for mmap in https://github.com/rr-debugger/rr/wiki did not turn up anything either.

I have implemented an option to use mmap() for the circular write-ahead log of a database server. The primary motivation is to support memory-mapped I/O to a log that is stored in persistent memory (MAP_SYNC, mount -o dax), but I implemented it for /dev/shm as well. With this, database crash recovery works just fine under rr, and recovery problems are much more convenient to debug, because memory watchpoints work over the full log file, both before the database server was intentionally killed and while recovery was running.

We got some strange and incorrect-looking traces when a backup program was being traced while the database server was under write load. The backup process has a read-only mapping of a log file in /dev/shm that the database server process is concurrently modifying via its read/write mapping. (If the server is writing a small circular log too quickly, backup may fail to read everything before the log overwrites itself.)

For normal file system operations, rr record would save into the trace any bytes that a read from a file descriptor returned. I assume that nothing like this is ever done for any data that is being read via mmap()-assigned virtual memory addresses. My perhaps naïve expectation is that at the time the mmap() is executed, the entire contents of the memory mapping will be copied to the trace file. The item “mmaps: metadata about mmap'd files” in https://rr-project.org/rr.html could be interpreted like this. Between mmap() and munmap(), nothing about the memory mapped area would have to be written to the trace.

Copying the mmap()-time contents of the file to the trace would allow the exact contents to be restored on rr replay even if the memory-mapped area is later modified by the tracee. If this weren’t the case, I think that I should have had trouble replaying mmap-based database crash recovery tests. Perhaps writing the mmaps is prone to race conditions by itself too?

I would think that rr assumes that after an mmap() call, the file may only be modified by processes that are being traced by the current rr process. Any concurrent modifications by processes that are not traced by the same rr invocation could cause unpredictable results, because such modifications would not be reflected by the mmaps in the trace directory.
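To make the suspected limitation concrete, here is a minimal Python sketch (not taken from rr or from the database server). It assumes Linux, uses a temporary file instead of /dev/shm, and the function name `observe_outside_write` is made up for the illustration. The writer is a forked child only to keep the example self-contained; the problematic case is when the writer is a process that rr is not recording.

```python
import mmap
import os
import tempfile

SIZE = 4096  # one page is enough for the illustration


def observe_outside_write():
    """Return (snapshot_at_map_time, live_view_after_write) for a shared mapping."""
    fd, path = tempfile.mkstemp()  # stand-in for the log file in /dev/shm
    try:
        os.ftruncate(fd, SIZE)  # the new bytes are zero-filled
        # Reader ("backup" role): read-only shared mapping of the file.
        reader = mmap.mmap(fd, SIZE, access=mmap.ACCESS_READ)
        snapshot = reader[:5]  # what a copy taken at mmap() time would contain

        pid = os.fork()
        if pid == 0:
            # Writer ("server" role): another process with its own
            # read/write mapping of the same file.
            status = 1
            try:
                wfd = os.open(path, os.O_RDWR)
                writer = mmap.mmap(wfd, SIZE)  # Unix default: MAP_SHARED, read/write
                writer[:5] = b"hello"
                writer.close()
                os.close(wfd)
                status = 0
            finally:
                os._exit(status)  # never fall through into the parent's code
        os.waitpid(pid, 0)

        live = reader[:5]  # plain memory load; no read() system call is made
        reader.close()
        return snapshot, live
    finally:
        os.close(fd)
        os.unlink(path)
```

The reader's view changes between `snapshot` and `live` without the reader making any system call in between, so a copy of the file taken at mmap() time no longer matches what the reader actually saw.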

Are my assumptions correct? Could this be documented?

dr-m, Jan 20 '22 17:01

> I would think that rr assumes that after an mmap() call, the file may only be modified by processes that are being traced by the current rr process.

That's right. rr handles cases where processes inside the trace write to a shared mapping (with the code in EmuFs.h/cc). There's no way for rr to observe writes to a MAP_SHARED mapping that originate from a process that is outside the trace though.

There is a line in that presentation, "rr doesn't (can't efficiently) record reads/writes of memory shared outside of tracee application", which covers this, though without any technical detail.

khuey, Jan 20 '22 17:01

Are there any technical workarounds? Perhaps a dumb suggestion to get the creative juices flowing - how about encapsulating the startup of various applications under a single parent binary where the child processes of such a binary are also traced by rr?

mariadb-RoelVandePaar, Feb 21 '22 02:02

Sure, if you can get all the processes recorded by rr --- e.g. by recording a single parent process and its offspring --- things are good. I don't know if that needs to be documented as a workaround specifically.

rocallahan, Feb 21 '22 03:02

Perhaps something along the lines of "how to trace multiple processes by creating a parent [shell?] process". And perhaps such a parent process can easily be created with a shell script that starts all the required processes?
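As a rough sketch of such a parent process (written in Python rather than shell, and not an official rr recipe), the following launcher starts every command as its own child. The function name `run_under_one_parent` is hypothetical, and the commands passed to it are whatever server and backup invocations apply in a given setup.

```python
import subprocess


def run_under_one_parent(commands):
    """Start every command as a child of this process and return their exit codes.

    `commands` is a list of argv lists. All children run concurrently and
    share this process as their parent.
    """
    children = [subprocess.Popen(argv) for argv in commands]
    return [child.wait() for child in children]
```

If a script calling this function were saved under a hypothetical name such as launcher.py and started as `rr record python3 launcher.py`, the launcher and its offspring would all belong to the same recording, which is the condition described above for shared mappings to be handled.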

mariadb-RoelVandePaar, Feb 21 '22 06:02

When it comes to MariaDB/server@685d958e38b825ad9829be311f26729cccf37c46, the workaround is either to never invoke mariadb-backup --backup under rr record, or to ensure that the mmap interface for reading the concurrently running server’s log is not used, either by storing the log outside /dev/shm and any mount -o dax file system, or by patching the source code. Even better would be a redesign in which the database server process streams its log to the backup. I have always found inter-process communication via the file system problematic, and I am surprised that it works so well in practice, including under concurrent rr record invocations that avoid this limitation.
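The "do not use the mmap interface" option could look roughly like the Python sketch below. This is not the MariaDB code; `read_log_range` and its `use_mmap` switch are hypothetical names. The point is that the non-mmap path fetches the bytes through a system call, which is the kind of read that rr is described above as saving into the trace.

```python
import mmap
import os


def read_log_range(path, offset, length, use_mmap):
    """Return `length` bytes of the file at `path`, starting at `offset`."""
    fd = os.open(path, os.O_RDONLY)
    try:
        if use_mmap:
            # Memory-mapped path: the bytes come from plain memory loads.
            size = os.fstat(fd).st_size
            with mmap.mmap(fd, size, access=mmap.ACCESS_READ) as view:
                return view[offset:offset + length]
        # System-call path: the bytes are returned by pread().
        return os.pread(fd, length, offset)
    finally:
        os.close(fd)
```

Both paths return the same bytes for a file that nobody else is modifying; they differ only in whether the data crosses a system call boundary on its way to the reader.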

dr-m, Feb 21 '22 07:02