Potential undocumented mmap() limitation: the underlying file must not be changed by other processes
While https://rr-project.org/rr.html mentions the mmap(2) system call (hint: in Firefox, hit Ctrl-Alt-R to exit the “slide show” mode), it does not mention one limitation that I suspect must exist, based on my understanding of how rr works. A quick search for mmap in https://github.com/rr-debugger/rr/wiki did not turn up anything either.
I have implemented an option to use mmap() for the circular write-ahead log of a database server. The primary motivation for this is to support memory-mapped I/O to a log that is stored in persistent memory (MAP_SYNC, mount -o dax), but I implemented it for /dev/shm as well. With this, database crash recovery works just fine under rr, and it is much more convenient to debug recovery problems, because memory watchpoints work over the full log file, both before the database server was intentionally killed and while recovery is running.
We got some strange and incorrect-looking traces when a backup program was being traced while the database server was under write load. The backup process has a read-only mapping of a log file in /dev/shm that the database server process is concurrently modifying via its read/write mapping. (If the server is writing a small circular log too quickly, backup may fail to read everything before the log overwrites itself.)
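To illustrate the parenthetical remark about a small circular log, here is a minimal sketch of the overrun condition. The function name and the byte counters are hypothetical bookkeeping for illustration only; they are not taken from the database server or the backup program.

```python
def reader_was_overrun(bytes_written: int, bytes_read: int, capacity: int) -> bool:
    """Return True if the writer has lapped the reader in a circular log.

    bytes_written and bytes_read are monotonically increasing totals
    (hypothetical counters, not from any real server). The log physically
    holds only the most recent `capacity` bytes, so anything older than
    that has already been overwritten.
    """
    if not 0 <= bytes_read <= bytes_written:
        raise ValueError("reader cannot be ahead of the writer")
    return bytes_written - bytes_read > capacity
```

With a 1024-byte log, a reader that is 900 bytes behind can still catch up, while a reader that is 1400 bytes behind has lost data.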
For normal file system operations, rr record would save into the trace any bytes that a read from a file descriptor returned. I assume that nothing like this is ever done for any data that is being read via mmap()-assigned virtual memory addresses. My perhaps naïve expectation is that at the time the mmap() is executed, the entire contents of the memory mapping will be copied to the trace file. The item “mmaps: metadata about mmap'd files” in https://rr-project.org/rr.html could be interpreted like this. Between mmap() and munmap(), nothing about the memory mapped area would have to be written to the trace.
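The following sketch shows the behaviour I am worried about, independent of rr: data seen through a read-only shared mapping changes when someone else writes to the same file, and the reader makes no system call that a recorder could intercept. It assumes Linux and Python's mmap module; the function name is made up, and the "writer" is a second mapping in the same process standing in for the untraced server.

```python
import mmap
import os
import tempfile


def read_through_mapping_around_external_write() -> tuple:
    """Return what a read-only shared mapping shows before and after
    a second, independent mapping of the same file is modified."""
    fd, path = tempfile.mkstemp()
    try:
        os.ftruncate(fd, mmap.PAGESIZE)
        os.pwrite(fd, b"old!", 0)
        # The "backup" side: a read-only MAP_SHARED mapping.
        reader = mmap.mmap(fd, mmap.PAGESIZE, access=mmap.ACCESS_READ)
        before = reader[:4]
        # Stand-in for the untraced "server": a separate descriptor and a
        # separate read/write MAP_SHARED mapping of the same file.
        writer_fd = os.open(path, os.O_RDWR)
        writer = mmap.mmap(writer_fd, mmap.PAGESIZE, access=mmap.ACCESS_WRITE)
        writer[:4] = b"new!"
        # A plain memory copy out of the mapping, not a read() system call.
        after = reader[:4]
        writer.close()
        os.close(writer_fd)
        reader.close()
        return before, after
    finally:
        os.close(fd)
        os.unlink(path)
```

On Linux this returns (b"old!", b"new!"): the second value never passed through a file descriptor read, so a snapshot taken at mmap() time would no longer match.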
Copying the mmap()-time contents of the file to the trace would allow the exact contents to be restored on rr replay even if the memory-mapped area is later modified by the tracee. If this weren’t the case, I think that I should have had trouble replaying mmap-based database crash recovery tests. Perhaps writing the mmaps is prone to race conditions by itself too?
I would think that rr assumes that after an mmap() call, the file may only be modified by processes that are being traced by the current rr process. Any concurrent modifications by processes that are not traced by the same rr invocation could cause unpredictable results, because such modifications would not be reflected by the mmaps in the trace directory.
Are my assumptions correct? Could this be documented?
I would think that rr assumes that after an mmap() call, the file may only be modified by processes that are being traced by the current rr process.
That's right. rr handles cases where processes inside the trace write to a shared mapping (with the code in EmuFs.h/cc). There's no way for rr to observe writes to a MAP_SHARED mapping that originate from a process that is outside the trace though.
There is a bit in that presentation "rr doesn't (can't efficiently) record reads/writes of memory shared outside of tracee application", which covers this, though not with any technical detail.
Are there any technical workarounds? Perhaps a dumb suggestion to get the creative juices flowing: how about encapsulating the startup of the various applications under a single parent binary, so that the child processes of that binary are also traced by rr?
Sure, if you can get all the processes recorded by rr --- e.g. by recording a single parent process and its offspring --- things are good. I don't know if that needs to be documented as a workaround specifically.
Perhaps something along the lines of "how to trace multiple processes by creating a parent [shell?] process". And perhaps such a parent process could easily be created with a shell script that starts all the required processes?
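As a rough sketch of such a parent process (a shell script would do equally well): a small launcher that starts every required program as its own child, so that recording the launcher records its offspring too. The file name launcher.py, the function run_all and the command strings below are placeholders, not anything rr or MariaDB provides.

```python
import shlex
import subprocess
import sys


def run_all(commands: list) -> list:
    """Start every command as a child of this process and wait for all of them.

    Each command is an argument list such as ["prog", "--flag"].
    Returns the exit codes in the same order as `commands`.
    """
    children = [subprocess.Popen(cmd) for cmd in commands]
    return [child.wait() for child in children]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Each command-line argument is one complete command, quoted as a whole.
    print(run_all([shlex.split(arg) for arg in sys.argv[1:]]))
```

It would be invoked along the lines of rr record python3 launcher.py "server-command --its-options" "backup-command --its-options", with the real commands substituted. Note that this starts everything at once; a real script would probably have to wait for the server to be ready before starting the backup.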
When it comes to MariaDB/server@685d958e38b825ad9829be311f26729cccf37c46, the work-around is to ensure either that mariadb-backup --backup is never invoked under rr record, or that the mmap interface for reading the concurrently running server’s log will not be used, either by storing the log outside /dev/shm or any mount -o dax file system, or by patching the source code. Even better would be a redesign, to have the database server process stream its log to the backup. I always found the inter-process communication via the file system problematic, and I am surprised that it works so well in practice, also under concurrent rr record invocations that are avoiding this limitation.