measureme
Unbuffered event stream.
To avoid hitting a Mutex for every event being recorded, and assuming that all the other streams have (much) lower frequency, we could keep the current format while recording it differently (see the sketch after this list):
- after the file header, and after any non-events page, start an events page (i.e. leave enough space for a page header)
- writing events happens "atomically", just like writing whole pages
  - requires the backing storage to be a mmap'd file, we don't want to actually hit a syscall here
  - we also need this for the ability to track what position we are at (using an `AtomicUsize`)
    - should work as we're (i.e. the atomic position is) effectively always inside an "open" events page
- writing a non-events page (as well as finishing the file) has to write the correct length in the "currently open" events page, for which we'd have to:
  - first grab a global (per backing store) lock, to avoid page-writing races with other non-events streams
  - claim the range for the new page to write using `fetch_add` on the same `AtomicUsize` position (that writing events uses), effectively "closing" the "currently open" events page (new events would effectively end up in a newer page)
  - use the "start of the events page" (from the global lock) and the "start of the new non-events page" (claimed above) to compute the correct length to use in the header of the just-closed events page, and write that header
  - update the "start of the events page" (in the global lock) to just after the end of the newly claimed non-events page
  - release the lock and start writing the new page (in the claimed range)
    - nothing else will access that range, so there's no need to keep the lock any longer
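A minimal sketch of the two paths under that protocol, assuming an already-mmap'd backing buffer and a 4-byte page-length header; `Backing` and both method names are made up for illustration, not measureme's actual API:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

// Assumed: a 4-byte length field at the start of every page.
const PAGE_HEADER_SIZE: usize = 4;

struct Backing {
    mmap: *mut u8,    // start of the mmap'd file (mapping elided)
    pos: AtomicUsize, // next free byte; the "atomic position"
    // State only non-event writers touch, behind one global lock:
    // the offset of the "currently open" events page.
    open_events_page: Mutex<usize>,
}

// Sketch only: we hand out disjoint ranges, so concurrent writes
// through `mmap` never overlap.
unsafe impl Sync for Backing {}

impl Backing {
    /// Fast path: one `fetch_add`, no lock, no syscall.
    fn write_event(&self, event: &[u8]) {
        let offset = self.pos.fetch_add(event.len(), Ordering::Relaxed);
        unsafe {
            std::ptr::copy_nonoverlapping(event.as_ptr(), self.mmap.add(offset), event.len());
        }
    }

    /// Slow path: close the open events page, then write a non-events page.
    fn write_non_events_page(&self, data: &[u8]) {
        let mut events_page_start = self.open_events_page.lock().unwrap();
        // Claim this page's header + payload, plus header space for the
        // next events page; this "closes" the current events page, since
        // new events now land past the claimed range.
        let page_start = self
            .pos
            .fetch_add(2 * PAGE_HEADER_SIZE + data.len(), Ordering::Relaxed);
        // The just-closed events page spans [*events_page_start, page_start).
        let closed_len = (page_start - *events_page_start) as u32;
        unsafe {
            std::ptr::copy_nonoverlapping(
                closed_len.to_le_bytes().as_ptr(),
                self.mmap.add(*events_page_start),
                PAGE_HEADER_SIZE,
            );
        }
        // The next events page starts right after this non-events page.
        *events_page_start = page_start + PAGE_HEADER_SIZE + data.len();
        drop(events_page_start); // nothing else touches the claimed range
        unsafe {
            std::ptr::copy_nonoverlapping(
                data.as_ptr(),
                self.mmap.add(page_start + PAGE_HEADER_SIZE),
                data.len(),
            );
        }
    }
}
```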
Sadly, bypassing `File` entirely will probably be most of the work here; I don't fully understand why measureme never properly mmap'd a file on disk (the pre-paging mmaps were "anonymous", i.e. no different than ungrowable `Vec<u8>`s).
I'm also not sure how we'd handle not being able to grow the mmap'd region (not without re-adding locking to event writes). We'd need to be on a 64-bit host to even grab gratuitous amounts of virtual memory without hampering the rest of the process, wouldn't we?
Maybe for 32-bit hosts we can keep paging events and writing to a `File`?
cc @michaelwoerister @wesleywiser
There were also some ideas on Zulip around using "non-temporal" (i.e. cache-bypassing) stores to the mmap'd file, in order to avoid polluting the cache with data only the kernel should ever read again (if at all, maybe AHCI/NVMe would read it themselves from RAM through DMA), but that's farther off, and would require a scheme like this to even be plausible.
> Sadly, bypassing `File` entirely will probably be most of the work here; I don't fully understand why measureme never properly mmap'd a file on disk
For reference, there was a (half-hearted) attempt to do that in https://github.com/rust-lang/measureme/pull/16/files. It was abandoned due to lack of time and because existing `SerializationSink` implementations were deemed good enough for the time being.
Thanks for writing that up, @eddyb! This protocol looks quite promising to me. I like how it basically keeps the file format intact and is actually not that hard to understand (surprisingly so for something involving locks and atomic operations :smile:)
The main obstacle I see is the handling of mmap'd files, especially how to grow them, and how to do so across the various platforms.
For growing, I came up with this hack:
Keep the "atomic writes" for the events, but have a "slow path" in the same place where "out-of-bounds panic" would normally be handled (i.e. the result of pos.fetch_add(size_of::<RawEvent>()) went out of bounds of the mmap'd region).
The "slow path" for events can grab the same lock that non-event pages use, and then we can have a common codepath for all writes, inside the lock. Even though events would've already fetch_add'd, whereas non-event pages still have to do that to detect growing is necessary, events should still also fetch_add, to detect that they "lost" the race, and that growing has already occurred.
Even if this is practically unlikely, if there are enough threads that fetch_add overflows (and therefore other threads might start writing at the start of the mmap'd region), we can make overflowing the atomic position mark the file as invalid (using pwrite), or even outright delete it, and panic. This should keep everything safe/correct even in a chaotic extreme.
Once everything is either waiting for the lock (or bumping `pos` and about to hit the lock as well), we can safely (sketched in code below):
- increase the size of the file on disk (using `fallocate`?)
- remap the mmap'd region to point into the new area of the file
- reset `pos` to `0` (effectively "unlocking" any incoming event writes)
- go back to the start of the locked "slow path" region
  - while unlikely, it is possible that a lot of events on another thread would fill the newly-added area and require growing again, just after `pos` was reset to `0`
- (eventually) unlock the non-event pages lock
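Roughly, that slow path could look like this; `GrowableSink`, `mapped_len`, and `grow_file_and_remap` are placeholders standing in for the fallocate/remap machinery:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Mutex;

struct GrowableSink {
    pos: AtomicUsize,        // offset within the current mmap'd window
    mapped_len: AtomicUsize, // size of the current mmap'd window
    lock: Mutex<()>,         // the non-event pages lock
}

impl GrowableSink {
    /// Claim `size` bytes, growing the backing file if we run out.
    fn claim(&self, size: usize) -> usize {
        loop {
            let offset = self.pos.fetch_add(size, Ordering::Relaxed);
            if offset + size <= self.mapped_len.load(Ordering::Relaxed) {
                return offset; // fast path: still inside the mmap'd region
            }
            // Slow path: common codepath for all writes, inside the lock.
            // (Overflowing `pos` could additionally mark the file as
            // invalid and panic, per the note above.)
            let guard = self.lock.lock().unwrap();
            // `fetch_add` again: if another thread already grew the file
            // and reset `pos` while we waited, we "lost" the race and can
            // just use the fresh offset.
            let offset = self.pos.fetch_add(size, Ordering::Relaxed);
            if offset + size <= self.mapped_len.load(Ordering::Relaxed) {
                return offset;
            }
            // Still out of bounds while holding the lock: grow for real.
            self.grow_file_and_remap(); // fallocate + remap, elided
            self.pos.store(0, Ordering::Relaxed); // "unlock" event writes
            drop(guard);
            // Loop: another thread may fill the new area before we retry.
        }
    }

    fn grow_file_and_remap(&self) {
        // Grow the file on disk, remap the window into the new area,
        // and bump `mapped_len`; details elided.
        unimplemented!()
    }
}
```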
> and is actually not that hard to understand (surprisingly so for something involving locks and atomic operations :smile:)
I spoke too soon :wink:
It occurs to me that I don't know why we would want multiple threads to write to the same page - do we need the exact original interleaving? IIRC we split the event stream per thread anyway.
It would also be more efficient for every event page to contain the thread ID once, rather than wasting space on it in each event in the page.
Event pages
If each thread has its own "active event page", only running out of space in it requires any synchronization, and that synchronization can be a single `AtomicUsize` which allocates where in the file pages go.
Once we've reserved the offset of a page we could `posix_fallocate(fd, new_page_offset, PAGE_SIZE)` - IIUC, it doesn't matter what interleaving it takes between threads, and it will monotonically grow the file as needed (e.g. a later page can win the race and then an earlier page's `posix_fallocate` becomes a no-op).
If the `posix_fallocate` call succeeds, we can use `mmap` to switch the page mapping to start at `new_page_offset` instead, write the page header, and go back to writing the event (sketched below).
(EDIT: randomly stumbled upon https://github.com/NixOS/nixpkgs/issues/101490#issuecomment-716164199, which suggests not all filesystems support `fallocate`, and `ftruncate` may need to be used instead - looks like that can also grow the file, but has no monotonic behavior built-in, so we'd need a lock around the call, which is fine by me)
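Something like the following, assuming the `libc` crate on a 64-bit Unix; `GlobalState`, `PAGE_SIZE`, and the asserts standing in for error handling are all sketch-only:

```rust
use std::os::unix::io::RawFd;
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Mutex;

const PAGE_SIZE: u64 = 1 << 20; // assumed page size, just for the sketch

struct GlobalState {
    fd: RawFd,
    // One allocator shared by event pages and non-event pages alike.
    next_page_offset: AtomicU64,
    // Only used by the (rare) non-event pages, see the next sketch.
    non_event_lock: Mutex<()>,
}

impl GlobalState {
    /// Reserve a fresh event page in the file and map it for this thread.
    fn alloc_page(&self) -> *mut u8 {
        let offset = self.next_page_offset.fetch_add(PAGE_SIZE, Ordering::Relaxed);
        unsafe {
            // Grow the file to cover the new page; the interleaving between
            // threads doesn't matter, and if a later page already won the
            // race and grew the file past us, this call is a no-op.
            assert_eq!(
                libc::posix_fallocate(self.fd, offset as i64, PAGE_SIZE as i64),
                0
            );
            // Map just this page; each thread only writes to its own mapping.
            let ptr = libc::mmap(
                std::ptr::null_mut(),
                PAGE_SIZE as usize,
                libc::PROT_READ | libc::PROT_WRITE,
                libc::MAP_SHARED,
                self.fd,
                offset as i64,
            );
            assert_ne!(ptr, libc::MAP_FAILED);
            ptr as *mut u8
        }
    }
}
```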
Non-event pages
For non-event pages we could have them managed globally under a lock, while sharing the common `AtomicUsize` for page offset allocation inside the file. If they're rare enough, we can even avoid `mmap` and just `pwrite` the data into the right position.
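Continuing the same hypothetical `GlobalState` from above, the locked non-event path could be as small as:

```rust
use std::sync::atomic::Ordering;

impl GlobalState {
    /// Non-event pages are rare, so a lock plus `pwrite` is plenty.
    fn write_non_event_page(&self, data: &[u8]) {
        let _guard = self.non_event_lock.lock().unwrap();
        // Claim a range from the same allocator the event pages use.
        let offset = self
            .next_page_offset
            .fetch_add(data.len() as u64, Ordering::Relaxed);
        unsafe {
            // Write straight into the file at the claimed offset, no mmap.
            libc::pwrite(self.fd, data.as_ptr().cast(), data.len(), offset as i64);
        }
    }
}
```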
Overall this is a much simpler design, assuming we never want to avoid doing thread deinterleaving. The main downside AFAICT is that it requires an API split between (names subject to bikeshed; see the sketch after this list):
- `Profiler`: newtype of some `Arc<GlobalState>`
  - no event recording operations, only emitting into non-events pages
- `PerThreadProfiler`: (`!Send` & `!Sync`) wrapper of `Profiler` + thread-specific state
  - stores thread ID on creation, which never becomes invalid thanks to `!Send`/`!Sync`
    - could even allocate thread IDs from `Arc<GlobalState>`, or instead register something from `std::thread` to disallow misuse and error when creating more than one `PerThreadProfiler` per thread per `Profiler`
  - exposes the `Profiler` API as well (`Deref` I guess?)
  - manages its own events page and emits events into it
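A rough shape for that split, with every name a placeholder per the bikeshed caveat:

```rust
use std::marker::PhantomData;
use std::ops::Deref;
use std::sync::Arc;

struct GlobalState {
    // file handle, page-offset allocator, non-event lock, ...
}

/// Process-wide handle: clone freely, emits non-events pages only.
#[derive(Clone)]
struct Profiler(Arc<GlobalState>);

/// Per-thread handle: owns a thread ID and its "active event page".
struct PerThreadProfiler {
    profiler: Profiler,
    thread_id: u32,
    // current event page pointer/offset would live here
    _not_send_sync: PhantomData<*mut ()>, // opts out of Send + Sync
}

impl PerThreadProfiler {
    fn new(profiler: Profiler) -> Self {
        PerThreadProfiler {
            // A real version might allocate the ID from GlobalState, or
            // error if the thread already has a PerThreadProfiler.
            thread_id: 0,
            profiler,
            _not_send_sync: PhantomData,
        }
    }
}

// Expose the non-event `Profiler` API through the per-thread handle.
impl Deref for PerThreadProfiler {
    type Target = Profiler;
    fn deref(&self) -> &Profiler {
        &self.profiler
    }
}
```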
I think I've been confused before by the lack of a separate `PerThreadProfiler`; there's probably more than one reason to have this kind of explicit design anyway.
cc @wesleywiser