
Improve flushing pagecache on darwin

Open Developer-Ecosystem-Engineering opened this issue 8 months ago • 7 comments

fio currently flushes the page cache via posix_fadvise(POSIX_FADV_DONTNEED), but that is not a valid operation on macOS.

This can skew performance measurements. The change instead flushes via mmap and msync, which better reflects the desired behavior on macOS.
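Roughly, the idea is something like the following (a minimal sketch of the general approach, not the exact patch; error reporting and chunking are elided, and the function name is illustrative):

/*
 * Sketch: drop cached pages for a byte range of an already-open file on
 * macOS by mapping it and asking the kernel to invalidate the cached
 * copies. Note offset must be page aligned for mmap().
 */
#include <errno.h>
#include <sys/types.h>
#include <sys/mman.h>

static int macos_invalidate(int fd, off_t offset, size_t len)
{
        void *p = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, offset);

        if (p == MAP_FAILED)
                return errno;

        /* MS_INVALIDATE invalidates cached pages for the mapped range */
        if (msync(p, len, MS_INVALIDATE) == -1) {
                int err = errno;

                munmap(p, len);
                return err;
        }

        return munmap(p, len) == -1 ? errno : 0;
}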

Thank you for the fast review @axboe, will integrate the requested changes!

The latest commit should address all the open feedback. Thank you very much @sitsofe and @axboe for the feedback. Let us know if there is anything else necessary or desired!

  • Remove debug print
  • Make all comments a single *
  • Use /* */ instead of //
  • Return instead of exit
  • Ensure errno is populated
  • Direct return instead of using a placeholder
  • Move to APPLE
  • Migrate code to os/os-mac.h since it is about macOS

You should fold it into a single patch and force push it again, it doesn't make a lot of sense to have a broken state in the upstream tree.

axboe avatar Mar 31 '25 19:03 axboe

You should fold it into a single patch and force push it again, it doesn't make a lot of sense to have a broken state in the upstream tree.

Not a problem, taking a peek at that failure real quick. Once everyone is satisfied, will squash and force push.

Squashed all commits and force pushed as requested.

Just out of interest do you see a difference running the following job with and without your change?

./fio --filename=fio.tmp --stonewall --thread --size=1G --filename=fio.tmp --bs=4k \
  --name=create --rw=write \
  --name=cached --rw=read --loops=2 --invalidate=0 \
  --name=invalidated --rw=read --loops=2 --invalidate=1

@sitsofe we'd expect to see the removal of cached reads on macOS.

An example data set: a cached read at 27.2GiB/s-27.2GiB/s (29.2GB/s-29.2GB/s) versus an uncached read at 3290MiB/s-3290MiB/s (3450MB/s-3450MB/s).

Hi @sitsofe,

We have most of what you've requested ready, but we want to give the project full answers to all the open questions, so we'll deliver a new squashed force push with the changes plus answers for the remaining open questions above. It takes a little while for us to work through qualifying the answers; we appreciate the patience.

Hi @Developer-Ecosystem-Engineering:

We have most of what you've requested ready, but we want to give the project full answers to all the open questions, so we'll deliver a new squashed force push with the changes plus answers for the remaining open questions above. It takes a little while for us to work through qualifying the answers; we appreciate the patience.

It's just a bit of a shame the turnaround time is so long because I'd love to see your work go in as soon as possible! It's nice to keep the momentum but I appreciate you need responses from others and that depends on their time.

In the mean time I've put up https://github.com/sitsofe/fio/tree/refs/heads/macos_posix_fadvise which is an idea which builds on your work and moves it into an object of its own. I've made the commit message a bit longer to:

  • Show what's been tested and how things have changed
  • Provide references to old commits that do something similar on other platforms

What do you think of it? If it looks useful please take from it!

Finally, would it be possible for you to use a Real Name in your commits? Currently we have this:

Author:     Developer-Ecosystem-Engineering <65677710+Developer-Ecosystem-Engineering@users.noreply.github.com>
AuthorDate: Wed Mar 12 15:27:29 2025 -0700
Commit:     Developer-Ecosystem-Engineering <65677710+Developer-Ecosystem-Engineering@users.noreply.github.com>
CommitDate: Wed May 14 09:58:58 2025 -0700
[...]
Signed-off-by: [<groupemail>]@apple.com

I've talked to @axboe and he said this:

What I care about is that the identity identifies a person, and:

Developer-Ecosystem-Engineering <65677710+Developer-Ecosystem-Engineering@users.noreply.github.com>

literally means nothing. So yeah, that needs to be a proper [identity] and email.

So could you:

  • Make sure your commits have a real email address (as opposed to the GitHub obfuscated one)
  • Use a Real Name with your commits

(see https://stackoverflow.com/questions/3042437/how-can-i-change-the-commit-author-for-a-single-commit for hints on how to fix up existing commits with new authors)

Thanks!

sitsofe avatar Jun 18 '25 07:06 sitsofe

Hi @sitsofe, we have read through the changes in your branch and it looks like they incorporate all of the feedback and open questions already. You noticed the test failure and the downstream posix_fadvise issue. They also address the feedback about the comments and organization.

It probably makes sense for the project to take your changes since they incorporate all the feedback; we looked at rebasing on top of them with any changes we would have had and there was nothing left to add! It also solves the open feedback on sign-off.

The other open question above was answered and there is one remaining, which I am working on.

If we mmap a 1TByte file fully but don't write to it, will any extra memory be used aside from setting up the TLB?

@Developer-Ecosystem-Engineering:

Just for the record - I never intended to steal your thunder but thank you for blessing my additions. I want to see your idea and work go into fio in some form because it solves a genuine behaviour issue on macOS.

It also solves the open feedback on sign-off.

Not quite. If we're going to go with my branch then I would end up squashing everything into one commit, and it is only fair to credit the person that delivered the solution to the UBC cache invalidation problem I failed to solve several years ago. Please could I have a name and email address so I can add a Suggested-by / Co-developed-by line? If you really want it to be the ecosystem address I will use that...

sitsofe avatar Jul 16 '25 21:07 sitsofe

Hi @sitsofe

We need no thunder, it's OK =) We just want to help too. We'd prefer the group identity if you must use anything, but it's perfectly fine to not have it; it's your code base! If you don't want the GitHub one as it's a bit verbose, our GitHub handle without the dashes is the email address.

Must we unmask the Dread Pirate Roberts? =)

If we mmap a 1TByte file fully but don't write to it, will any extra memory be used aside from setting up the TLB?

When you mmap a 1TByte file without actually writing to it, the primary work is in virtual memory setup. The kernel only modifies the process memory map; no page table entries are created until accesses and the accompanying page faults occur.

These faults occur at access time. If the data is already in the UBC, the process's mapped address range will reuse the resident pages.
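As a rough illustration (a hypothetical standalone test, not part of fio or this patch): mapping a huge sparse file only reserves virtual address space, and a physical page is only brought in when the mapping is first touched and the fault either pulls the page from disk or reuses a UBC-resident one.

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
        const char *path = "/tmp/huge";      /* hypothetical sparse file */
        off_t size = (off_t)1 << 40;         /* 1 TiB of virtual address space */
        int fd = open(path, O_RDWR | O_CREAT, 0644);

        if (fd < 0 || ftruncate(fd, size) < 0)
                return 1;

        char *p = mmap(NULL, (size_t)size, PROT_READ, MAP_SHARED, fd, 0);
        if (p == MAP_FAILED)
                return 1;

        /* No page faults yet, so the resident set has not grown. */
        volatile char c = p[0];              /* first touch faults in a single page */
        (void)c;

        munmap(p, (size_t)size);
        close(fd);
        return 0;
}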

Must we unmask the Dread Pirate Roberts? =)

Gotcha ~~Westley~~ @Developer-Ecosystem-Engineering - your mystique shall continue on.

Re mmap: thanks for the update. I'll see to reworking things a bit Real Soon Now™ and I'll add you to the review so you can see it going past.

sitsofe avatar Jul 18 '25 19:07 sitsofe

I finally found the time to work on this but I have a bunch of things that I want to bring to people's attention. Here’s a summary:

  • Is invalidating all of an otherwise partial page OK?
  • Should readahead be hinted at by job type on macOS?
  • Should the mmap chunk size be something other than 16GiB?
  • There seems to be a bug truncating massive files on macOS 15.6 under APFS (@Developer-Ecosystem-Engineering)

It’s possible to ask fio to invalidate the cache for a range that isn’t page aligned (e.g. by doing ./fio --name=n --offset=2k --size=18k) but the mmap-based shim has to work with a granularity of a page. The Linux posix_fadvise man page says the following about POSIX_FADV_DONTNEED:

Requests to discard partial pages are ignored. It is preferable to preserve needed data than discard unneeded data. If the application requires that data be considered for discarding, then offset and size must be page-aligned.

The Open Group posix_fadvise documentation and the FreeBSD posix_fadvise man page are silent about what happens to partial pages. For the fio macOS posix_fadvise shim I’ve taken the approach of invalidating all of a page that would otherwise only be partially covered by aligning the offset down and the len up. I can change this if people would prefer more Linux-like behaviour.
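Concretely, the rounding works out to something like this (a sketch with illustrative names, not the actual shim code):

#include <sys/types.h>
#include <unistd.h>

static void align_to_pages(off_t *offset, off_t *len)
{
        off_t page = (off_t)sysconf(_SC_PAGESIZE);              /* a power of two */
        off_t start = *offset & ~(page - 1);                    /* align the offset down */
        off_t end = (*offset + *len + page - 1) & ~(page - 1);  /* align the end up */

        *offset = start;
        *len = end - start;
}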

Since a large mmap doesn’t really cost that much more memory, I figured I would set the mmap chunk size to be larger so that fewer mmap/msync/munmap calls are made, thus reducing the overhead of setting things up and tearing them down repeatedly. Here are the results of my testing:

macOS 15.6 Intel (2GHz i5) msync(MS_INVALIDATE)

| mmap chunk size (GiB) | 0.5 | 1 | 8 | 16 | 128 | 1024 |
| --- | --- | --- | --- | --- | --- | --- |
| linear read | **0.3710356** | 0.4872094 | 0.3823824 | 0.3986368 | 0.5639938 | 1.9420478 |
| rand read | **10.265224** | 13.656658 | 11.046206 | 11.286724 | 11.46629 | 11.436732 |
| gappy | **7.854328** | 10.5910972 | 9.5441084 | 9.5344792 | 9.5729838 | 9.5502052 |
| near empty | 0.0213032 | 0.0185342 | 0.0037254 | 0.0018804 | 0.0002896 | **0.000118** |

macOS 15.6 M4 Pro msync(MS_INVALIDATE)

| mmap chunk size (GiB) | 0.5 | 1 | 8 | 16 | 128 | 1024 |
| --- | --- | --- | --- | --- | --- | --- |
| linear read | 0.0514746 | 0.0448044 | **0.0401286** | 0.0419396 | 0.0646114 | 0.2266038 |
| rand read | 1.3273664 | 1.3098192 | 1.2935882 | 1.3060148 | **1.2855568** | 1.3004732 |
| gappy | 1.1701312 | 1.1473312 | 1.1202244 | **1.1186982** | 1.1388276 | 1.1316176 |
| near empty | 0.0251256 | 0.0101194 | 0.0011742 | 0.0009186 | 0.000453 | **0.000403** |

macOS 15.6 M4 Pro Rosetta msync(MS_INVALIDATE)

| mmap chunk size (GiB) | 0.5 | 1 | 8 | 16 | 128 | 1024 |
| --- | --- | --- | --- | --- | --- | --- |
| linear read | 0.0592442 | 0.0535576 | **0.0443478** | 0.0457978 | 0.0666314 | 0.2284822 |
| rand read | 1.3480976 | 1.333327 | 1.3117098 | 1.3103916 | 1.3127036 | **1.2981126** |
| gappy | 1.1752522 | 1.1550404 | 1.1484232 | 1.1360098 | **1.134033** | 1.143597 |
| near empty | 0.028696 | 0.0207544 | 0.0085264 | 0.0080604 | 0.0063026 | **0.006159** |

Linux AMD EPYC 7742 (2.2Ghz) posix_fadvise([…], POSIX_FADV_DONTNEED)

| Fill I/O pattern | Time (s) |
| --- | --- |
| linear read | 1.6783038 |
| rand read | 4.6355044 |
| gappy | 4.0741604 |
| near empty | 0.0000204 |

The additional patch that logs the time taken to do invalidation (0001-DRAFT-add-cache-invalidation-logging.patch) and a script to do the benchmarking (bench.sh) are attached. Results are the time it took to invalidate the page cache after performing 8 GBytes of 4 KByte sized reads in a variety of patterns/situations against a 1 TByte sparse file (and with different mmap chunk sizes on macOS). Where there are multiple results in a row the quickest time is highlighted in bold.

Summary: the M4 Pro Mac results are extremely quick, the Intel Mac was in last place but performed best with a small mmap chunk size, and using macOS’ Rosetta doesn’t slow things down by that much. To my surprise, it's only when the page cache is empty that a larger chunk size performs best! On the Apple Silicon Mac, a chunk size of 16 GiB took a tiny bit less time than 1 GiB in the near empty page cache case (which I would hope is the most common scenario) and didn't do badly in the other cases, so I’ve gone with that, but I’m open to suggestions (see the sketch below for what the chunking amounts to).
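For reference, the chunked walk boils down to something like this (a sketch with illustrative names; the real code also has to handle errors and the page alignment discussed above):

#include <sys/types.h>
#include <sys/mman.h>

#define INVALIDATE_CHUNK ((off_t)16 << 30)   /* 16 GiB, per the measurements above */

static int invalidate_chunked(int fd, off_t offset, off_t len)
{
        /* One mmap/msync/munmap round trip per chunk rather than per page. */
        while (len > 0) {
                off_t this_len = len > INVALIDATE_CHUNK ? INVALIDATE_CHUNK : len;
                void *p = mmap(NULL, (size_t)this_len, PROT_READ, MAP_SHARED, fd, offset);

                if (p == MAP_FAILED)
                        return -1;
                if (msync(p, (size_t)this_len, MS_INVALIDATE) == -1) {
                        munmap(p, (size_t)this_len);
                        return -1;
                }
                munmap(p, (size_t)this_len);
                offset += this_len;
                len -= this_len;
        }
        return 0;
}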

While doing the aforementioned tuning, I was shocked to find that macOS can take more than a second to invalidate 8 GBytes of cache loaded randomly. I didn’t remember dropping the page cache on Linux being that slow, so I tried the same test on that OS and found it took even longer!

Unfortunately, time spent invalidating the cache is counted as part of an fio job’s runtime: if you have a job whose runtime is 1 second but invalidating the cache takes 2 seconds, then only a single I/O is done and the reported bandwidth is very low because most of the job’s time was spent invalidating the cache. Fixing this is hard because invalidation happens when a file is opened and, further, each time a loop is completed the file is re-opened. For now I’ve just added an ETA status for invalidation so it’s a little more visible, although the status is only updated 3 seconds after the first job starts, so if the initial invalidation takes less time than that the ETA status won’t show it. I can drop the ETA change if people don't feel it's useful.

macOS’s default readahead seems to be able to turn itself off fairly rapidly. I found that only random/non-sequential jobs that trick readahead into getting a few hits before seeking to a new location show better performance with readahead turned off via fadvise_hint=random. Do people still want the readahead turned off by default on macOS random workloads?

There is a 100% reproducible bug on at least macOS 15.6 where if a large sparse file is created (e.g. 1 Petabyte big), 16 random reads are done from it and the file truncated to 1 Terabyte, then truncation will spin a CPU indefinitely. When in this state, the truncation can’t be aborted/killed even during system shut down (when starting back up macOS asks the user to report an issue). Trying the same test on a file on Linux under XFS (because ext4 limits file sizes to 16 Terabytes) does not show the same behaviour - truncation finishes after a few seconds. The following reproduces the problem on both an Intel Mac and an M4 Mac:

rm -f /tmp/huge; truncate -s 1p /tmp/huge
./fio --rw=randread --number_ios=16 --filename /tmp/huge --norandommap --name=n
truncate -s 1T /tmp/huge

@Developer-Ecosystem-Engineering can you reproduce the problem too?

sitsofe avatar Sep 03 '25 20:09 sitsofe

Apart from that one patch adding the invalidating state, this looks fine to me. Maybe we just defer that decision to later and you drop it for now? Or if you think it really needs to be there, let's hear the reasoning :-)

axboe avatar Sep 03 '25 23:09 axboe

There is a 100% reproducible bug on at least macOS 15.6 where if a large sparse file is created (e.g. 1 Petabyte big), 16 random reads are done from it and the file truncated to 1 Terabyte, then truncation will spin a CPU indefinitely. When in this state, the truncation can’t be aborted/killed even during system shut down (when starting back up macOS asks the user to report an issue).

We reproduced it! Tracking with 159795525.

@axboe:

Apart from that one patch adding the invalidating state, this looks fine to me. Maybe we just defer that decision to later and you drop it for now? Or if you think it really needs to be there, let's hear the reasoning :-)

I've become quite attached to that patch while testing these changes :-) However, my sentimentality isn't enough that it has to go in now, and given how rarely it shows on the ETA let's drop the patch here (I'll keep it stashed on a branch somewhere).

@Developer-Ecosystem-Engineering:

We reproduced it! Tracking with 159795525.

Thank you!

sitsofe avatar Sep 04 '25 07:09 sitsofe

A quick check @Developer-Ecosystem-Engineering, @axboe: do you feel this PR needs further changes?

sitsofe avatar Sep 05 '25 21:09 sitsofe

Thanks everyone, now merged.

axboe avatar Sep 06 '25 00:09 axboe