
Mounting btrfs with atime, bad idea?

TheEvilSkeleton opened this issue 3 years ago · 40 comments

There was this post from 2012 I read a couple of weeks ago called Atime and btrfs: a bad combination?, describing btrfs's issues when mounted with something like relatime.

It was an interesting read, but I'd like to ask whether the issues still hold true to this day, and whether mounting btrfs with atime enabled is a bad idea or not.

I found two relevant sections in the btrfs wiki so far:

  1. https://btrfs.wiki.kernel.org/index.php/Manpage/btrfs(5)#NOTES_ON_GENERIC_MOUNT_OPTIONS
  2. https://btrfs.wiki.kernel.org/index.php/FAQ#Why_I_experience_poor_performance_during_file_access_on_filesystem.3F

TheEvilSkeleton avatar Jun 16 '21 12:06 TheEvilSkeleton

I would not say the combination of snapshots and atime is "bad", since the combination is required by some use cases and the costs are reasonable on some combinations of workload and storage hardware; however, if the user is not committed to spending a comparatively large amount of time, space, and IO on both snapshots and atime updates, and managing the extra space requirements for the combination of the two, then they should use noatime or avoid using snapshots. Because the costs are relatively high for atime and the benefits relatively low (and the users who require those benefits relatively rare), I'd recommend making noatime the default unless the user has provided some hint otherwise (e.g. by installing a package known to depend strongly on atime).

The issues are pretty fundamental to both snapshots and atime. XFS and LVM snapshots with other filesystems have similar problems. In all cases, running out of space during CoW is bad enough to at least halt all further writes. btrfs can throw errors on individual write operations before space is completely exhausted, while LVM requires active monitoring of snapshot LV usage to avoid failures that can only be detected after write commands are issued. btrfs might be forced into a read-only state that can only be recovered by umount and mount, while LVM throws IO errors on writes to the LV after free space is exhausted. LVM must either reject all writes to original and snapshot LV (forcing both to be umounted), or allow writes to the origin LV to continue while the snapshot LV becomes unrecoverably broken. I haven't tested XFS's out-of-space behavior, but it looks very similar to the LVM case structurally, so I would not expect substantively different results. We can say "btrfs is quantitatively better than the XFS/LVM+other filesystem experience" here, but most users don't have the worse experience to compare btrfs to, and may not be expecting to have to cope with CoW space exhaustion issues at all.

Snapshots on btrfs are lazy reflink copies, which have different cost tradeoffs than block-level snapshots. Snapshots on btrfs are extremely cheap if you never modify the shared pages before deleting the snapshot, but they can be extremely expensive if you modify all of the shared pages (most use cases fall somewhere in between). A full reflink copy (e.g. cp -a --reflink=always) must create parent references for every individual extent item referenced in the new tree, so a reflink copy of a subvol will create slightly more new metadata pages in total than were present in the original subvol's metadata tree. A snapshot has the same worst-case IO and storage cost as a full reflink copy, but those costs are spread out over time, and hidden in other write operations that modify metadata pages. atime updates are one of those other write operations.
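For concreteness, the two operations contrasted above look roughly like this on the command line (the paths are placeholders):

    # snapshot: a lazy reflink copy; near-free now, metadata pages copied later as the trees diverge
    btrfs subvolume snapshot /data /data/.snapshot-1

    # full reflink copy: data extents are shared, but all of the new metadata is written immediately
    cp -a --reflink=always /data/project /data/project.copy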

A snapshot can be fully converted to a reflink copy when as little as 1% of its content is modified, so a recursive grep or even a find with any form of atime update enabled can force such a conversion. IO latency and metadata space costs must be paid at the time the copied pages are made, which will be during or shortly after the read access operation (there is also a second extra cost to delete the copied metadata pages when the snapshot is deleted). btrfs's cost model is different from XFS and LVM here: on LVM, the original subvol and snapshot are structured such that the snapshot can be trivially deleted at any time without (non-constant) cost, while the original subvol is very expensive to remove (other than the special case where both original and snapshot volumes are deleted). On btrfs there is no such distinction between original and snapshot subvol, so the deletion cost of either subvol is proportional to how much reflink conversion has occurred. On XFS the deletion costs are equivalent to btrfs, but the scalars for XFS are two orders of magnitude smaller than btrfs due to the much larger size of btrfs metadata. An XFS snapshot is not a lazy reflink copy (it is literally a reflink copy, all costs paid at snapshot creation time), so atime updates on XFS don't incur additional costs after the snapshot was created as they do on btrfs.

It may be much less convenient for the user to have IO latency and metadata space usage elevated during read-intensive times (e.g. while the user is working with the machine interactively) compared to the snapshot creation time (i.e. during overnight maintenance windows or at non-peak service times). It may also be inconvenient for the user to increase the time required to free space by deleting snapshots (making it take longer to recover their space when needed).

relatime may make this effect worse for some users, by batching up a lot of atime updates at close to the same time each day (e.g. when the user logs in at the start of the work day, triggering the once-every-24-hours update behavior of relatime). On the other hand, if the user is making more than one snapshot per day, then relatime will limit the reflink conversion effect to only one of the snapshots each day.

The impact of all this is relative. Metadata updates on random pages are extremely expensive on big, slow spinners, and not at all expensive on small, fast NVME devices. All writers of the filesystem will be blocked while the metadata updates are committed in a transaction, and the memory pressure from blocked writers can block readers and even processes that aren't touching the filesystem (i.e. those allocating memory or referencing lazily mapped memory). This can result in the entire system freezing up for some time--from a fraction of a second on fresh installs to NVME devices, to several minutes on larger and older filesystems on slower spinning disks. This is true of any large metadata update--atime is just one special and easily avoidable case. A similar effect can occur when running git clean on a snapshotted source tree, or chmod -R on a big tree, but these are not as surprising when triggered by write operations as they are when triggered by read operations.

"ENOSPC on read" mentioned in the LWN article is a real outcome but not entirely accurate. Reads don't fail due to lack of space, but the atime updates that succeed will consume metadata space if there is a snapshot, and in the extreme case, metadata growth will fill up the filesystem so no further allocations are possible. Once that point is reached, the only possible writes are those that reduce metadata usage or increase available space: deleting or truncating unshared files, removing entire snapshots, adding disks, or resizing devices larger. Every other write operation decreases available metadata space, and is not possible when no remaining space is available for metadata. Normally this is not a problem, because btrfs allocates space for metadata generously. Usually, as long as metadata is never balanced, there will be enough space for common metadata expansion cases; however, there are still exceptional cases, especially on filesystems where raid profile conversions, device resizes, or metadata balances have deviated from the usual space allocation patterns.

Some of the bad effects can be reduced by creative scheduling: e.g. make a snapshot for backups at 3 AM, run plocate at 4 AM, and by 8 AM all of the atime-related snapshot conversion costs are fully paid, so the machine is ready to be used without unexpected IO multiplications. plocate must never scan the snapshots or the number of snapshots must be limited; otherwise, plocate will take more than 4 hours to run. If metadata grows slowly and continuously over time, it is less likely to be a problem than rapid ad-hoc growth.
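A rough crontab sketch of that kind of schedule (the snapshot path and the use of plocate's updatedb are illustrative assumptions):

    # take the backup snapshot at 3 AM, pay the atime/reflink-conversion cost at 4 AM
    0 3 * * *  btrfs subvolume snapshot -r / /snapshots/root-$(date +\%F)
    0 4 * * *  updatedb   # plocate's indexer; set PRUNEPATHS in updatedb.conf so it never scans /snapshots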

The opposite case (i.e. worst-case creative scheduling) would be e.g. running find / for the first time on a large, full filesystem with a snapshot, when no process has ever walked the filesystem tree before, while atime or relatime is enabled. This could force the filesystem into no free metadata space with a sudden new demand for metadata space that had never been needed before.

Zygo avatar Jun 17 '21 04:06 Zygo

With the strictatime / relatime concerns mentioned in this issue, does this likely make the lazytime option a worse choice? ArchWiki has some good info about it (depending on the trade-offs):

lazytime reduces writes to disk by maintaining changes to inode timestamps (access, modification and creation times) only in memory. The on-disk timestamps are updated only when either:

  1. The file inode needs to be updated for some change unrelated to file timestamps
  2. A sync to disk occurs
  3. An undeleted inode is evicted from memory
  4. More than 24 hours have passed since the last time the in-memory copy was written to disk.

Warning: In the event of a system crash, the access and modification times on disk might be out of date by up to 24 hours. Note: The lazytime option works in combination with the aforementioned *atime options, not as an alternative. That is relatime by default, but it can even be strictatime with the same or lower cost in disk writes than plain relatime.
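For illustration, the combination described above would be expressed in /etc/fstab roughly like this (device and subvolume are placeholders):

    # lazytime modifies the *atime policy it is paired with, it does not replace it
    UUID=xxxx-xxxx  /  btrfs  subvol=@,relatime,lazytime  0  0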

That might not suffer the same bulk relatime update (potentially updating atime on first access and then possibly again 24 hours later if there are any subsequent accesses), depending on context, though the bulk would probably still be applied at that 24-hour window or during a sync that applies to all affected files (eg shutting down?).

As it also affects mtime and ctime, would that further strain the I/O concerns, or would their inclusion be minimal?

As long as no new snapshots are created in-between, it'd seem that depending on activity it might be more staggered of an operation in practice?

polarathene avatar Jun 17 '21 08:06 polarathene

Some other filesystems separate inodes from file contents by some physical distance, so lazytime can prevent an expensive head movement or a write to a separate flash zone between data block flushes. Other filesystems that write data in-place usually don't have requirements to update inode metadata other than timestamps when only file contents are changed.

On btrfs, file contents are described in reference items that are interleaved with the inodes in metadata pages. When file contents change in a normal (datacow) file, the data is stored in a new location and the file's metadata is updated to point to that location. In that process, the inode is usually written out to disk because it is located in the same metadata block, so there is very little additional cost to update the timestamps on disk and therefore very little saving by not updating the timestamps. When a snapshot is present, every file is datacow because all blocks are shared between the original subvol and snapshot (even nodatacow files). Thus, lazytime does not save very much work compared to relatime (though it will save a little because sometimes the inode is in a different metadata page from the data references, e.g. if an inode happens to be one of the ~1% of all filesystem items that occupies the last slot on a metadata page, then all of the inode's data reference items are necessarily on a separate page where lazytime could provide some benefit. Also an unshared nodatacow file has no need to update anything but the data blocks so lazytime can help those as well).
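(As an aside, whether a particular file is nodatacow can be checked with lsattr; the C attribute only takes effect if it is set while the file is still empty. The path below is hypothetical:)

    touch /var/lib/images/vm.img
    chattr +C /var/lib/images/vm.img   # nodatacow; must be set while the file is empty
    lsattr /var/lib/images/vm.img      # 'C' appears among the attribute flags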

lazytime reduces write loads from all timestamp updates compared to cases where it is not used, though it may not reduce very much in common cases on btrfs. It wasn't part of the original discussion. The original discussion contains concerns about the adverse effects of dropping atime writes or having any non-default mount option. There is far more existing software that is adversely affected by dropping mtime and ctime writes than atime, and given the stated preference for default options, lazytime would be a step in the wrong direction for an installer default.

Zygo avatar Jun 17 '21 15:06 Zygo

There is far more existing software that is adversely affected by dropping mtime and ctime writes than atime

lazytime isn't dropping mtime and ctime like noatime does for atime...?

It's paired with strictatime or relatime (or with the default if neither is specified); the changes to atime, mtime, and ctime are more accurate but remain in memory instead of being flushed to disk (until one of the conditions that would do so is met).

I don't follow with how that negatively impacts software, other than the warning scenario of power loss or crash failing to persist any in-memory updates.

The original discussion contains concerns about the adverse effects of dropping atime writes or having any non-default mount option.

While lazytime does affect mtime and ctime, it still impacts atime updates differently than strictatime or relatime options without lazytime? Seems relevant to the discussion. I only bring up mtime and ctime additions as lazytime additionally affects those too, and I wasn't sure if lazytime would be appropriate for BTRFS vs just relatime by itself, if choosing not to use noatime.

Since you seem to have good insights into the impact, this seemed like a relevant related question that someone coming across this issue may also have thought about.


On btrfs, file contents are described in reference items that are interleaved with the inodes in metadata pages. When file contents change in a normal (datacow) file, the data is stored in a new location and the file's metadata is updated to point to that location. In that process, the inode is usually written out to disk because it is located in the same metadata block, so there is very little additional cost to update the timestamps on disk and therefore very little saving by not updating the timestamps.

Is this referring to a file write operation and thus stating that there is minimal benefit to lazytime because mtime/ctime would be updated anyway?

I was referring to reads with atime. If the IO operation is going to trigger one of the sync timestamps to disk conditions, I'm assuming that with write operations it's less of a concern. But this issue was about read operations causing writes, and the impact of that risking notable disk usage increase as shared extents from a snapshot become unshared due to writes for new atime right?

Thus, lazytime does not save very much work compared to relatime

My understanding of the difference was in timing of when atime updates were persisted to disk.

relatime can update atime on first file access straight away to disk, and then delay further atime updates until 24 hours pass?

lazytime (when paired with relatime) would not update to disk like relatime could do (if the condition is met), but would instead keep that change in-memory until one of the lazytime conditions is met. That reduces/delays updates to disk until relevant, but keeps the atime change available in memory for other software to be aware of, as if it was the atime for that file on disk no?

Depending on system usage, some files' atime may be updated on disk within that 24-hour window earlier, unlike relatime AFAIK? Or have I misunderstood, and even in read-only situations which this issue discusses, lazytime is not providing much benefit over relatime either? (while the mtime and ctime updates also being handled differently isn't of any concern with lazytime and BTRFS?)


lazytime reduces write loads from all timestamp updates compared to cases where it is not used, though it may not reduce very much in common cases on btrfs.

That's good to know, I'd just like clarification about the perspective on reads too vs relatime without lazytime.

Presumably there is also benefit for those caring about atime to pair strictatime with lazytime, which would otherwise from my understanding be less desirable than relatime on BTRFS?

polarathene avatar Jun 18 '21 02:06 polarathene

OK, now that I've read the actual code, I see that lazytime is not the feature I was thinking of (the "turn mtime completely off" feature for Ceph and similar upper storage layers that have no need for mtime/ctime at all).

lazytime changes the timestamp update function so it sets the I_DIRTY_TIME bit but not the I_DIRTY_SYNC bit on timestamp updates. Inodes are written to disk if I_DIRTY_SYNC is set but not I_DIRTY_TIME. The last iput (i.e. when the file is closed by the last process to have it open) checks for I_DIRTY_TIME and flushes the inode to disk if it is set when the inode is closed (and not deleted). sync and fsync turn I_DIRTY_TIME into I_DIRTY_SYNC. umount is only possible after every inode has been released by iput. That covers the 4 cases in the descriptions of the feature's behavior.

So lazytime only reduces the number of writes for files that are held open for a long time (e.g. database/VM image files, or binaries in /usr if they fault in new pages while running). Otherwise, it only delays the inode write until the file is closed. There is apparently also a sysctl which can be used to force inode updates to happen more often when lazytime is enabled.
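(If I recall correctly, that knob is vm.dirtytime_expire_seconds; treat the exact name and default as something to verify on your kernel:)

    sysctl vm.dirtytime_expire_seconds          # how old a timestamp-only dirty inode may get before writeback
    sysctl -w vm.dirtytime_expire_seconds=3600  # e.g. force timestamp writeback roughly hourly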

In the find and grep cases, most of the opened files are immediately closed, so the effect of lazytime is negligible on inode updates--these tools will follow whatever policy was set by strictatime, relatime, nodiratime, or noatime. The total number of writes will only be reduced if the grep is still reading a file while the writeback timer expires--lazytime would drop an inode update in that case, since it won't mark the inode dirty before the file is closed.

lazytime will adversely affect ctime/mtime timestamp accuracy on disk, because it separates timestamp updates from other updates at the VFS level, above the filesystem. So lazytime can lead to a situation where the inode's data contents are flushed and persisted on disk by btrfs, but the VFS layer didn't trigger an inode update at the same time, so the timestamp in the inode on disk is much older than the data. For ext4 this is excusable, since ext4 declines to make file content consistency guarantees after a crash; however, btrfs datacow files are expected to be consistent after a crash (the file contents and metadata are to be consistent as of the latest transaction commit or fsync, all modifications after that point have no effect on the file metadata or contents after the crash), so this could adversely affect workloads on btrfs that expect the post-crash consistency.

Zygo avatar Jun 18 '21 05:06 Zygo

In the find and grep cases, most of the opened files are immediately closed, so the effect of lazytime is negligible on inode updates--these tools will follow whatever policy was set by strictatime, relatime, nodiratime, or noatime.

Ah.. so I've misunderstood (and perhaps many other less experienced users in this area from what I can see online), the atime update in the cited cases where you might do a large read-only operation that causes a bulk metadata update and breaks sharing extents.. is going to update to disk rather quickly?

I was under the impression the update would be delayed and remain in the file cache (page cache?) in RAM (provided there is sufficient memory spare), but that's not what's happening :disappointed:

Your description doesn't make it sound useful to a desktop or workstation user then? It doesn't really act as an in-memory cache for a meaningful amount of time for the majority of files where atime updates are the concern..

btrfs datacow files are expected to be consistent after a crash (the file contents and metadata are to be consistent as of the latest transaction commit or fsync, all modifications after that point have no effect on the file metadata or contents after the crash), so this could adversely affect workloads on btrfs that expect the post-crash consistency.

Making lazytime ill-advised for BTRFS?

So lazytime only reduces the number of writes for files that are held open for a long time (e.g. database/VM image files, or binaries in /usr if they fault in new pages while running).

Database and VM would often be using nodatacow, but as you mentioned can still benefit from snapshots. I'm not really following how it helps reduce the atime updates to disk here compared to just relatime (unless it does delay that potential update on first read?), it's not due to mtime is it? (I thought I had read that doesn't update until a file is closed as well)


There was a discussion with the Fedora Working Group regarding defaulting to noatime for BTRFS, where Lennart of systemd chimed in about atime being relevant for systemd-tmpfiles to function reliably/optimally; oddly, he was encouraging strictatime over relatime for that specific use case.

While those involved were mostly in favor initially of noatime they decided by the end to retain the relatime default. They do note that they don't presently make much use of snapshots with their default BTRFS install, and some don't seem to think relatime pragmatically causes much concern.

They do care about responsiveness though, and may not have the same insights that @Zygo has expressed here with I/O pressure? :sweat_smile:


So to sum up the discussion..

noatime is likely preferable unless you know you need atime for something despite the trade-off.

noatime avoids running into some surprises if a large number of files are accessed and the bulk of them update atime:

  • Reading from a BTRFS filesystem as the source and copying to another filesystem such as an external disk can negatively impact I/O performance? (reference: BTRFS requires noatime (2017))
  • Performing a read-only operation such as a file transfer or searching through many files (eg grep or find on / or some other location with N files) can break the sharing of a snapshot's extents for metadata due to CoW, likely increasing disk usage unintentionally from the user's perspective.
  • The disk usage issue compounds with the frequency of snapshots, assuming similar read-heavy operations occur between them. This can muddy the atime updates in the snapshots with data that may be valued in the snapshots, requiring extra effort to remedy if wanting to retain snapshots while freeing up space by removing the atime metadata updates? (For most software where atime actually matters, the history in snapshots isn't likely to be relevant to retain.)

If choosing to ignore the concerns that noatime prevents because atime is important to the user (systemd-tmpfiles may be the most relevant beyond more niche software for a desktop/workstation user), they can use:

  • relatime: Ideal for minimizing atime updates, which is useful if taking regular snapshots (daily, hourly?), as it avoids the compounding effect between snapshots. While the number of atime writes from subsequent reads is minimized (eg recurrent reads for: a Node.js project with a large node_modules directory, a Docker image build with many layers, apps/plugins that scan through project files, etc.), a delayed batch write to disk may be less convenient UX-wise depending on the timing.
  • strictatime avoids the delayed writes, but all reads incur writes regardless. If snapshots occur regularly, the negative impact on disk usage will be more pronounced, and each snapshot incurs the CoW cost on metadata for atime (unlike relatime AFAIK?).
  • lazytime paired with either isn't likely to be that beneficial in practice, and its trade-off when a power loss or crash event occurs adds more risk to data on BTRFS than it does on other filesystems?

TL;DR

noatime is likely preferable unless you know you need atime for something despite the trade-off.

  • noatime avoids unexpected disk usage allocations when paired with snapshots from read operations, as well as potentially poor I/O performance due to each file read incurring a write.
  • If you need atime updates:
    • relatime for systems with frequent snapshots.
    • strictatime with minimal to no snapshot usage has less I/O pressure than relatime.
    • lazytime should be avoided, it risks data consistency in power loss or crash event.

polarathene avatar Jun 18 '21 14:06 polarathene

Ah.. so I've misunderstood (and perhaps many other less experienced users in this area from what I can see online), the atime update in the cited cases where you might do a large read-only operation that causes a bulk metadata update and breaks sharing extents.. is going to update to disk rather quickly?

strictatime marks the inode dirty during each read() call. lazytime marks the inode specially, so the inode only becomes dirty when it is closed after a read (the inode may become dirty for other reasons--if that happens, the atime is flushed with the rest of the inode). relatime marks the inode dirty during a read() call, but only if this would change the order of atime vs mtime/ctime, or if the old atime is 24+ hours old. noatime doesn't change anything about an inode during a read().
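(A quick way to see which policy is actually in effect on a given mount, using any file on it as a probe; the path is a placeholder:)

    f=/mnt/data/somefile
    stat -c 'atime: %x' "$f"
    cat "$f" > /dev/null
    stat -c 'atime: %x' "$f"
    # noatime: no change; strictatime: advances on every read;
    # relatime: advances only if atime was older than mtime/ctime or more than 24h old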

Once an inode is marked dirty, the VFS layer may direct the filesystem to flush the inode to disk at any time. This doesn't necessarily happen immediately, the inode just becomes eligible to be flushed. A dirty inode can't be simply discarded when no process has the inode's file open and memory runs low, so when memory runs low, memory allocations must be delayed until flushes are completed. Normal writeback is based on a timer, but umounting the filesystem, running low on memory, calling fsync(), etc will make it happen sooner.

If the user hasn't changed the writeback expire timeouts, we can assume any dirty inode is on its way to disk within 30 seconds or so. That's not always the case, e.g. laptop mode can set the writeback time to minutes or hours, if the user prefers longer battery runtime over updating trivial data.
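(On kernels I've looked at, those writeback timeouts are the following sysctls, expressed in centiseconds:)

    sysctl vm.dirty_writeback_centisecs   # how often the flusher threads wake up (default 500 = 5s)
    sysctl vm.dirty_expire_centisecs      # how old dirty data must be before writeback (default 3000 = 30s)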

I'm not really following how [lazytime] helps reduce the atime updates to disk here compared to just relatime

lazytime doesn't flush the inode for an atime update while the file is open--the atime update happens when the last process closes the file. If one process holds the file open while other processes open and close, then the atime updates won't trigger a flush to disk.

The disk usage issue compounds with frequency of snapshots assuming similar read heavy operations occur between them. This can muddy the atime updates in the snapshots with data that may be valued in the snapshots, requiring extra effort to remedy if wanting to retain snapshots while freeing up space by removing the atime metadata updates?

I'm not sure what the question is here. Snapshots act like independent filesystems. If you read the files in one subvol, it doesn't affect the atimes of files in any other subvol. It just takes up more space on disk, because you have two copies of the tree with different timestamps on the inodes. The atime updates can't be deleted separately from the snapshot that holds them.

e.g. 1: if you make hourly snapshots and read files continuously in the original subvol, and you mounted with strictatime, then you'll have one snapshot where the atimes are from 3 to 4 o'clock, another snapshot where the atimes are between 4 and 5 o'clock, a third snapshot where the atimes are from 5 to 6 o'clock...each will be a mostly complete copy of the original subvol's metadata, with different timestamp values on all the inodes.

e.g. 2: If you have hourly snapshots with relatime and read all the files every minute, then you'll have 23 snapshots with shared metadata (all the same inode timestamps), then one snapshot where all the atimes are updated for that 24-hour period (a copy of most metadata), then 23 more sharing space with the 24th. Every 24 hours there will be a new set of atime timestamps, so a new full set of metadata, then 23 smaller snapshots that share metadata with the first.

systemd-tmpfiles may the most relevant beyond more niche software for a desktop/workstation user

For tmpfiles and automatic file reaping you should not use a snapshot. The snapshot will prevent the file reaper from freeing any space (reaping activity will allocate more space instead as nothing is freed, and metadata pages are CoW-modified).

For a tmp reaper there is not much practical difference between ctime and atime for selecting old files. If you need a file to persist for a long time, why is it in /tmp? If you don't need long persistence, then atime extending the persistence doesn't matter. The main exceptions to this rule are authentication tokens, like kerberos ticket files or .Xauthority, which might not be stored in $HOME for various reasons. Reaping those while they're still in use would be bad.

lazytime paired with either isn't likely to be that beneficial in practice and it's trade-off when a power loss or crash event occurs adds more risk on data with BTRFS than it does other filesystems?

An example of where this matters: suppose you have a build server which checks out a directory, runs a big recursive make. The server is set up with the following constraints:

  1. btrfs filesystem mounted -o flushoncommit
  2. good power management, stops CPU and PCI before RAM (so no corruption can escape during power failure)
  3. good drive firmware, no write reordering bugs or reneging on write ACKs on power failure
  4. the build tools and Makefile rules handle SIGKILL well
  5. all dependencies in the Makefiles are correct

then the following events should have identical effects on the content of the build directory:

  1. at some random time T during the build, kill all build-related processes simultaneously with SIGKILL.
  2. at some random time a few seconds later than T, kill power to the machine.

In both cases, it should be possible to restart the build with 'make' again, and always produce correct build output. This should remain true even if source files are modified. If precondition 3 fails then an error shall be detected and reported to the build process.

With lazytime, the timestamps may not be updated at the same times as the file contents, so scenario 1 and 2 may produce different results. (They'll produce different results every time because the results are nondeterministic; however, the set of results from event 2 should contain no result that cannot also be obtained from event 1).

For comparison, ext4 doesn't meet precondition 1, so after a power failure the build directory is almost always garbage, and the build must be restarted from git clean. Or you have to run ext4 in sync mode to make this work, and get very bad performance.

Now there are a lot of preconditions to be met before we can consider this example relevant. As a practical matter, I'd never believe that precondition 5 had been met unless the build project was small enough to audit personally, so I'd always run git clean -dfx after either a power failure event or a SIGKILL event even though theoretically I don't need to. Also a lot of legacy / proprietary code fails precondition 4, which means SIGKILL breaks the build directory even if nothing goes wrong with the filesystem.

btrfs does have cases where it's at least possible to resume a build after a power failure, and lazytime breaks those.

Zygo avatar Jun 19 '21 06:06 Zygo

lazytime marks the inode specially, so the inode only becomes dirty when it is closed after a read (the inode may become dirty for other reasons--if that happens, the atime is flushed with the rest of the inode).

The quote you were responding to with this was specifically regarding my misunderstanding regarding lazytime with reads impact on atime updates.

Apologies, I see I wasn't particularly clear about that in the quoted paragraph and just assumed we'd be on the same page about discussing lazytime which I was responding to.


lazytime doesn't flush the inode for an atime update while the file is open--the atime update happens when the last process closes the file. If one process holds the file open while other processes open and close, then the atime updates won't trigger a flush to disk.

You covered the VM or database type of data as an example for when a file is likely open for a longer duration.

The quote you were responding to was asking what tangible benefit was lazytime adding for atime updates with scenarios like find / opening and closing a large amount of files with reads, which you noted earlier as closing the file triggering the write a lot sooner than I and perhaps others thought.

You clarified that the inode is marked as dirty and from that point is written to disk when a condition is met (dirty time expiring, low memory, etc). Provided nothing else is keeping the file open, lazytime is not reducing atime updates to disk that much more than relatime alone would, given the find / bulk read example?

Your descriptions of relatime vs lazytime seem to convey that relatime will update atime upon read if appropriate, while lazytime would delay that disk write. Is this what you meant by lazytime potentially reducing writes for atime with reads? (even if the delay may not offset a write by much longer)


I'm not sure what the question is here.

I was seeking confirmation if the statement was correct. I tried to be terse as our responses are already a tad verbose :sweat_smile:

Snapshots act like independent filesystems. If you read the files in one subvol, it doesn't affect the atimes of files in any other subvol. It just takes up more space on disk, because you have two copies of the tree with different timestamps on the inodes.

I understood this.

The atime updates can't be deleted separately from the snapshot that holds them.

I was trying to highlight this issue, and that dealing with it after it has polluted snapshots would be frustrating to work around. One could presumably recreate the snapshots or apply some other workaround after the fact that excludes the atime metadata; it just isn't likely to be a desirable workaround.

This was intended in favor of noatime to avoid the situation, since opting for strictatime or relatime for any software that relies on atime to function or work optimally is what causes this to occur from the read scenarios discussed.

In contrast, if you default to noatime and notice that your software isn't working well due to a lack of atime updates... you're not tasked with a frustrating cleanup (or opting for data loss by not trying to retain affected snapshots data), just replace noatime and get the desired atime updates.

noatime is thus the better default to prefer.


e.g. 1 (strictatime): each will be a mostly complete copy of the original subvol's metadata, with different timestamp values on all the inodes.

e.g. 2 (relatime): Every 24 hours there will be a new set of atime timestamps, so a new full set of metadata, then 23 smaller snapshots that share metadata with the first.

Yes. This is how I understood them as well and tried to communicate.

strictatime:

  • Is not ideal with hourly snapshots if you're going to have a recurring find / like task. A more likely scenario may be daily snapshots with some scheduled task or usage pattern that would update atime for many files between each snapshot.
  • On the other hand, if you're not frequently creating snapshots... then it may be better than relatime? (for which you described an I/O pressure scenario and delayed writes)

relatime:

  • Better when snapshots are created frequently, since atime-updating reads become less frequent, as low as once per day? So regardless of the number of snapshots over a duration, reads alone after each snapshot won't result in new atime updates wasting disk space unexpectedly (from a typical user perspective).
  • The benefit from strictatime is reduced once snapshot frequency is daily or greater.

In your initial comment you mentioned:

relatime may make this effect worse for some users, by batching up a lot of atime updates at close to the same time each day (e.g. when the user logs in at the start of the work day, triggering the once-every-24-hours update behavior of relatime).

Which I thought was comparing to strictatime, and a misunderstanding of delaying an atime update being written for up to 24 hours... you meant the delay as a throttle (perform straight away, but no more than X times within Y duration), not delaying the write until the duration expires (which lazytime appears to do if no other conditions are met earlier).

So with relatime or strictatime, when either would write an atime update, it does so upon the read, and the mentioned IO pressure concern for certain disks can occur. lazytime doesn't really help; only noatime?:

The impact of all this is relative. Metadata updates on random pages are extremely expensive on big, slow spinners, and not at all expensive on small, fast NVME devices.

This can result in the entire system freezing up for some time--from a fraction of a second on fresh installs to NVME devices, to several minutes on larger and older filesystems on slower spinning disks. This is true of any large metadata update--atime is just one special and easily avoidable case.


For tmpfiles and automatic file reaping you should not use a snapshot. The snapshot will prevent the file reaper from freeing any space (reaping activity will allocate more space instead as nothing is freed, and metadata pages are CoW-modified).

I just mention it as an example of something that can be configured on a system. The user may not even be aware of it when considering if they should use strictatime or relatime.

systemd-tmpfiles --cat-config | grep -v ^# | sort -u lists many locations and files that it affects. If you want to selectively avoid snapshotting such locations, you're probably going to need a more nuanced subvol layout/config? Or not use snapshots much if atime matters to you.

For a tmp reaper there is not much practical difference between ctime and atime for selecting old files. If you need a file to persist for a long time, why is it in /tmp?

It's not just /tmp, which is often on tmpfs anyway and a non-issue. The command I shared above shows various locations nested in /var, /run, /etc, /dev, even /srv and /home are listed... although I haven't quite looked into how to interpret what those config lines are doing for the service.

One example Lennart mentioned was coredump files. Without atime updates to determine age, it will also fall back to mtime or ctime; however, you may access something older that an atime update would have given a younger age, avoiding reaping.

One case that comes to mind for that would perhaps be caches such as file browser thumbnail previews. If you were using this service, you'd perhaps want to retain thumbnails for frequently visited locations, rather than generate those again, while the atime of less commonly used thumbnail previews lets them be reaped instead.

Not the best example, and perhaps of little value to snapshot, but you might have that location as part of a broader subvolume. Perhaps there are some scenarios where you may still want snapshots of that subvol, but clean those up on some scheduled basis to discard older cache?

systemd-tmpfiles could be beneficial for that, so long as the user is aware of it and enables atime for that; they can go through the extra effort to minimize any drawbacks from that content being captured in a subvol snapshot (if any).


btrfs does have cases where it's at least possible to resume a build after a power failure, and lazytime breaks those.

To clarify, that's due to mtime and ctime not having been updated on disk, right, not atime? Or is it also an issue with atime?

For mtime and ctime updates, these would get written to disk not long after on a standard system (eg 30 seconds dirty writeback?), which is still a risk, but the window for it being out of sync would be small.

I know that it's advised to perform a scrub after a power loss event (and presumably crash like a kernel panic), is BTRFS not able to detect the problem and provide a last known good state in this situation when lazytime is involved and causes a consistency issue?

Pragmatically how much of a meaningful data loss is risked? (this make operation example isn't losing anything important in practice?)

With lazytime, the timestamps may not be updated at the same times as the file contents, so scenario 1 and 2 may produce different results.

When you say produce different results, do you mean with the metadata difference alone, or the state of data that this example was in the process of generating/performing?

I'm having a bit of difficulty seeing it as a pragmatic example, since from my understanding you'd generally restart such a process from a clean state if a failure botched it previously.

Likewise something like a system update which could presumably be a related scenario would perform in the same manner (often with extra effort on BTRFS to have pre and post snapshots)

polarathene avatar Jun 19 '21 10:06 polarathene

Provided nothing else is keeping the file open, lazytime is not reducing atime updates to disk that much more than relatime alone would, given the find / bulk read example? Is this what you meant by lazytime potentially reducing writes for atime with reads? (even if the delay may not offset a write by much longer)

Yes and yes.

One could presumably recreate the snapshots or some other workaround after the fact that excludes the atime metadata, it just isn't likely a desirable workaround to apply.

It is very hard to get the space back without deleting every snapshot. If even one old snapshot still exists, it will keep a full copy of the old metadata on disk. btrfs provides no way to recombine metadata between subvols once it has been unshared.

On xfs, subvol metadata is parent volume data, so it's possible to dedupe the metadata, maybe also take the snapshot offline for external editing...but probably still quite impractical.

If you want to selectively avoid snapshotting such locations you're probably going to have a more nuanced subvol layout/config? Or not use snapshots much if atime matters to you.

The size also matters. Most users would not be bothered by 1500 16K thumbnail files, even if 100 snapshots contain different sets of thumbnails. They might be bothered by 100 snapshots of 160MB core files as they will use non-trivial space, and recovering that space requires deleting every reference in a snapshot (either by mutating the snapshot or discarding the entire snapshot). But that also depends on the filesystem size--it could be a disaster for a 128GB filesystem, and utterly trivial on a 16TB disk.

I'm mostly thinking about free-space-demand-driven tmpreaper use cases. We have various large caches where we retain data as long as free space allows, and we do have to make sure those aren't ever snapshotted because we need to free a lot of space very quickly. We build the system around these caches, so they have properly non-overlapping subvols set up.

It would be nice to have an inheritable attribute (as we do for datacow and compression) that controlled atime, so it could be selectively enabled in the specific places where it is needed, and disabled by default. We currently have an attribute that can only disable atime on specific inodes (so we can't mount noatime and then turn atime on for specific directories) and is not inheritable (i.e. files created in noatime directories do not get the noatime property, it has to be set for every inode individually).
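(That per-inode attribute is the one set with chattr +A; a minimal example with an illustrative path, showing why it doesn't scale: it has to be applied to every inode individually and new files don't inherit it:)

    chattr +A /srv/cache/hotfile   # suppress atime updates for this single inode only
    lsattr /srv/cache/hotfile      # 'A' shows up in the attribute flags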

To clarify, that's due to mtime and ctime not having been updated on disk right, not atime? Or also an issue with atime?

mtime and ctime. atime usually doesn't matter to a build. lazytime in theory affects all 3 timestamps, so it's quite different from the other atime-related options.

For mtime and ctime updates, these would get written to disk not long after on a standard system (eg 30 seconds dirty writeback?), which is still a risk, but the window for it being out of sync would be small.

The problem is that the risk is not exactly zero.

There is normally no separation between data and metadata update on btrfs for a single write() call. Both data and timestamp are updated, or neither, because the write() call holds a lock that prevents btrfs from committing a transaction between the two updates. lazytime would cause the mtime/ctime update to occur after the return of the write() call, possibly in a separate transaction from the data update, so the risk of the timestamp being out of sync with the data becomes greater than zero.

(Note that all of the above also applies to the i_version field too. NFSv4 relies on every filesystem to persistently update i_version to keep client-side data caches consistent. It is updated almost as often as the mtime field)

It turns out that this is moot on btrfs. btrfs has its own update_time implementation, and that implementation does not implement lazytime for writes. As I mentioned earlier, the inode and data references frequently occupy the same metadata pages, so very little IO saving is possible; however, there are benefits in some cases, so it's worth analyzing the effect of a hypothetical btrfs lazytime implementation in case the feature is implemented some day.

btrfs uses all generic VFS code to handle atime updates, so lazytime does affect atime updates on btrfs, just like all the other atime options; however, since a read() updates atime and nothing else, no inode/data inconsistency is possible.

I know that it's advised to perform a scrub after a power loss event (and presumably a crash like a kernel panic). Is BTRFS not able to detect the problem and provide a last known good state in this situation when lazytime is involved and causes a consistency issue?

Quite the opposite. The normal post-power-failure recovery procedure for btrfs is to turn the power back on and resume normal work immediately. Regular scheduled scrubs are important for discovering disk failures, but there is no benefit from scrub that is specific to power failure or crashes. It won't hurt to start a scrub after a power failure, but the post-power-on scrub won't have any useful effect unless the disk happened to start failing some time after the previous scrub.

btrfs uses transactions to ensure data consistency (not just metadata) after crashes and some hardware failures. This is a requirement of the data csum feature, where data and metadata are updated in lock step. btrfs does this without the performance hit of -o data=journal or -o sync so it is the default behavior. After a crash, applications get a point-in-time snapshot of their state on disk with no reordering of operations (-o flushoncommit) or no reordering except delalloc writes (-o noflushoncommit).

With flushoncommit, an application that stores persistent data on only one btrfs filesystem can't tell whether the application was previously terminated by SIGKILL or a host power failure/crash event, unless the application can access storage outside of btrfs filesystem, or the application intentionally bypasses the transaction mechanism on btrfs with nodatacow, or the application exploits a bug in btrfs.

lazytime would introduce a detectable deviation from the normal btrfs behavior, as it would allow data to become inconsistent with its ctime/mtime timestamps where this is currently not possible. Applications would then be able to behave differently if they are restarted after a power failure vs after being killed by SIGKILL.

From an ext4 user's point of view this concern may seem trivial, as ext4 users can expect far more data loss, inconsistency, and corruption to occur after every crash, and this is easily detectable by applications that modify data without using fsync. Indeed, on ext4 it requires skill and discipline to build applications that can avoid catastrophic failure after a crash.

On btrfs, a single-bit post-crash data/metadata inconsistency or reordering of the effects of mutating syscalls would be considered a serious bug. The same tree update code is used for snapshots in btrfs, which are supposed to be atomic, suitable for making consistent backups. If the post-crash filesystem content is wrong, the backups are probably wrong too, so the post-crash behavior is a first-class feature that is thoroughly regression-tested.

When you say produce different results, do you mean with the metadata difference alone, or the state of data that this example was in the process of generating/performing?

The output could be different due to inconsistencies between data and timestamps after the crash. If we have a dependency chain where A depends on B, and B depends on C, and C is updated, then lazytime results in a possible post-crash state where B's contents are updated on disk, but B's timestamp is not. After the crash, A may not be rebuilt with the new contents of B, because A's timestamp is newer than B's (old due to lazytime and the crash) timestamp.

Note that properly written Makefile rules will simply rebuild B again after the crash, since B's timestamp is older than C's, so we can't get incorrect output with just Make and lazytime by themselves. If C is updated by some other tool, such as rsync or git checkout, just before the crash, then the update may not be repeated because the other tool notices the file contents (not timestamp) are up to date. Make relies on file timestamps (not contents) to determine whether to execute build rules, so after the crash, Make sees that C is older than B, and B is older than A, so A is not rebuilt when it should be.

A more complicated dependency chain might result in a binary being built with .o files that contain code built from two different versions of a header file that defines the interface between the .o files (the make run before the crash updates one .o file, but the crash occurs, the timestamp on the header is lost, and make run after the crash does not update the other .o file using the same header). In that case the resulting binaries would have undefined behavior, which is generally considered very bad.

Backups are a better example. Tools like rsync and file indexers tend to rely on mtime as a kind of unique content identifier. They assume equal mtime means equal file contents, and will not propagate updates to backup without a mtime change. If lazytime drops a mtime change, then the backups don't get updated, and there is potential data loss if the backups need to be restored. (For clarity to those reading this text out of context: this would not affect btrfs send backups, which compare the file metadata directly, and do not rely on timestamps.)

Since btrfs doesn't implement lazytime for writes, the above is purely theoretical.

Pragmatically how much of a meaningful data loss is risked? (this make operation example isn't losing anything important in practice?)

One problem with my Make example is easy to analyze and predict, and the problems are easy to work around--and that makes it a bad example.

Real life isn't this predictable. A big app box running a dozen apps for an enterprise typically has a dependency on unsynced write ordering somewhere. The people running the box most likely don't know where that dependency is until something breaks on a power outage--and may still not know, even after fixing the resulting problems.

Pragmatically, over the last 5 years we spent about 30 days per year performing post-crash recovery actions on ext4 and xfs filesystems (comparing backups, scrubbing, auditing databases, restoring backups, manually fixing critical services, purging and repeating any large file tree updates that were in progress at the time of the crash, and generally trying to understand what went wrong with each individual application), compared to 0 on btrfs (we do all the checks, but don't find any problems to fix). We are very conservative about changes that make btrfs behave even a little bit like ext4.

Zygo avatar Jun 21 '21 03:06 Zygo

@Zygo thanks for the awesome insights and examples! Very informative and helpful :)


I will try to tersely summarize the problem and advice for someone making a decision:

Is atime a bad idea with BTRFS? Should I prefer noatime?

noatime is often a better choice.

This is only relevant when a file is read and its metadata is shared with at least one snapshot. Updating the atime then requires disk space for a new copy of the metadata to accommodate the change, while the old copy (with the old atime) is kept for any snapshots that still reference it.

With enough files read or snapshots over time, it can become a noticeable amount of disk space wasted. Using noatime can avoid this. Although if the affected snapshots are disposable / short-lived, you can keep the default relatime and just delete older snapshots when you need the space.

It is easier to switch from noatime to relatime/strictatime (avoid lazytime, as it risks data consistency guarantees) than it is to switch the other way around (due to snapshots "polluted" with metadata containing only unwanted atime updates).
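For anyone acting on this, a minimal sketch of what that looks like in practice (the UUID and subvolume name are placeholders):

    # /etc/fstab
    UUID=xxxx-xxxx  /  btrfs  subvol=@,noatime  0  0

    # or apply to a running system without a reboot:
    mount -o remount,noatime /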

More verbose version

The atime + snapshot disk usage problem

  • Metadata for files on BTRFS can be shared across multiple snapshots efficiently when it's identical.
  • Reading a file can trigger an atime update write to the metadata on disk.
  • Modifying the metadata loses the benefit of sharing with snapshots; CoW causes new writes to update the metadata, using up extra disk space.
  • If the file metadata to be updated is not shared with any snapshot(s), no extra disk space is needed to write the update.
  • The disk space can be recovered when all snapshots referencing the same copy of metadata are deleted.
  • The problem is that the old metadata is tied to any other content in your snapshots that may be of value to retain; there is no way to clean up space by removing only the redundant metadata in snapshots.
  • Thus the problem compounds if the frequency of atime updates affects many snapshots over time (as in each snapshot has different metadata for a file due to different atimes).

relatime, strictatime, lazytime vs noatime

  • relatime minimizes atime updates (and thus their precision) which is beneficial with frequent use of snapshots.
  • If snapshot intervals are daily or greater, strictatime can provide more atime precision at the expense of writing to disk with each file read.
  • Many atime updates to the same file(s) do not compound the disk usage increase if no new snapshots are taken. Only the first atime update for a file incurs the penalty of no longer being able to share the same data with a snapshot(s).
  • lazytime behaviour does not affect writes on BTRFS, and does not provide much benefit over relatime for reads. Additionally it can risk consistency guarantees that BTRFS is intended to provide for filesystem state.

noatime avoids the problem since no atime updates occur. Some software like systemd-tmpfiles may not function as optimally/accurately, but otherwise most users aren't likely to notice.


Let me know if this sounds right and we can wrap up the issue? :)

polarathene avatar Jun 21 '21 08:06 polarathene

I think it's a good summary, but it's mostly summarizing stuff I wrote, so I'm biased. ;)

Zygo avatar Jun 22 '21 18:06 Zygo

@polarathene thank you so much for the TL;DR and @Zygo thank you so much for not only answering my question, but also going in depth. The whole discussion was worth reading.

I'll wait for a member of the repo to close the issue in case they want to add/correct something.

TheEvilSkeleton avatar Jun 22 '21 21:06 TheEvilSkeleton

While having active access times does slow any fs, this discussion gives the impression that btrfs has a significant problem with access time updates. The risk seems to be that it could make snapshots significantly heavier and reads significantly slower than on other filesystems.

Since relatime is btrfs's default setting, should not test results be included to advise against it, or should the advice be caveated as theoretical?

strainer avatar Mar 09 '24 02:03 strainer

@strainer I thought my last comment covered the concern fairly well?

If you frequently take snapshots where the metadata has been adjusted primarily for atime, with some snapshots having actual meaningful changes interleaved along the way, IIRC that was known to be a performance issue for reads when you have a large snapshot history. Not so much the atime itself, just that it introduced overhead through all the redundant CoW copies of metadata, instead of the multiple snapshots being able to share what would otherwise be the same data.

You should be able to find historical examples of the perf regression as the number of snapshots increased (unrelated to atime specifically). But it shouldn't be an issue if you don't accumulate a large number of snapshots; the concern with atime was how it contributed to wasting storage through redundant copies of metadata in snapshots. Successive atime updates in between snapshots don't matter.

If things have changed for the better since, awesome! If you want to put together a test that disproves what was discussed here, I'm sure that'd be appreciated too 👍

polarathene avatar Mar 09 '24 04:03 polarathene

While having active access times does slow any filesystem, this discussion gives the impression that btrfs has a significant problem with access time updates.

This impression is not wrong. Other filesystems don't consume disk space when reading files with atime enabled. Other filesystems don't need 144KiB (single page update in 3-level tree) to 43 MiB (300 item update in a shared snapshot leaf page, where every item references an object in a different page) of random-order IO to update one inode. btrfs has between 1 and 4 orders of magnitude (powers of ten) higher IO overheads than other filesystems have for atime updates.

The default option only really makes sense if you're going to mount the filesystem once, but this is increasingly not the norm. Mount namespaces and bind mounts can keep applications like file indexers and backup software accessing the filesystem through noatime mount points (especially if they will be touching anything that has a snapshot somewhere) without affecting the *atime option of the mount points accessed by other applications on the system. Conversely, for things like systemd-tmpfiles, it can be useful to explicitly turn on atime or relatime when the last access time is especially relevant. This is easy to do if you're already using options like nosuid,nodev,nosymfollow,noexec for service-oriented views of the filesystem to minimize privilege escalation opportunities and TOCTTOU issues.
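
A rough sketch of such a noatime view (paths are placeholders; the same two-step recipe works the other way around for enabling atime for a single consumer):

# create a second, noatime-only view of /home for an indexer or backup job
mkdir -p /run/noatime-home
mount --bind /home /run/noatime-home
mount -o remount,bind,noatime /run/noatime-home
# point the indexer/backup tool at /run/noatime-home instead of /home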

Zygo avatar Mar 09 '24 08:03 Zygo

Thanks for the response Zygo, I appreciate your deep level of insight and confidence on this, but I am doubtful without seeing measurements of how significant the performance and space usage impact can be, since measurements can be surprising.

There are a number of old conversations about this problem, without any decent data on it. Google results highlight the 'Forza Ramblings' notice on btrfs mount options, where the author writes:

This problem is compounded if there are many snapshots, as all references have to be updated as well.

That seems to be a common impression, and a 2012 LWN.net post describes a massive metadata increase from having atime with 10 snapshots and grepping just once, but it sounds plainly broken to me. I'm sure only the metadata of the active snapshot should be affected by atimes. That was covered earlier here I think, but it might be good to set it straight.

I have not measured my situation, but I have not noticed problems in a couple of years, on my modest coding laptop using btrfs on SSD, from having relatime while keeping a few daily and weekly updated snapshots and a few older snapshots for 'timeshift'

And I'm not able to properly debate the technicalities, but the btrfs metadata I/O load that you highlight - it is cached. It is cached well enough that random 4k write performance is recently benchmarking around 50% the speed of EXT4 and XFS. I would not expect that level of performance to follow from that description of btrfs I/O overhead. No doubt the description is accurate, but it does not entirely represent the efficiencies which btrfs has developed over the years.

Quite possibly I am just living oblivious to the performance issues btrfs has with snapshots and atime, and being ignorantly contrarian, but I expect there should be some issue reports or some test results if it is a significant problem in practice. Since this is quite a negative claim about the filesystem's capability, it should be substantiated by tests.

strainer avatar Mar 09 '24 22:03 strainer

Polarathene - sorry I missed your post before replying. I will not belabor points here, but this reference in your earlier post - BTRFS requires noatime (2017) - has no substance: all that is shown is that an rsync between BTRFS and XFS can run at 60 MB/s on the btrfs side and 17 MB/s on the XFS side. That's completely opaque; it does not measure anything, because no info is given on the rsync options, or on which drive is the source. Yet it is declared "a case study". And 60 MB/s does not amount to an SSD I/O "problem". The test has to inform the claim. If the reference claimed the rsync operation took 5 minutes with noatime and was then carefully redone with atime and took 25 minutes - then we'd have an issue documented.

And the Fedora discussion is the same thing: no issues documented, just ideas. Fedora and other big distros have been shipping for years with the btrfs default value of relatime. Many people are keeping multiple snapshots and using timeshift as it is a killer feature. Somehow we are being convinced of a problem here, yet have no decently documented problem reports or tests to refer to - there is an actual dearth of evidence for this idea.

strainer avatar Mar 09 '24 23:03 strainer

Polarathene - sorry I missed your post before replying.

I was referring to my comment prior to yours joining in the discussion. It has no links, just summary of context with conditions where the atime concern is relevant.


this reference in your earlier post - BTRFS requires noatime (2017) - has no substance: all that is shown is that an rsync between BTRFS and XFS can run at 60 MB/s on the btrfs side and 17 MB/s on the XFS side. That's completely opaque; it does not measure anything, because no info is given on the rsync options, or on which drive is the source. Yet it is declared "a case study".

They describe an internal SSD (BTRFS) => external USB disk (XFS); given the I/O perf, and that only the BTRFS disk is identified as an SSD, the USB disk was likely an HDD. USB SATA bridge chipsets also have other quirks that impact perf beyond the HDD limitations; in both cases random writes aren't ideal.

So in this scenario, rsync is only reading from BTRFS to update files on the XFS disk. But due to atime, each file read also results in a write to BTRFS, and the author explains the overhead concern:

  • relatime is active, resulting in atime updates for effectively each file.
  • An atime update of a snapshotted file only updates the atime of one copy. Thus, in the likely case that the filesystem sector with the atime is still shared it has to be copied on that atime write. Meaning 2 or more write operations as a result of one read.
  • If you are really unlucky, running a backup job (or even just a grep) on a Btrfs filesystem mounted without noatime might even yield out-of-space errors in case all the copy-on-write atime updates exceed the available free space.

As @Zygo recently replied regarding the metadata size allocation, you can see how that ties into that final concern I've quoted above from that article. You're welcome to reproduce that for yourself if the information is not sufficient for you to trust?

And 60 MB/s does not amount to an SSD I/O "problem". The test has to inform the claim. If the reference claimed the rsync operation took 5 minutes with noatime and was then carefully redone with atime and took 25 minutes - then we'd have an issue documented.

The focus wasn't on SSD transfer speeds; it was on the I/O overhead of atime updates, with common enough operations like rsync or grep triggering those updates, rather than a specific set of options for the command. You just need a similar environment with BTRFS, a snapshot, and many files to access that would only be reads with noatime but introduce writes with relatime.

If you want to devise a more specific test, by all means go ahead, but there are plenty of other variables involved; you may want to configure an entire virtual machine with RAM disks and I/O limits if you want others to reproduce and discuss it without factors like their system environment or hardware differences muddying the observables.

The article does compare relatime vs noatime improvement:

Resolution: After canceling the rsync command, remounting the Btrfs filesystem with noatime and restarting the rsync job it sure enough performs much better, i.e. the numbers reported by rsync match the dstat ones, as expected.

  • relatime: 17MiB/s write with rsync to the XFS disk with BTRFS reads at 60MiB/s.
  • noatime: 60MiB/s write to XFS disk, matching the 60MiB/s read from BTRFS.

That's roughly a 3-4x improvement. The total duration isn't too important; if this is fairly consistent through the transfer, it's obviously going to complete much faster with noatime than with relatime.

The internal SSD that BTRFS is using here is obviously a SATA-based one. While I earlier got the impression of XFS being on an HDD (or the USB chipset being a bottleneck), it only occurred to me here that the 60MiB/s read from the SSD is not surprising for random I/O (look up random 4k benchmarks for SATA SSDs, with a Crucial MX500 being one of the better ones), so the USB disk itself may not be the actual I/O bottleneck there.


Many people are keeping multiple snapshots and using timeshift as it is a killer feature. Somehow we are being convinced of a problem here, yet have no decently documented problem reports or tests to refer to - there is an actual dearth of evidence for this idea.

You have plenty of information here on where the concern is relevant.

If you want to verify it in a way that is satisfactory for you as evidence, create a VM with BTRFS, populate it with a large number of files, snapshot it, perform an operation that updates the atime of those files, snapshot, repeat.

As those snapshots accumulate, you should be able to observe the impact on storage requirements and I/O provided you're using tooling to measure it appropriately.
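
A rough, untuned sketch of that loop (paths, file counts, and iteration counts are arbitrary placeholders; the filesystem needs to be mounted with relatime or strictatime for the effect to show):

# populate a subvolume with many small files
btrfs subvolume create /mnt/test/data
for i in $(seq 1 100000); do echo x > "/mnt/test/data/f$i"; done

# snapshot, read every file to trigger atime updates, and watch metadata usage grow
for n in $(seq 1 20); do
    btrfs subvolume snapshot /mnt/test/data "/mnt/test/snap$n"
    find /mnt/test/data -type f -exec head -c1 {} + > /dev/null   # 1 byte per file -> atime update
    sync
    btrfs filesystem df /mnt/test   # note the Metadata "used" figure each iteration
done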

The concern itself can be contextual. Not everyone is creating snapshots at such a frequency without an expiry + performing operations that affect a large number of files to compound the effect. relatime as noted already minimizes the issue too, along with nodatacow (which is temporarily treated as CoW if snapshotted) for large files with heavy write activity (this type of file content, unrelated to atime, is where a large chain of snapshots without nodatacow is problematic IIRC).


There's also pragmatic context to consider.

You'll often see talk of SSDs not needing to be defragged, unlike HDDs. The data can still get heavily fragmented, but the user won't notice as much of a latency hit with an SSD; with the I/O perf of modern disks, most reads will still appear quite quick even when the actual performance has regressed heavily.

With BTRFS you'll see some users happily using databases and VMs without nodatacow, citing that they experience no performance issues. Realistically they've probably not measured the impact; it just hasn't regressed, with their usage and hardware, to a point that's noticeable to them.

I think this is the point you're trying to contribute to the discussion? The problem with atime updates discussed here is legitimate, but you are more interested in knowing how likely you are to experience the impact? Hopefully this response illustrates that it's really going to depend on:

  • How you use your system (to trigger the atime writes + snapshot management)
  • How your system is configured and its capabilities (hardware, tuning/customizations)
  • What your expectations are pragmatically (what I/O perf or disk-use impact is actually meaningful to you)

If you have an abundance of disk space (I don't), or if, for example, an operation regressing from reading a 6MB file in 25ms (200 fragments) to 60ms (1000 fragments) doesn't bother you because you're not going to notice that kind of regression most of the time on a personal system, then you don't need to worry about why these regressions happen and how to avoid them (YAGNI).

For others (who may have already encountered problems like this and it matters to them why), the information here provides the insights for understanding the problem and how to avoid / minimize it 👍

polarathene avatar Mar 10 '24 00:03 polarathene

I'm sure only the metadata of the active snapshot should be affected by atimes

That's true, but I'm not sure you fully understand the implications.

The metadata of a non-read-only subvol is unshared by atime updates when it is read. This can be triggered by running locate on the original subvol between snapshots. This is the only way this unsharing can happen if the snapshots made from the original subvol are all read-only. In this case, there will be a slow linear growth of metadata over a period of days or weeks, as opposed to a sudden multiplicative increase over a period of minutes or hours.

If the snapshots are not read-only, then reading the snapshot also unshares the snapshot's metadata with atime updates. In this case, there can be a sudden growth that is multiplied by the number of snapshots that were read. This can happen if the snapshots are not read normally, but one day the user decides to run a find over them to locate a file, and triggers unsharing of all the existing snapshots at once.

A subvol's metadata can be unshared completely by accessing less than 1% of its contents. The directory atimes alone can unshare a subvol completely if it has an average count of fewer than 100 files per directory and a utility like find or locate reads all the directories. nodiratime can prevent this, while still allowing files to have atime updates.
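
For example (the mount point is a placeholder), this keeps file atime updates but drops directory atime updates:

mount -o remount,relatime,nodiratime /mnt/data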

Sudden metadata growth can force a filesystem into read-only state, and continued mounting with atime updates enabled can keep it in that state. It is a serious concern for smaller filesystems that tend to have less available space to absorb sudden metadata growth.

the btrfs metadata I/O load that you highlight - it is cached.

The cache doesn't eliminate the IO cost--it amortizes and defers the cost.

In the 3-level filesystem example (typical for a filesystem over about 10 million fragments), the minimum cost to update one inode is 144K. With caching and deferred inode updates, in the best case about 100 or so inodes can be combined in the same transaction update (if the files are empty, the inode numbers are consecutive, and the reads all occur within the commit interval), making the average write cost per inode 1.4KiB. These are all extreme values: a typical case will be somewhere between 1.4K amortized cost per inode in the best possible case, and 43 MiB non-amortizable cost in the worst possible case.

Which cost number you get depends on many workload-specific variables: how many files? how big are the files? on what schedule are the snapshots created? on what schedule are the files read? in what order are the data blocks written? for how many years has the filesystem had the same IO patterns?

In most filesystems, these questions don't matter very much, because the range of outcomes is relatively narrow: ext4's worst case might be 100x slower than its best case. On btrfs, the range is thousands of times wider than other filesystems. If you want to be informed by a published study or benchmark, you need to closely match the study's test conditions to your intended use case, or the results simply won't apply to you.

random 4k write performance is recently benchmarking around 50% the speed of EXT4 and XFS....the efficiencies which btrfs has developed over the years.

Earlier versions of btrfs had performance significantly lower than ext4 and xfs because btrfs is unable to make use of more than a few CPU cores, and it needs a lot of CPU time on the cores it can use. Better algorithms and reduced waste (and more performant hardware) have resulted in performance gains for btrfs in recent years, but most of the gains are relative to older versions of btrfs, because ext4 and xfs simply never had the scaling problems btrfs did (and still does).

None of those changes have any meaningful impact on IO cost in btrfs, which hasn't changed because the on-disk data format hasn't changed significantly since 2009. There are some efforts to improve the on-disk data format, but that requires more than merely updating the kernel: existing filesystems must be converted to something incompatible with current btrfs before any gains can be achieved.

AFAICT there's no improvement planned that would affect atime update cost. e.g. putting inode atimes in separate trees might reduce the worst case amortized IO cost to 144K per inode, but it would add new problems that are much worse (e.g. adding an order of magnitude to the cost of every stat call).

my modest coding laptop using btrfs on SSD, from having relatime while keeping a few daily and weekly updated snapshots and a few older snapshots for 'timeshift'

I'd expect you'll have one IO stall per day, it'll be a second or less on a modern NVMe, and it very likely occurs at a time when the IO subsystem is otherwise idle, because it will be idle about 99% of the time. If you happen to run the daily snapshot just before updatedb overnight, or if you don't use locate at all, then you avoid the worst latency effects of metadata sharing accidentally. You've been using it for years, which means it will have adequate metadata space allocated by now. Timeshift is the slow-linear-growth case, so there won't be any sudden increases in metadata size to worry about. I'm guessing it's not very close to full, or you would have had problems with it already.

It's a different story if your laptop uses MMC or spinning disks for storage (WD Blue and Purple devices provide some spectacular examples of this, but they're unlikely devices to find in a developer's machine), or if you're running a CI build workload as opposed to a developer build workload, or if you've just deployed a small VPS instance and you'll fill its filesystem to 90% in the first hour, then run out of metadata space the following day when relatime updates kick in.

What we typically see in the field is a node running a read-heavy high-file-count workload encountering unexpected stalls due to IO saturation on the storage devices. The IO stalls disappear immediately after switching from relatime to noatime, without changing anything else. We don't immediately rush to publish that finding--the cause is obvious (especially if you look at kernel stacks during a stall), it's a known issue that is decades older than btrfs, and it happens every day until we take action to make it stop. It's lived day-to-day experience, not a subject for a rigorous academic paper.

We now proactively set noatime without waiting for a node to have a performance problem first. There is no need to test new kernels for improvements in atime behavior because they are extremely unlikely. If btrfs eventually supports a new on-disk format, we'd evaluate that change as we would evaluate any other competing filesystem; however, regardless of the result of that evaluation, we'd most likely continue to use noatime anyway, because noatime always results in a net reduction of IO cost, and we're sensitive to IO cost because we continually hit IO capacity limits. A performance improvement means we could use fewer devices for our workload, but we'd still be saturating the IO on each one. In the rare cases where we have a hard application requirement for atime, we enable atime updates only in the application's private mount namespace, and only on paths where the application makes use of the atime data.
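
A rough sketch of that last arrangement, assuming util-linux unshare and placeholder paths (not a drop-in recipe, and it needs root):

# give one application an atime-enabled view of its data directory,
# while every other mount point on the system stays noatime
unshare --mount --propagation private sh -c '
    mount --bind /var/lib/app /var/lib/app
    mount -o remount,bind,strictatime /var/lib/app
    exec /usr/bin/app
'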

Zygo avatar Mar 10 '24 08:03 Zygo

Sorry, that is much to digest and reply to.

So I ran a couple of tests on my modest laptop's 250GB SSD btrfs root volume, which has been used and abused almost daily for 2 years, having crashed numerous times (system & power), and run flat out of space several times from defragging with snapshots. This btrfs volume has always been mounted with relatime default. The tests were run with 4 snapshots created within the prior days and weeks and one fresh snapshot just before the tests.

So it took 5 seconds to find and (sudo) touch 266 thousand files within /usr/share. With relatime updating their access times, the operation increased the entire volume's metadata usage from 1.9GB to 2.1GB.

/usr/share sudo find . -type f -exec touch {} +

The command completed 3 seconds faster when it was re-run immediately, and metadata usage did not increase. Presumably there was no access time update 'overhead' the second time.

Before I ran that test I grepped 120 thousand files totalling 11GB within ~/.cache. This operation took 406 seconds and increased metadata usage from 1.74 to 1.91GB.

~/.cache grep -aPor 'patternToNeverFind' *

When immediately re-run, the operation did not increase metadata and took a whopping 18 seconds less to complete. 18 seconds off 400 would be the maximum time which could be saved by mounting with noatime.

These two operations updated access times of 386 thousand files (about 1/3rd of the filecount on /home and root combined). This was a targeted attempt to create a problem with atime on a real system. The problem amounted to several seconds of delay. The atime updates to a third of the volume's files increased metadata usage from 1.74 to 2.1 GB. That's about 20% of the metadata size. The drive holds about 200GB of data, so it's about 0.15% of the drive's space usage.
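
(For anyone repeating this: the before/after metadata figures were read from btrfs' own reporting, with something along these lines.)

sudo btrfs filesystem df /        # summary of Data / Metadata / System allocation and usage
sudo btrfs filesystem usage /     # more detail, including per-device breakdown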

On the back of these observations I would recommend that most users not be concerned with btrfs' handling of atime or relatime. It is a very mature and optimised filesystem tested by hundreds of thousands of working systems. relatime is the default; leave it that way unless you are optimizing for commercial I/O compute performance, or trying to get more life out of old hardware.

tl;dr - if you mount btrfs with noatime, you may occasionally save 5% of the time it takes to make coffee.

strainer avatar Mar 10 '24 22:03 strainer

modest laptop's 250GB SSD

Since you raised a concern with prior citations being too vague, you should indicate here whether you were using an NVMe or SATA SSD; there is a notable difference not just in throughput but in queue depth.

If it is SATA, you may have more luck at reproducing the rsync article I detailed in my previous comment.


Another data point to include is context on what kind of workload your system represents. You note only having a few snapshots, which, with relatime, we already established earlier isn't likely to be a big concern.

You note metadata usage going up by 200-400MB from your atime updates; if you retain a larger number of snapshots, you can see how that would contribute to wasting a reasonable amount of space each time?


As was detailed before you joined the discussion, the concern is about snapshot count and size over time, with atime wastefully inflating disk usage that would otherwise be unaffected under noatime.

The other concern, as explained with the rsync reference, is IOPS-related, but will likely be context dependent.

If you don't have a known issue with relatime, I'll quote myself from earlier:

If you have an abundance of disk space (I don't), or if, for example, an operation regressing from reading a 6MB file in 25ms (200 fragments) to 60ms (1000 fragments) doesn't bother you because you're not going to notice that kind of regression most of the time on a personal system, then you don't need to worry about why these regressions happen and how to avoid them (YAGNI).

polarathene avatar Mar 10 '24 23:03 polarathene

Here is my developer box: filesystem is on a RAID1 pair of Seagate Ironwolf and WD Red SATA SSD, both rated for NAS usage. It has been mounted with noatime since mkfs, and has a snapshot from almost every day (112 snapshots for 141 days). The main subvol has 1938520 files, consisting mostly of Linux kernel build trees. Including snapshots, there would be 200 million distinct inodes if atimes are updated. Kernel version is 6.7.9. Host is a Ryzen7 CPU with 16 GiB of RAM.

Pre-test condition:

                              Data      Metadata System
Id Path                       RAID1     RAID1    RAID1     Unallocated
-- -------------------------- --------- -------- --------- -----------
 1 /dev/mapper/devel0617-tvdb 543.00GiB 70.00GiB   8.00MiB   126.59GiB
 2 /dev/mapper/devel0617-tvdc 543.00GiB 70.00GiB   8.00MiB   110.71GiB
-- -------------------------- --------- -------- --------- -----------
   Total                      543.00GiB 70.00GiB   8.00MiB   237.29GiB
   Used                       518.19GiB  8.65GiB 112.00KiB

I ran mount -oremount,relatime . on this filesystem, then started find -type f -ls | wc -l at the top level. After an hour, I get:

                              Data      Metadata System
Id Path                       RAID1     RAID1    RAID1     Unallocated
-- -------------------------- --------- -------- --------- -----------
 1 /dev/mapper/devel0617-tvdb 543.00GiB 92.00GiB   8.00MiB   104.59GiB
 2 /dev/mapper/devel0617-tvdc 543.00GiB 92.00GiB   8.00MiB    88.71GiB
-- -------------------------- --------- -------- --------- -----------
   Total                      543.00GiB 92.00GiB   8.00MiB   193.29GiB
   Used                       518.19GiB 54.26GiB 112.00KiB

To test latency for threads that want to write to this filesystem, every 5 seconds I run time sh -c 'mkdir test; rmdir test' on the filesystem while the above find command is running. Writes to the filesystem are blocked for 4 seconds every 10 seconds, with a 12-14 second stall every 30 seconds for transaction commit. Every ten minutes or so, there's a longer write stall lasting 2-6 minutes.
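
(The probe is nothing fancy; roughly:)

# run from a directory on the filesystem under test
while sleep 5; do time sh -c 'mkdir test; rmdir test'; done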

The SSDs are being hit with 70 MiB/s of writes continuously, except occasionally when the kernel gets CPU bottlenecked on btrfs workqueues, and IO bandwidth drops to near zero for a while.

With the exception of the mkdir and rmdir, 100% of the workload on this filesystem is reads, i.e. all of the writes come from atime updates.

Note that the find here is doing only readdir and stat(). No files are being accessed yet; all the chaos here is coming from atime updates on directory inodes (the real-world equivalent of this is when a user has run git status on a checked-out directory for the first time today). Reading a byte from each file would make the metadata grow even faster, but the metadata growth is already fast enough to kill the filesystem, so adding files to the test would be gratuitous.

The find is about 10% of the way through the filesystem after an hour, so after 10 hours there would be about 450 GiB of metadata, but the filesystem would run out of space in about 3 hours at this rate. If I added some more disk space so that the test could finish, roughly half of the filesystem would be occupied by atime updates by the end.

If this test was allowed to proceed until failure, it would force the filesystem into a read-only state that can only be recovered--maybe--by removing snapshots or adding more space. ("Maybe" because sometimes we have to patch the kernel to get the filesystem back when this happens.)

If the filesystem had been mounted with relatime from the beginning, those IO stalls would have been affecting the machine every day, and we'd have slower performance overall just due to the sheer size of all that metadata, and its effect on everything from tree search performance to SSD write wear levelling. Also the filesystem would have filled up months ago.

Zygo avatar Mar 11 '24 02:03 Zygo

Update (and corrections to predictions)

After 3 hours there were 122 GiB of metadata, and the find -type f -ls had finished running. I started running find -type f -exec head -c1 {} + >/dev/null to trigger atime updates on plain-file inodes.

Performance decreased somewhat over time. It took 10 hours from the beginning of the test to finally running out of space, about 3x longer than expected based on the performance of find from the first hour. mkdir/rmdir stalls became longer, with stalls up to 8 minutes in hours 5 and 6, and up to 24 minutes in hours 9 and 10. There were 41 stalls longer than 2 minutes.

The filesystem forced read-only due to running out of space for metadata, terminating the test. After umounting and mounting again (with noatime), no special recovery was needed.

Unexpected results
  1. 6 GiB of metadata space are free. I expected ENOSPC at slightly above 512M (the global reserve size). That might be a kernel regression, but maybe the tree update transactions generated by this test are over 6 GiB in size, and the last one was aborted.
  2. The first hour was much faster than the last one, but the 8th hour was faster than the first. I expected monotonic performance loss, with either a linear, exponential, or asymptotic curve.
  3. Metadata growth from diratime updates was about 30% of my interpolated estimate based on data at the end of the first hour (about 450 GiB), and about 15% of my worst case estimate (about 780 GiB) based on a model of btrfs metadata behavior and the total metadata tree size at the start. The model didn't include any unshared metadata pages between subvols at the start, which undoubtedly did exist.

One datum I didn't collect was memory usage. There were a number of dmesg messages about page allocation failures during the test. Kernel memory usage may be relevant for result 1 because memory size limits transaction size, or result 2 because memory exhaustion triggers early and repeated flushing of delayed metadata queues.

Different parts of the filesystem have different mixtures of directory inodes and non-directory btrfs items, and these can (and in real-world data, do) aggregate in zones large enough to throw off naive estimations based on sample data. This effect can contribute to both result 2 and result 3.

Zygo avatar Mar 11 '24 16:03 Zygo

I appreciate your effort Zygo; that was illuminating, but I remain contrary.

Contrast our tests - I targeted causing an issue with access times on a mature volume that was always mounted with the default policy. My own Btrfs volume exhibited no big cpu or storage issue with rapid access time updates to a third of its files, while keeping a modest number of snapshots and having modest hardware.

Your test targeted and revealed an issue on a volume that was always mounted with a non-default policy to optimise performance. (It is a performance tweak on most filesystems.) It then tried to 'convert', in a rapid manner, the policy-tweaked volume back to one that has updated access times, while having also an exceptional number of snapshots. The demo showed it's not practical to do that, however it did seem to not break anything (fingers crossed). It just didn't have the time and space to complete the operation.

The Btrfs reaction to this test looks robust to me, just not optimised for the edge case.

There was some ambiguity about what was really happening; for example, this first command:

find -type f -ls | wc -l

The '|' pipe feature is asynchronous and wc should not be waiting for the find to finish there, so file times as well as directory times were likely being updated there. And it was only necessary to touch the files, wordcount would have been trying to read through the whole 540GBs of data. So the noted I/O stalls were due to combination of access time update and wc consuming as much file read as possible.

Regardless of that, the excessive cpu and metadata bloat was well demonstrated. It is certainly curious how the updating of access times, seems to entail more demand than if a minor edit where applied to the same number of files, in the presence of many snapshots. Perhaps things get in a tangle and it might be smoothed out in the future. But I did not try to grok all the preceding discussion and did not convince myself of a position on any of the expert technical statements made about Btrfs internals. I just note that no one still has demonstrated the btrfs default policy to be "a bad idea". No issues have been properly opened for it, yet it is in use in many thousands of Linux installations.

strainer avatar Mar 11 '24 23:03 strainer

while having also an exceptional number of snapshots

LOL, 100 snapshots is far from exceptional.

on a volume that was always mounted with a non-default policy to optimise performance.

You are arguing against yourself here. The filesystem was mounted with a policy that made a normal use case practical when the default would make it impractical.

The demo showed it's not practical to do that, however it did seem to not break anything (fingers crossed)

Actually it has ruined the filesystem.

After attempting to remove a snapshot, the filesystem has now entered a state where it forces read-only due to ENOSPC during mount. The transaction is aborted, the filesystem reverts to its pre-transaction state, and a subsequent attempt to mount the filesystem again has exactly the same result. The test filesystem can no longer be mounted RW without a kernel patch. I expected this to happen during the test period, but it happened some time later as I attempted to free some space.

In btrfs support channels, we see people who have this happen to their filesystem, and copying the data to a new filesystem is the only way out of an outage.

My own Btrfs volume exhibited no big cpu or storage issue with rapid access time updates to a third of its files

Yes, if you are using less than 1% of your computer, you can enable relatime. You can enable quotas too, if leftover IO and CPU capacity bothers you.

didn't have the time and space to complete the operation.

It is true that the filesystem might not have entered this state if relatime was used from the beginning; however, the filesystem's data capacity would have been severely reduced to store the extra atimes, and the number of snapshots that could be kept online at 75% usage (and therefore the depth and granularity of possible restores) would have to be reduced to compensate. The storage cost is the same whether the space is allocated in 10 hours or 10 years. Time is irrelevant since the test could not be completed due to lack of space.

This is not a reasonable cost-benefit tradeoff, unless atime updates are somehow central to what the filesystem is used for.

There was some ambiguity about what was really happening,

There is no ambiguity. You are not understanding correctly what the commands do.

find -type f -ls searches directories and performs a stat() call on each file, forcing btrfs to retrieve the inode data (without -ls, find will skip the inode fetch since the file type is available in the dirent structure). -ls isn't really needed for this part of the test--reading the filenames would be faster and sufficient.

wc -l is reading find's output and counting the number of lines (files).

No files are accessed (i.e. no file inode atime updates are triggered) at this point. That is what the later find -type f -exec head -c1 {} + >/dev/null is for. The head command reads the first byte from each file, minimizing overall read IO.

So the noted I/O stalls were due to combination of access time update and full read of all files.

I verify the design of tests by running perf, bpftrace, and observing kernel stacks. I am familiar with the implementation details of Linux filesystems and btrfs code in particular. The effects presented here arise strictly from atime updates and their downstream effects.

It is certainly curious how the updating of access times seems to entail more demand than if a minor edit were applied to the same number of files, in the presence of many snapshots.

An atime update triggers CoW of the entire metadata page the inode resides in, recursively cascading upward to the tree root and downward to all leaves referenced by the page (of which there might be hundreds), and repeated at least 3 times (one for each tree that is modified--subvol, extent, and free-space). The cost of making a one-byte change (or even a 64KiB change) to a file is trivial compared to the cost of updating the file's timestamp. btrfs does a lot of work to amortize that cost (without which, btrfs wouldn't be usable at all) but the cost still must be paid.

This phenomenon is not specific to btrfs. ext4 on LVM snapshots has a similar issue: changing an inode forces a tree page allocation and write redirection, slowing down the filesystem and consuming space until the snapshot is deleted.

Zygo avatar Mar 12 '24 02:03 Zygo

EDIT: I'm replying at the same time as @Zygo apparently (just saw their response appear), so I'm potentially repeating them.


Your test targeted and revealed an issue on a volume that was always mounted with a non-default policy to optimise performance.

  • If you had relatime instead, you could still encounter the observed problem when the scope of updates isn't something you'd normally perform. It's not unrealistic that at some point you'd want to search your system for a file or some content, or do anything similar that touches a larger portion of the filesystem than you'd usually read.
  • There was also the rsync example which was the opposite with relatime degraded performance, with expected performance from remounting with noatime.

My own Btrfs volume exhibited no big cpu or storage issue with rapid access time updates to a third of its files, while keeping a modest number of snapshots and having modest hardware.

TL;DR: Your findings don't dismiss those from @Zygo, as his environment better matches the conditions under which this becomes a bigger problem. Notably your snapshot count is very low and I suspect you have an NVMe disk?

Verbose response

Your test was under conditions that we'd already detailed should not be a concern. You wanted to know if they were theoretical and critiqued the information and sources prior.

@Zygo had an environment that better demonstrates how just atime updates alone can produce the performance and disk usage concerns this discussion detailed earlier. It's not an outlier by any means, 100-ish snapshots isn't excessive, nor is the disk capacity or disk number. The hardware was modest, and can reflect personal systems fairly easily.

Compare the two environments:

  1. RAID1 pair of Seagate Ironwolf and WD Red SATA SSD

    SATA SSDs (assuming the Ironwolf is as well, not an HDD), RAID1 is presumably via software (via BTRFS) not dedicated hardware controller?

    @strainer you did not respond with the additional context I requested, where it's assumed you've got an NVMe SSD.

    This is a single physical disk vs two separate disks as another difference, but I'd say queue-depth is likely to play a role, along with reduced IOPs capability.

  2. a snapshot from almost every day (112 snapshots for 141 days).

    As opposed to yours:

    The tests were run with 4 snapshots created within the prior days and weeks and one fresh snapshot just before the tests.

  3. The main subvol has 1938520 files, consisting mostly of Linux kernel build trees. Including snapshots, there would be 200 million distinct inodes if atimes are updated.

    2 million files + 200 million distinct inodes from the update vs the smaller scale of your findings:

    These two operations updated access times of 386 thousand files (about 1/3rd of the filecount on /home and root combined


The '|' pipe feature is asynchronous and wc should not be waiting for the find to finish there, so file times as well as directory times were likely being updated there. And it was only necessary to touch the files, wordcount would have been trying to read through the whole 540GBs of data.

TL;DR: You misunderstood the command.

Verbose response

The find command is outputting a list of files:

  • That is what gets piped via stdout into wc -l as stdin.
  • wc -l will need the full output to provide the total number of lines it received.
  • It's just lines counted, not individual files. wc -l reads its input incrementally and counts newlines as they arrive, so it doesn't need to buffer the whole stream.

Definitely not reading file content, only metadata.

EDIT: Even I misunderstood, as @Zygo clarified that atime updates don't happen until the later find reads a single byte from each file 😅


So the noted I/O stalls were due to combination of access time update and wc consuming as much file read as possible.

TL;DR: Shouldn't have much to do with wc at all. More likely related to the environment; stalls can be due to the SSD hardware and the kernel I/O scheduler.

Verbose response
  • wc will be consuming from a file descriptor, unrelated to BTRFS filesystem or disk activity. Any buffering of input will be memory-bound in usage and I/O.
  • Actual disk reads are going to hit the filesystem and queue commands to the disk. SATA is limited to a queue depth of 32 IIRC. NVMe is 2^16 (approx 65k). There are other technical details beyond that which also influence the IOPS rate, especially when it's random access.

Additionally some SSDs also have different caching layers, which can be important for writes. On some NVMe drives you'll have a dynamic SLC cache which can sustain a fair amount of I/O at much lower latency. You may also encounter disks that defer to filesystem/kernel features to offload some work to the CPU when the disk hardware is lacking; one of those is related to cache (this is separate from the disk buffer managed by the kernel; I can't recall the exact name, but it had something to do with "host mode").

These are all external factors that I raised concern over earlier. If you want more accurate insights, use a better controlled environment such as a VM with I/O throttling limits on a disk that uses RAM as backing store (it can still emulate a SATA controller interface for queue depth).

Beyond that, you also have the I/O scheduler as another example. I wouldn't be surprised if that influenced the stalls @Zygo was describing and how they impacted responsiveness. BFQ presumably would assist there as that is a specific issue it excelled at (usually at the cost of throughput perf).
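
If you want to check or experiment with that on your own machine, the scheduler is exposed per block device in sysfs (the device name is a placeholder, and bfq has to be available in your kernel):

cat /sys/block/sda/queue/scheduler                    # available schedulers, active one in [brackets]
echo bfq | sudo tee /sys/block/sda/queue/scheduler    # switch until next boot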

@Zygo seems to know their stuff much better than I do, so the BFQ suggestion may be irrelevant 🤷‍♂️


I just note that no one still has demonstrated the btrfs default policy to be "a bad idea". No issues have been properly opened for it, yet it is in use in many thousands of Linux installations.

TL;DR: Because context matters, and these are not defaults you can easily convince upstream to change AFAIK.

It's only really relevant to users impacted, and those generally possess the expertise or can outsource it to resolve the issue. That's quicker/cheaper than trying to upstream a change.

Verbose version
  • Any user with usage patterns like yours isn't likely to be concerned about this, since they're not impacted. If anything it'd go unnoticed. Most aren't as curious as you.
  • Some may be affected a bit more, but they may also be comfortable with throwing money at larger and faster storage, or other workarounds if a technical understanding is out of their comfort zone. Common for consumers and for businesses (which may defer to a consultant instead to figure it out).
  • Those with more technical experience where the issue is more relevant may not notice the problem or experience it because they're already used to mounting with noatime.
    • If they're comfortable with that, there is often very little reason to understand the atime issue beyond what they're familiar with (not just with BTRFS as atime was historically a concern in general, habits tend to carry over).
    • For many in this position it's simpler / easier to just apply noatime themselves, possibly advise others. Effort to discuss changing from relatime isn't likely worth it, as can be the case with contributing change to many projects.

Some defaults are not always ideal, but stick around for legacy reasons or are heavily context dependent. They may be good enough, or a breaking change is not considered worthwhile (as I believe was discussed in the fedora link, since some software still relies on atime).

A similar example would be RLIMIT_NOFILE remaining at the 1024 default, mostly for legacy reasons: the kernel select() syscall is not compatible with a higher limit and can still be found in popular software codebases today.

polarathene avatar Mar 12 '24 04:03 polarathene

I don't think the SSD performance is very relevant to stalls here, as I've observed similar numbers on a range of faster hardware. I didn't observe any unusually long IO waits at the block level in this test. Contrast with an actual SSD stall issue, like when a drive's SLC cache fills up, where there is a pattern like 160 KiB writes that take 5 entire seconds to complete at the SATA interface. The devices I am using here (yes they are both SSD) are built for continuous low-latency writing, and I have these models running much heavier workloads (like bcache on a heavily used NAS) without stalling issues.

IO utilization is highly variable during the test. During the low IO times, btrfs burns a core and a half of CPU on delayed ref processing and block allocation functions, leaving the other CPU cores (and IO) mostly idle. This seems to be the limit of the CPU btrfs can use--more cores in the machine increase neither CPU nor IO utilization, but changing the CPU or memory speed has a direct proportional effect on throughput. btrfs's high CPU usage and inability to make use of multiple CPU cores effectively are long-standing and well-known issues. It's one of the motivations for the extent_tree_v2 work and some other btrfs-todo projects targeting excessive lock contention. Dave Chinner has complained about btrfs in extensive technical detail for anyone to Google.

Delayed ref and deferred metadata structures are relatively small in kernel RAM, so a large number of them can accumulate in a single transaction; however, when those updates are materialized as tree page updates, they become huge. It's going to take a while to push many GiB through a SATA port, even with optimal write sizes (which btrfs definitely isn't using--average IO size during the test was 24K, which is terrible on a SATA device). During a transaction commit, btrfs can't accept any further writes from userspace--there is a single global lock in the filesystem for transactions, all VFS API calls that modify the filesystem metadata have to acquire this lock (or allocate memory to defer the update, which eventually runs out), and the flushing transaction will hold the lock until the entire tree update is done.

Zygo avatar Mar 12 '24 05:03 Zygo

I'd like to chime in that this thread has been very enlightening and educational, worthy of an article on https://lwn.net/

Ultimately, perhaps distros should decide whether atime/relatime should be set or not. Their installers should advise users on the implications when using Btrfs (and other CoW filesystems?). Until then, spreading knowledge via articles and tickets like this is a good thing.

Forza-tng avatar Mar 12 '24 11:03 Forza-tng

@Forza-tng, I love absorbing your blogs when they pop up and had not expected you to notice this. Thanks.

@polarathene, I never intended to dismiss any data. I can't use the rsync report since, as I mentioned before, too few details were given in my estimation. Sorry to be obtuse, I don't want to debate it. My test was on a SATA SSD drive, a Crucial MX500 (250GB); I thought it was 500GB when I ordered it hah!

@Zygo, sorry to hear the volume was borked in the end. I do have to just agree to disagree about what that piped command was doing... anyway.

I'm surprised to hear over 100 snapshots should not be considered exceptional. I thought they were kept for robustness testing purposes. I have never collected snapshots like they are featherlight objects. I guess you show they can be under noatime. I'd still be nervous something will creak until the day btrfs is perfected.

But by keeping an armful of snapshots I get a really amazing undo feature on my workstation. 'Timeshift' and 'snapper' manage them breezily without having to do console commands. The worst btrfs issue I had was with KDE's Baloo file indexer, getting itself confused by the btrfs volume and eternally re-indexing Home. I had to virtually disable it since it would not finish reading while idle, and its database size was threatening free space. Thankfully it's fixed now.

And of course manual defrag is dangerous for the casual administrator: the last time I deleted all my snapshots to have a nice defrag, Timeshift made an automatic one during the process and I crashed out of space, arggg! But that serves me right for fiddling around and disabling autodefrag.

The demonstration of this no/atime issue was very revealing, but it also leaves big questions...

My own metadata grew by 400 megs by updating 400k file times with 5 snapshots. I don't think that's a problem, it's fair cost for having snapshots and modified files (times). If I had 115 snapshots, and the metadata increase scales with them, then it should have grown by about 9 Gigs. That's getting a bit heavy, but still not an emergency. I could not determine how many files' times were updated in Zygo's operations, perhaps a similar amount? Maybe it got 20% of the way through the 2 million files?

  • The big question is why did it grow by over 100 Gigs?
  • Does the bloat scale quadratically with the number of snapshots?
  • Is there a critical number of snapshots that triggers this?
  • Is it a bug caused by moving from noatime to rel/atimes?
  • Could a patch stop some tangle occurring in the metadata's metadata?
  • Does the excessive growth occur from non-timestamp modifications to many files?

With all those questions and more unanswered about what the issue is here, I still don't think it is justified to advise against the default policy or even to worry over it with warnings in installers.

My test was on the default, and gave a sample size of 1 with no problem found. Maybe my snapshot collection is a bit too frugal to totally reassure. Don't forget the silent sample of thousands of in-use Linux installations running Btrfs with relatime and not having generated a single concrete complaint over it yet. If it is a big issue, in normal use systems will slow and bloat and users will complain in numbers. What has been expertly extrapolated here has not happened yet.

strainer avatar Mar 13 '24 00:03 strainer

it's fair cost for having snapshots and modified files (times).

Just for clarity:

  • atime is about when a file was last accessed, which is usually not that useful.
  • mtime is when a file was last modified, a more useful attribute (a quick way to inspect both is shown just below).
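
A quick way to inspect both on a given file (GNU coreutils stat; the filename is a placeholder):

stat -c 'atime: %x | mtime: %y | %n' somefile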

Do you have any known use cases that matter to you to keep atime updates, or is this just about trusting the defaults for you, especially since in your case the impact isn't as bothersome?

I do have to just agree to disagree about what that piped command was doing

Run the command and see for yourself?

The find is only outputting a list of lines that also includes file paths, but wc is just counting those lines; I don't see why you'd think it's processing files 🤷‍♂️


I can't use the rsync report since, as I mentioned before, too few details were given in my estimation. Sorry to be obtuse, I don't want to debate it

I responded with why "too few details" was irrelevant, but ok.

My test was on a SATA SSD drive, a Crucial MX500 (250GB)

Thanks! That was unexpected, but that is a very good SATA SSD 👍


I have never collected snapshots like they are featherlight objects. I guess you show they can be under noatime.

BTRFS snapshots are light compared to other filesystems' snapshot features. They should only store the delta of changes, with everything else shared. Some features like defrag can accidentally mess with that afterwards though.

should have grown by about 9 Gigs. That's getting a bit heavy, but still not an emergency.

Depends on context.

My system regularly had about 10GB free for months; sometimes it's a little better, at around 30GB. Eventually I'll get a larger disk, but that sort of disk waste for something that contributes little value would cause me problems.

Some software I run obviously does not appreciate running out of disk space; I have to be mindful of that risk as it's more problematic to recover from.


With all those questions and more unanswered about what the issue is here, I still don't think it is justified to advise against the default policy or even to worry over it with warnings in installers.

You got a detailed explanation from @Zygo that explained why it happened... many of your questions would be answered just by reading and properly comprehending his responses.

silent sample of thousands of in-use Linux installations running Btrfs with relatime not having generated a single concrete complaint over it yet

As I said in a prior comment, context.

  • Users with experiences like yours aren't affected, doesn't matter.
  • Users that would be affected either aren't because their experience already has them mount with noatime, or they found the solution and know it's less effort to fix and move on vs trying to debate and justify a change upstream.
  • Then there are the vocal users, that provide insights that contrast against your own for where it is a problem.

That is common. There are plenty of issues like these; I previously cited RLIMIT_NOFILE as an example, especially within container workloads where the default was commonly infinity. There were plenty of complaints from the vocal set over the years; while I made a considerable effort to push for change upstream, it took a long time and resulted in some users being unhappy with software that relied on the previous flaw.

polarathene avatar Mar 13 '24 01:03 polarathene