
Sync filesystem before rename during atomic write

jadoc opened this issue 1 year ago · 3 comments

Make sure the contents of the file are written to disk before renaming it on top of any existing file. In the event of power loss or OS crash, this ensures that the file will contain either the old version or the complete contents of the new version. Otherwise, the user could be left with an incompletely written file and data loss.

In each of the Windows and Unix implementations of FileAccess, add an internal sync() operation that's called right before the rename if is_backup_save_enabled(). On Unix, sync() calls fsync() on the underlying file descriptor. On Windows, sync() calls FlushFileBuffers() on the underlying file handle.
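A minimal sketch of what the two implementations could look like (the f member and the exact class layout are assumptions for illustration, not the engine's actual internals):

// Unix: needs <stdio.h> and <unistd.h>. Flush the libc buffers, then ask the
// kernel to persist the file's contents before the rename happens.
void FileAccessUnix::_sync() {
	fflush(f);
	int fd = fileno(f);
	if (fd != -1) {
		fsync(fd);
	}
}

// Windows: needs <io.h> and <windows.h>. Same idea using the Win32 API.
void FileAccessWindows::_sync() {
	fflush(f);
	HANDLE h = (HANDLE)_get_osfhandle(_fileno(f));
	if (h != INVALID_HANDLE_VALUE) {
		FlushFileBuffers(h);
	}
}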

This fixes #98360.

jadoc avatar Oct 20 '24 10:10 jadoc

Hi!

Overall it looks fine, but there are a couple of comments I have:

fileno can fail

Currently:

int fd = fileno(f);
fsync(fd);

I suggest to change it to:

int fd = fileno(f);
if (fd == -1) {
	// Error. Set badbit.
} else {
	int status = fsync(fd);
	if (status != 0) {
		// Error. Set badbit.
	}
}

By "badbit" it means tagging the structure as bad, since something (likely unrecoverable) went wrong with file handling.

Handling Apple

On Apple you need F_FULLFSYNC.

#ifdef __APPLE__
int status = fcntl(fd, F_FULLFSYNC, NULL);
if (status) {
	// If we are not on a file system that supports this,
	// then fall back to a plain fsync.
	status = fsync(fd);
	if (status != 0) {
		// Error. Set badbit.
	}
}
#endif

See this discussion.

On non-Apple, consider fdatasync instead

Instead of calling fsync, call fdatasync.

Update: sqlite checks at build time whether fdatasync is reliable or not, and chooses between fdatasync and fsync.
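Something along these lines, for example (HAVE_FDATASYNC here is a placeholder for whatever the build system would define, not an existing Godot macro):

// Prefer fdatasync() where the build system has determined it is reliable;
// otherwise fall back to a plain fsync().
static int sync_fd(int fd) {
#ifdef HAVE_FDATASYNC
	// Skips flushing inode changes that only affect mtime/atime.
	return fdatasync(fd);
#else
	// Flushes data and all metadata.
	return fsync(fd);
#endif
}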

No benchmark data provided, no regression tests

Occasionally fsync/fdatasync can cause global system slowdown.

At the process level, indiscriminately applying fsync/fdatasync at arbitrary moments while many IO operations are in flight can cause significant slowdowns. I don't know how often Godot calls _sync().

These operations should be used sparingly for specific file data where preserving its contents is of utmost importance. This usually means config files and user data.

Editor files that can be restored because they're already in version control should not use this feature. Files that go into the cache should not use this feature either.

The problem it's solving is a big one (without proper fsync, power failures can cause both the backup and the new data to be lost*), but there needs to be some analysis on the impact as it can potentially put loading times back to 1990.

* It is a bit concerning that the Reddit post you linked to says all open scn files were destroyed as a result of the power failure. That suggests Godot is aggressively overwriting files even without any changes.

Ideally, all IO files being written would defer fsync-then-OS-rename to the end of all IO processing, to maximize throughput. But that would likely require a higher level redesign.

darksylinc avatar Oct 20 '24 15:10 darksylinc

fileno can fail

The only way fileno() can fail is if a bad argument is provided, which shouldn't be possible here. Regardless, the error handling in FileAccess is pretty much nonexistent for anything but the initial open. There is no "bad bit" to set. Fixing error handling in FileAccess would be a whole separate (and rather large) proposal. Though, I could log an error message for such an unlikely scenario.

Actually, what I should do is retry fsync() when it returns EINTR.
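Something like this, assuming the descriptor is already validated (needs <errno.h> for errno and EINTR):

int status;
do {
	status = fsync(fd);
} while (status != 0 && errno == EINTR); // retry if interrupted by a signal
if (status != 0) {
	// Unrecoverable sync failure; at minimum log it, since FileAccess has no
	// richer error channel today.
}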

On Apple you need F_FULLFSYNC.

It looks like Apple now recommends F_BARRIERFSYNC. I really hate Apple's poor documentation of these options.

On non-Apple, consider fdatasync instead

I don't think this will make any noticeable performance difference. fdatasync() helps in two scenarios:

  1. If using a spinning disk, it avoids the seek to the inode after writing file data.
    • Any user expecting reasonable performance from Godot will be using SSDs.
  2. When doing repeated appends to a file, it avoids repeatedly updating the inode on each append.
    • sqlite has this behavior. Godot does not.

I would rather force the mtime to be synced so that a user inspecting their save files after a power loss sees the correct timestamp for when their game was saved.

These operations should be used sparingly for specific file data where preserving its contents is of utmost importance. This usually means config files and user data.

Unfortunately, Godot's FileAccess does not allow this distinction to be made. There is a single global setting, OS.set_use_file_access_save_and_swap(), that sets the desired level of consistency for all FileAccess operations in the engine regardless of file type. Fixing this would be another non-trivial proposal.

Editor files that can be restored because they're already in version control should not use this feature.

I disagree with this statement. The editor doesn't really have real-time performance expectations when it comes to operations like saving files. An application shouldn't be relying on an external tool for basic corruption protection of user data.

The fact that I've found two cases of people losing large parts of their projects due to this is evidence that we shouldn't simply rely on version control. Many Godot users are artists, not computer scientists. They may be struggling with Git, and therefore using it less frequently than you or I.

Also, even if we told users to make sure they updated version control after each change, there would be no performance gain as Git is calling the exact same fsync() operation itself.

The real problem with the editor is the serialization of file operations. A file is completely written and closed before moving on to the next one. A better way to handle this in the editor would be to make sync() a public method on FileAccess. The editor could then:

  1. Write all the pending save files, but not close them.
  2. Sync all of the written save files.
  3. Close all of the written save files.

This would mean that we would only wait once for physical storage to catch up. Again, this would be a proposal beyond the scope of this change. I think this is what you are suggesting at the end of your comment.
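In rough pseudo-C++ (FileAccess::sync() and the surrounding names are hypothetical, not an existing API):

// Hypothetical editor-side batching; FileAccess::sync() does not exist yet.
Vector<Ref<FileAccess>> pending;
for (const SaveTask &task : save_tasks) {
	Ref<FileAccess> f = FileAccess::open(task.path, FileAccess::WRITE);
	f->store_buffer(task.data);
	pending.push_back(f); // 1. write everything, keep the files open
}
for (Ref<FileAccess> &f : pending) {
	f->sync(); // 2. one sync pass over all the written files
}
for (Ref<FileAccess> &f : pending) {
	f->close(); // 3. close (and thus rename) them afterwards
}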

The problem it's solving is a big one (without proper fsync, power failures can cause both the backup and the new data to be lost*), but there needs to be some analysis on the impact as it can potentially put loading times back to 1990.

Just to be clear, this isn't going to affect load times. My subjective ad-hoc testing shows minimal impact when dealing with a few large files on a USB thumb drive and no noticeable impact when dealing with a high performance internal SSD.

As games are generally writing only one or a handful of files during each save operation, I don't expect this change to be noticeable in the vast majority of games. Also, only games that have explicitly set OS.set_use_file_access_save_and_swap(true) could possibly be affected, as the default for this setting is false.

The real opportunity for noticeable performance regression is when a large number of files are written sequentially. The editor may do this, and the editor does set OS.set_use_file_access_save_and_swap(true). This change is likely to be noticeable in the editor for some projects. However, I think the benefits of preventing data loss in projects people have spent a lot of time on are well worth it.

I'll see if I can find some time to set up an example project with 1,000 small/medium files and measure the time difference in saving the project.

jadoc avatar Oct 20 '24 18:10 jadoc

It looks like Apple now recommends F_BARRIERFSYNC. I really hate Apple's poor documentation of these options.

Agreed :smile:

Just to be clear, this isn't going to affect load times. My subjective ad-hoc testing shows minimal impact when dealing with a few large files on a USB thumb drive and no noticeable impact when dealing with a high performance internal SSD.

OK, I'll elaborate. I've been bitten by the fsync monster before. Let's say Godot is calling fsync 240 times per second for 10 seconds. This would annihilate overall system performance by forcing SSDs to flush aggressively. It also severely wears down SSD cells. We are talking about a 10x to 1000x level of slowdown, especially on high performance SSDs, because the high-performance ones (i.e. server-grade) are designed to fully honor flushes.

It'd be the equivalent of setting dirty_writeback_centisecs to 0.41 (the Linux kernel only accepts integer values).

Basically what I'm saying is some extra guards are needed. Off the top of my head (I'm talking out loud):

  1. Gathering statistics. When the editor closes, they're printed to the console in the form of flushes per second.
  2. Track flushes in the form of flushes per interval. If a threshold is exceeded (e.g. 20 flushes in 1 second), it is printed to the console as a warning (a rough sketch follows below).

Basically we need to understand what's going on and mark potential sources of performance problems.
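For the second guard, something like this (a rough sketch; the names and the threshold are placeholders):

// Rough sketch of a flush-rate guard; names and threshold are placeholders.
static uint64_t flush_count = 0;
static uint64_t window_start_usec = 0;

void note_flush() {
	uint64_t now = OS::get_singleton()->get_ticks_usec();
	if (now - window_start_usec > 1000000) { // a 1-second window has elapsed
		if (flush_count > 20) {
			WARN_PRINT(vformat("High fsync rate: %d flushes in the last second.", (int64_t)flush_count));
		}
		flush_count = 0;
		window_start_usec = now;
	}
	flush_count++;
}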

I'll see if I can find some time to set up an example project with 1,000 small/medium files and measure the time difference in saving the project.

I don't really care if saving 1000 small files takes 100x longer, because it is done so safely (but of course, being able to improve the time it takes with higher level refactors is always welcome).

I'm more worried about Godot using sync() in unnecessary moments, causing overall slowdown at all times or during project loads.

Which brings me to:

The fact that I've found two cases of people losing large parts of their projects due to this (...) As games are generally writing only one or a handful of files during each save operation (...)

I'm just saying we need to understand the problem more and understand what's going on. Sure, the fsync() is necessary and your PR will fix it.

But I am heavily worried that ~12 open scenes were all lost after a power failure. That is not normal, even if the app doesn't fsync().

Having 1 to 3 files corrupted would be normal. 12? Something's up and needs more research. Gathering data is key.

darksylinc avatar Oct 20 '24 20:10 darksylinc

FYI, to address the most relevant bits first, I'm quoting you out of order, but hopefully not out of context. Some of what I've written near the end is just to clarify my own understanding.

But I am heavily worried that ~12 open scenes were all lost after a power failure. That is not normal, even if the app doesn't fsync(). Having 1 to 3 files corrupted would be normal. 12? Something's up and needs more research.

The informal report from the user stated that "the majority of the dozen or so scenes I had open" were corrupted. Depending on the level of hyperbole in that statement, this could be anywhere between 5 and 9 files. This is suspicious, but not implausible. The .tscn files often are not very large, so they could all easily be in the write cache at the same time. All it takes is for the user to have made changes to 5+ scenes before saving, which may happen during a refactor.

I'm just saying we need to understand the problem more and understand what's going on.

One way to resolve fault ambiguity is to remove obvious defects so that less obvious defects become apparent.

Basically what I'm saying is some extra guards are needed. Off the top of my head (I'm talking out loud):

Gathering statistics. When the editor closes they're printed to the console in the form of flushes per second. Track flushes in the form of flushes per interval. If a threshold is exceeded (e.g. 20 flushes in 1 second), it is printed to the console as a warning. Basically we need to understand what's going on and mark potential sources of performance problems.

Point taken. Though, it's still not clear to me if you are more worried about user code or the editor.

I see that Godot doesn't have any performance monitors for I/O. Maybe we could add that. [^1]

For user code, this data would be visible in the editor. For the editor itself, we could have it emit the stats on stdout when run with --verbose. It could emit the fsync() count over the last interval whenever it's non-zero. I generally don't like hardcoded thresholds (e.g. 20 in the last second) as they tend to go stale as hardware gets faster and we may hide issues on older systems.

If you like the idea, would you prefer to see such a monitor as part of this change, a prerequisite to this change, or a follow-up after this change?

Let's say Godot is calling fsync 240 times per second for 10 seconds.

To be clear, this means someone is sequentially writing 2,400 distinct files via the FileAccess API after explicitly setting OS.set_use_file_access_save_and_swap(true). This is possible, but I'm guessing very rare. I could be wrong.

This would annihilate overall system performance because of forcing SSDs to flush aggressively.

It would affect read performance as well as write performance, but I don't expect most games to be loading at the same time they are saving. Maybe that's a bad assumption. Or are you worried about page file contention?

We are talking about 10x to 1000x level of slowdown. Specially on high performance SSDs, because the high performant one (i.e. server-grade) are written to fully honor flushes.

Enterprise grade SSDs generally contain capacitors that allow them to acknowledge synchronous writes while data is still in the cache, so I would think the issue would be considerably less noticeable there.

It also severely wears down SSD cells.
(...)
It'd be the equivalent of setting dirty_writeback_centisecs to 0.41 (the Linux kernel only accepts integer values).

This is an interesting comparison. dirty_writeback_centisecs affects all I/O on the system and may cause a file to be partially flushed while still being written, causing substantial write amplification. However, this flush is asynchronous, so should be less disruptive to SSD performance.

The proposed use of fsync() is only done after a file is fully written and only affects the blocks relevant to that file. The next write is going to be for a different file whose LBA for the first data block is mapped to a different page in the SSD anyway. There should be no write amplification for the file's data blocks. Conversely:

  1. The journal will likely be synchronously flushed when it isn't on a page boundary, resulting in moderate write amplification.
  2. One will occasionally get unlucky and need to sequentially write inodes in the same page, resulting in a minuscule write amplification.

The major problem with fsync() is that it makes the writes synchronous, which is highly disruptive to SSD performance.

In either case, this all breaks down once garbage collection in the SSD is overrun, at which point, yes, write amplification and wear rapidly accelerate. On a 512GB SSD that's about 80% full, one should be able to write on the order of 100GB (512GB × 20% free ≈ 100GB) before overrunning garbage collection, assuming no other recent writes.

I think the conclusion here is that a small setting for dirty_writeback_centisecs is likely worse for SSD wear, but rapid use of fsync() is much worse for performance.

[^1]: Though, it's unfortunate that this interface requires a fixed interval to be specified at metric definition rather than providing an abstraction on top of an internal monotonically increasing counter, which would allow a custom and not perfectly uniform interval for each consumer. As it is, small-timescale performance events are at risk of getting lost.

jadoc avatar Oct 22 '24 03:10 jadoc

So, the docs Apple published to the web for fcntl() don't mention F_BARRIERFSYNC, but the man page for fcntl() on a reasonably up to date macOS does document it. It relies on a proprietary NVMe barrier operation only supported by Apple SSDs. Issuing F_BARRIERFSYNC forces the SSD to complete all writes issued before the barrier before any writes issued after the barrier. The device can still reorder writes as long as the barrier isn't violated. The return of fcntl() is not blocked on the writes completing.

This is the exact behavior we want. I also checked the XNU source and see that there is an automatic fallback to F_FULLFSYNC if the SSD doesn't support the barrier or the filesystem isn't APFS. Though, we may want to explicitly fall back to F_FULLFSYNC if the OS is older than macOS 12.
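Roughly, the fallback chain could look like this (untested; the exact order of checks is my assumption based on the above):

int status = -1;
#if defined(__APPLE__)
#if defined(F_BARRIERFSYNC)
// Ask for an I/O barrier first (macOS 12+, Apple SSDs on APFS).
status = fcntl(fd, F_BARRIERFSYNC);
#endif
if (status != 0) {
	// Older macOS, non-Apple SSD, or non-APFS filesystem: force a full flush.
	status = fcntl(fd, F_FULLFSYNC);
}
#endif
if (status != 0) {
	// The filesystem supports neither fcntl command: plain fsync().
	status = fsync(fd);
}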

Now I just need to find a Mac I can test on.

jadoc avatar Oct 26 '24 10:10 jadoc

Now I just need to find a Mac I can test on.

@jadoc Don't hesitate to join the Developers Chat, especially channels like #platforms. I do have a Mac, so if you need testing, I can help.

adamscott avatar Oct 28 '24 16:10 adamscott

I'm gonna try on a macOS VM (if everything goes well).

adamscott avatar Oct 29 '24 16:10 adamscott

  • It is a bit concerning that the Reddit post you linked to says all open scn files were destroyed as a result of the power failure. That suggests Godot is aggressively overwriting files even without any changes.

See https://github.com/godotengine/godot/issues/32086 and https://github.com/godotengine/godot/issues/31186.

Calinou avatar Oct 31 '24 11:10 Calinou

See https://github.com/godotengine/godot/issues/32086 and https://github.com/godotengine/godot/issues/31186.

This is really good info. Thanks! It definitely aligns with what I've been seeing (now that I'm paying attention thanks to this ticket) and what I feared.

  • RE: This ticket. Looks alright. I'm not part of the team that decides on its inclusion, but LGTM.

  • RE: Overall IO problem. It needs a follow-up. Once this ticket gets merged, fixing unnecessary IO will become critical. Otherwise, simple things like the Godot Editor gaining focus may end up with visible 0.2-1s stalls (that will depend on OS & SSD model/brand).

darksylinc avatar Oct 31 '24 19:10 darksylinc