[TW-94] Atomic file writes
Paul Beckingham on 2009-05-25T21:43:36Z says:
Guaranteed atomic file writes via mv
Migrated metadata:
Created: 2009-05-25T21:43:36Z
Modified: 2017-01-16T16:18:47Z
Jakub Wilk on 2013-03-04T23:22:19Z says:
Any news on this? I've just lost half of my pending.data because I pressed Ctrl+C at the wrong moment. :(
Paul Beckingham on 2013-03-05T01:27:47Z says:
Any news on this?
Nope. No patches received either.
I've just lost half of my pending.data because I pressed Ctrl+C at the wrong moment. :(
Don't do that. If you don't have a backup copy, the pending.data can be recreated from undo.data using "task merge".
Paul Beckingham on 2014-12-27T16:37:07Z says:
The "task merge" advice above no longer applies. Everyone make backups.
Wim Schuermann on 2015-07-22T21:17:57Z says:
The problem with getting this right(tm) is that writes aren't truly atomic unless backlog, completed, pending, and undo.data are all written atomically at the same time. So to get this completely right, one would have to replicate the $TASKDATA directory and atomically move that over once all writes are done. That this is problematic, bordering on impossible to get right (what about hooks that use $TASKDATA for data storage, for example?), should be obvious.
Are there better options for this? For example, would writing each individual file atomically be enough if the user were informed about potentially mismatched data files?
Daniel Shahaf on 2015-08-14T01:44:15Z says:
Another workaround: keep ~/.task/ in version control and make automatic commits from the on-exit
hook. Then if the data files get corrupted, just revert to the state after the most recent commit.
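For illustration, here is a minimal sketch of that workaround as a compiled on-exit hook (a shell script would work just as well). It assumes the default data location, that ~/.task is already a git repository, and that the binary is installed as an executable under ~/.task/hooks/ with a name starting with "on-exit"; the hook name and commit message are made up:

```cpp
// Minimal sketch of an on-exit hook that snapshots the data directory with git.
// Assumes ~/.task is already a git repository.
#include <cstdlib>
#include <iostream>
#include <string>

int main ()
{
  // The on-exit hook receives JSON on stdin; this sketch just drains and ignores it.
  std::string line;
  while (std::getline (std::cin, line))
    ;

  const char* home = std::getenv ("HOME");
  if (home)
  {
    std::string data = std::string (home) + "/.task";
    std::string cmd  = "git -C " + data + " add -A && "
                       "git -C " + data + " commit -q -m 'task snapshot' >/dev/null 2>&1";
    std::system (cmd.c_str ());
  }

  // Hooks signal success with exit status 0.
  return 0;
}
```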
Paul Beckingham on 2017-01-16T16:18:35Z says:
Files could be individually written, say pending.data could be written to pending.data.temp, completed.data to completed.data.temp, and then finally, rename all the *.temp files. That's easy, would address this reported issue, and although not atomic, is a couple of steps in the right direction.
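For illustration, a minimal sketch of that approach - the function and its interface are invented for this example, not actual Taskwarrior code:

```cpp
// Minimal sketch of write-to-*.temp-then-rename. Each payload is written in full
// to "<name>.temp"; only once every write has succeeded are the temp files renamed
// over the originals. rename(2) within one filesystem is atomic, so a reader sees
// either the old contents or the new, never a half-written file.
// (Durability via fsync is a separate concern, discussed further down the thread.)
#include <cstdio>
#include <fstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

void write_via_temp_and_rename (const std::vector<std::pair<std::string, std::string>>& files)
{
  // Phase 1: write every payload to a sibling *.temp file.
  for (const auto& entry : files)
  {
    std::ofstream out (entry.first + ".temp", std::ios::binary | std::ios::trunc);
    out << entry.second;
    out.flush ();
    if (! out)
      throw std::runtime_error ("Failed writing " + entry.first + ".temp");
  }

  // Phase 2: rename the temp files over the real ones. Each individual rename is
  // atomic; the set of renames is not, which is the caveat discussed above.
  for (const auto& entry : files)
    if (std::rename ((entry.first + ".temp").c_str (), entry.first.c_str ()) != 0)
      throw std::runtime_error ("Failed renaming " + entry.first + ".temp");
}
```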
// Looking at the old issues, let me know if it's not helpful.
I took a look at this. There are two related problems here:
1. Protection against "normal" process termination. This is relatively easy to solve - we just need to catch SIGINT / SIGTERM & co. while doing the commit (a sketch of the idea follows below). This appears to be already implemented.
2. Protection against a crash - either due to power loss, or a bug (related issues #88, #1420). This is much more difficult - there are no simple solutions, the APIs are horrible and it's very easy to mess it up (http://danluu.com/file-consistency/, https://utcc.utoronto.ca/~cks/space/blog/unix/FileSyncProblem, https://lwn.net/Articles/667788/). Basically this involves writing a database engine which would carefully arrange journals and fsync/fdatasync calls. I don't think it's worth doing as part of Taskwarrior - it is best to reuse an existing library, such as sqlite or some embedded key-value database like LMDB or LevelDB. This was discussed before and once again proposed in the recent #2507. This is a big decision so I don't think we should do it here.
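Regarding the first point, here is a minimal sketch of holding SIGINT / SIGTERM for the duration of a commit. It defers delivery with sigprocmask instead of installing handlers, and is an illustration of the idea rather than the code Taskwarrior actually ships:

```cpp
// Minimal sketch: defer SIGINT/SIGTERM while the data files are being committed,
// so a Ctrl+C cannot interrupt a half-finished write. Illustrative only.
#include <functional>
#include <signal.h>

void commit_without_interruption (const std::function<void ()>& commit)
{
  sigset_t block, previous;
  sigemptyset (&block);
  sigaddset (&block, SIGINT);
  sigaddset (&block, SIGTERM);

  // Delivery is postponed, not lost, while the signals are blocked.
  sigprocmask (SIG_BLOCK, &block, &previous);

  commit ();

  // Restore the previous mask; any pending signal is delivered now.
  sigprocmask (SIG_SETMASK, &previous, nullptr);
}
```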
Two smaller improvements which may help with some crashes are:
- add fsync/fdatasync calls to the end of File::close. This will cause a small performance penalty, so not everyone may be happy.
- in the non-append codepath of TF2::commit, instead of truncating and overwriting the file in-place, we can create a new file (e.g. with a .tmp suffix) and then rename it over the existing one.
I can work on these two things. (2) should also fix corruption on a full disk.
// Looking at the old issues, let me know if it's not helpful.
It's definitely helpful!
add fsync/fdatasync calls to the end of File::close. This will cause a small performance penalty, so not everyone may be happy.
In general, I'm in favour of trading off (small) performance hits for better robustness - I think we should do this. Perhaps some of the cost could be offset by only fdatasyncing if we actually modified the file (though perhaps in that case the cost of the fdatasync
is negligible?)
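To make that concrete, here is a rough sketch of the kind of change meant, with a dirty flag so the sync only happens when the file was actually modified. The class is invented for this example; it is not the real File class from libshared:

```cpp
// Rough sketch of "fdatasync before close". fdatasync flushes the file's data
// (and the metadata needed to read it back) to stable storage, so a crash right
// after close cannot lose the write.
#include <unistd.h>

class FileSketch
{
public:
  bool close ()
  {
    if (_fd == -1)
      return true;

    bool ok = true;

    // Only pay the fdatasync cost if we actually wrote to the file.
    if (_dirty && fdatasync (_fd) != 0)
      ok = false;

    if (::close (_fd) != 0)
      ok = false;

    _fd    = -1;
    _dirty = false;
    return ok;
  }

private:
  int  _fd    = -1;    // set by open (), omitted here
  bool _dirty = false; // set by write (), omitted here
};
```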
in the non-append codepath of TF2::commit, instead of truncating and overwriting the file in-place, we can create a new file (e.g. with a .tmp suffix) and then rename it over the existing one.
:100: Indeed, the full disk corruption has hit some users over the years.
@vrusinov The discussion here, I guess, correctly points out that all data files would have to be moved in one transaction for this to ensure 100% correctness. However, given this issue has been open for more than a decade (yikes!), I think your approach for (2) is more than sufficient, and better than the current state of things :slightly_smiling_face:
Just FYI, there is a version of approach (2) in the timewarrior code base (https://github.com/GothenburgBitFactory/timewarrior/pull/283) that also includes tests with libfiu.
While it uses global state under the hood, it works by replacing all the File objects with an AtomicFile object. AtomicFile internally creates a tmp file when the file is modified, and uses it, otherwise there is no change. Then at normal program termination, signals are masked and the temp files are moved over the real files.
https://github.com/GothenburgBitFactory/timewarrior/blob/v1.4.3/src/AtomicFile.cpp#L477-L483
Maybe that code should be updated (to have explicit file sets that can be "finalized" instead of the one global set) and moved to libshared?
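For readers who do not want to follow the link, here is a condensed sketch of the pattern being described - writes go to a .tmp sibling, and a single finalize step masks signals and renames every modified file over its real path. This is an illustration, not the actual timewarrior AtomicFile code:

```cpp
// Condensed sketch of the AtomicFile pattern, not the timewarrior implementation.
#include <cstdio>
#include <fstream>
#include <set>
#include <signal.h>
#include <string>

class AtomicFileSketch
{
public:
  explicit AtomicFileSketch (const std::string& path)
  : _path (path), _tmp (path + ".tmp") {}

  void write (const std::string& data)
  {
    std::ofstream out (_tmp, std::ios::binary | std::ios::trunc);
    out << data;
    _modified.insert (this);     // remember that this file needs finalizing
  }

  // Rename every modified file's temp over the real file, with signals held so
  // the batch of renames cannot be interrupted halfway through.
  static void finalize_all ()
  {
    sigset_t block, previous;
    sigfillset (&block);
    sigprocmask (SIG_BLOCK, &block, &previous);

    for (AtomicFileSketch* file : _modified)
      std::rename (file->_tmp.c_str (), file->_path.c_str ());
    _modified.clear ();

    sigprocmask (SIG_SETMASK, &previous, nullptr);
  }

private:
  std::string _path;
  std::string _tmp;

  // The "one global set" mentioned above; explicit per-transaction sets would
  // remove this global state.
  static std::set<AtomicFileSketch*> _modified;
};

std::set<AtomicFileSketch*> AtomicFileSketch::_modified;
```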
@sruffell thanks, yeah - I should be able to reuse at least some of that!
An OT-based data model will definitely have stronger consistency guarantees. That means it needs much better support for crash-proof on-disk storage, as discussed here; and it means that hand-editing the data is a very bad idea. Both of those points strongly suggest a binary format. As the data is pretty strictly key/value, I think LMDB or something like it makes sense. Sqlite is more general and thus a bit harder to use, and still invites manual editing by people who think they know what they are doing :)
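To illustrate why a key/value store fits, here is a minimal sketch of writing one task into LMDB inside a single atomic transaction. The UUID-to-serialized-blob layout is made up for this example; it is not a format Taskwarrior defines:

```cpp
// Minimal LMDB sketch (link with -llmdb). The transaction commit is atomic and
// durable: either the whole write lands on disk or none of it does.
#include <lmdb.h>
#include <stdexcept>
#include <string>

static void check (int rc)
{
  if (rc != MDB_SUCCESS)
    throw std::runtime_error (mdb_strerror (rc));
}

void store_task (const std::string& dbdir,       // must be an existing directory
                 const std::string& uuid,
                 const std::string& serialized)
{
  MDB_env* env;
  check (mdb_env_create (&env));
  check (mdb_env_open (env, dbdir.c_str (), 0, 0664));

  MDB_txn* txn;
  check (mdb_txn_begin (env, nullptr, 0, &txn));

  MDB_dbi dbi;
  check (mdb_dbi_open (txn, nullptr, 0, &dbi));   // default, unnamed database

  MDB_val key {uuid.size (),       const_cast<char*> (uuid.data ())};
  MDB_val val {serialized.size (), const_cast<char*> (serialized.data ())};
  check (mdb_put (txn, dbi, &key, &val, 0));

  check (mdb_txn_commit (txn));
  mdb_env_close (env);
}
```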
@vrusinov The initial work that calls fsync/fdatasync is now merged. Let's keep this open until we have reconciled / generalized @sruffell's work with the AtomicFile interface?
@tbabej thanks for merging, and thanks for bumping libshared submodule here too. Will be good to have some mileage via tests and/or people who compile from head.
AtomicFile is the next one here - it'll take me a wee bit longer (vacation + I'm not very good at C++). I intend to resolve this issue after AtomicFile is done. I agree with looking into binary formats and some embedded database but I think it should be done/discussed in a separate issue.
I agree with looking into binary formats and some embedded database but I think it should be done/discussed in a separate issue.
Just taking a note that using an embedded database for speed + consistency also came up in timewarrior's issue tracker (https://github.com/GothenburgBitFactory/timewarrior/issues/244), where it was pointed out to me that part of the Taskwarrior Philosophy is:
Your data is kept as plain text, and never held hostage by a proprietary format.
@sruffell Yeah, and I do think some users feel strongly about that, but it should not stand in the way of progress. In particular, technically, the current data format is already custom (there are no third-party parsers for it), if not proprietary per se.
I think one way to reconcile this going forward would be to provide pluggable backend systems, where an embedded database would be the more performant / reliable format for users who do not care about having their data stored in plain text, while still providing the option of plain text for users who feel strongly about it.
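Purely to illustrate what a pluggable backend could look like at the code level - none of these types exist in Taskwarrior today, the names are made up:

```cpp
// Hypothetical backend interface: the rest of the code talks to TaskBackend and
// does not care whether tasks live in plain-text files or an embedded database.
#include <optional>
#include <string>
#include <vector>

struct TaskRecord
{
  std::string uuid;
  std::string data;     // serialized task, format left to the backend
};

class TaskBackend
{
public:
  virtual ~TaskBackend () = default;
  virtual std::optional<TaskRecord> get (const std::string& uuid) = 0;
  virtual void put (const TaskRecord& task) = 0;
  virtual std::vector<TaskRecord> all () = 0;
};

// One implementation would keep today's plain-text files, another could wrap an
// embedded key/value store; which one is used would be a configuration choice.
```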
@vrusinov any unfinished work with AtomicFile that we could try to get over the finish line? This would be a great stability improvement to ship as part of 2.6.0 :slightly_smiling_face:
@tbabej sorry, nothing to show yet - too many events outside github recently :)
I am hoping to get something up for discussion this week though
Draft is here, not quite ready to be merged (needs more tests): https://github.com/GothenburgBitFactory/libshared/pull/64
Taskchampion uses a SQLite database, which is log-based and does not lose data on interruption.