Failure recovery
After a crash and potentially broken records, the storage should heal itself.
This can be achieved by the following steps (a sketch of such a recovery pass follows after the list):
- truncate all indexes to valid file sizes (currently throws a `new Error('Index file is corrupt!')`)
- check if the partition contains an invalid document (unfinished write); if so, truncate the partition (currently throws a `new Error('Can only truncate on valid document boundaries.')`)
- check if the partition contains more documents than its index; if so, reindex the missing documents
- check if the partition contains fewer documents than its index; if so, truncate the index
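A minimal sketch of such a recovery pass, in case it helps the discussion. All method names on the storage, index and partition objects (`truncateToValidSize`, `findLastValidPosition`, `indexFor`, `readDocumentsFrom`, `lastPosition`, `truncateAfter`) are hypothetical placeholders for the steps above, not the actual node-event-storage API:

```js
// Sketch of a self-healing pass at storage open time (hypothetical method names).
function healStorage(storage) {
    for (const index of storage.indexes) {
        // 1. cut indexes back to the last fully written entry instead of throwing
        index.truncateToValidSize();
    }
    for (const partition of storage.partitions) {
        // 2. drop a trailing unfinished write instead of throwing
        const lastValid = partition.findLastValidPosition();
        if (lastValid < partition.size) {
            partition.truncate(lastValid);
        }

        const index = storage.indexFor(partition);
        if (partition.documentCount > index.length) {
            // 3. the partition has documents the index does not know about -> reindex them
            for (const document of partition.readDocumentsFrom(index.lastPosition())) {
                index.add(document);
            }
        } else if (partition.documentCount < index.length) {
            // 4. the index points past the end of the partition -> truncate the index
            index.truncateAfter(partition.documentCount);
        }
    }
}
```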
See https://cseweb.ucsd.edu/~swanson/papers/DAC2011PowerCut.pdf for a paper researching the behavior of different SSDs on power failure.
To effectively check for corrupted documents, a checksum is necessary (see #72); otherwise only unfinished writes can be detected. However, filesystems typically increase the file size first and then write to the file, so the file size could be correct but the contents corrupted. This can only be detected for corruptions that break the serialization format, and there is still a chance that a document gets deserialized that was never fully written. The checksum only needs to be verified at startup, so general read performance is not reduced.
The checksum should incorporate the previous document's checksum in order to also guarantee immutability of the whole partition.
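A chained checksum could look roughly like this; only a sketch, and the field names (`payload`, `checksum`) and the choice of SHA-256 are assumptions, not part of any agreed format:

```js
const crypto = require('crypto');

// Each document stores a checksum over its own payload plus the previous
// document's checksum, so modifying any earlier document invalidates the
// checksums of every document after it.
function chainedChecksum(payload, previousChecksum) {
    return crypto.createHash('sha256')
        .update(previousChecksum || '')
        .update(payload)
        .digest('hex');
}

// Verification only needs to walk the chain once, at startup.
function verifyPartition(documents) {
    let previous = '';
    for (const document of documents) {
        if (document.checksum !== chainedChecksum(document.payload, previous)) {
            return false; // corruption (or tampering) detected
        }
        previous = document.checksum;
    }
    return true;
}
```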
After thinking more about checksums, this should be solely a serializer concern and hence fully pluggable. Dictating a checksum inside the document format has a couple of consequences:
- it decreases write performance, because every write needs to calculate the checksum first
- the rules for when to verify the checksum need to be configurable, or read performance also suffers
- it adds more complexity into the storage layer and requires additional trade-offs for the document header decision (see discussion #72)
- some serialization formats might already contain a checksum, so the work would be done twice
- some use-cases might require stricter checksums than others (just a parity byte, crc32 or shaXsum); dictating one would neglect all others
- the choice of checksum algorithm would be hardcoded into the storage format (i.e. the partition format version), and changing it would be a hard backwards-compatibility break
- for JSON serialization, the common error case (a torn write) is already ruled out because such a document can no longer be deserialized (illustrated below)
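To illustrate the last point: a torn JSON write leaves a truncated string behind that JSON.parse rejects, so the damaged record cannot be read back by accident:

```js
// A torn write leaves a truncated JSON string behind; JSON.parse throws on it,
// so the unfinished record is detected without any checksum.
const full = JSON.stringify({ type: 'OrderPlaced', amount: 42 });
const torn = full.slice(0, full.length - 5); // simulate a write that was cut short

try {
    JSON.parse(torn);
} catch (err) {
    // SyntaxError: the torn write is detected and the partition can be truncated here
}
```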
So checksums only play a role in the following use-cases:
- data is transferred over a medium that may corrupt single bytes
- a custom serialization format is used that has less formalism than JSON, allowing it to technically deserialize incomplete records (msgpack, protobuf)
- guaranteeing the immutability of the store by checksumming over previous documents
For all those use cases, it is good enough and relatively easy to achieve by changing the serializer methods. Maybe the common use cases (custom serialization format, immutability guarantees) should be shown in the documentation; a possible shape is sketched below.
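A checksumming serializer wrapper could look roughly like this. The `{ serialize, deserialize }` shape is assumed for illustration and not checked against the actual serializer interface:

```js
const crypto = require('crypto');

// Sketch: the checksum lives entirely in the (pluggable) serializer,
// so the storage layer and partition format stay unchanged.
function withChecksum(inner) {
    return {
        serialize(document) {
            const payload = inner.serialize(document);
            const checksum = crypto.createHash('sha1').update(payload).digest('hex');
            return checksum + '|' + payload;
        },
        deserialize(data) {
            const separator = data.indexOf('|');
            const checksum = data.slice(0, separator);
            const payload = data.slice(separator + 1);
            const actual = crypto.createHash('sha1').update(payload).digest('hex');
            if (actual !== checksum) {
                throw new Error('Checksum mismatch - document is corrupted.');
            }
            return inner.deserialize(payload);
        }
    };
}

// usage: wrap whatever serializer the storage is configured with
const jsonSerializer = { serialize: JSON.stringify, deserialize: JSON.parse };
const checkedSerializer = withChecksum(jsonSerializer);
```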
Requires #24 in order to fix the global index in case it is broken.
With #145 included, the next steps are roughly like this (a rough sketch in code follows after the footnote):
- truncating should not throw an exception if the truncate position is at a valid document boundary (directly following a separator) [✔️ #151]
- on opening a storage, all partitions should be checked for torn writes, i.e. whether the last document is not terminated with a document separator
- for all partitions, the document sequence number of the last valid document should be returned; if there is a torn write, the sequence number of that torn write should be returned as well
- (if there were torn writes) the whole storage should be truncated after the lowest torn write sequence number* [✔️ #155]
- check if the primary index is up to date with the highest sequence number across all partitions; if not, reindex all documents following the last indexed document (by scanning all partitions backwards for the first document with a sequence number lower than or equal to the last indexed one)
* Another option would be to truncate only the individual torn writes and keep potentially successfully written later documents in other partitions. However, this would mean that documents go missing in between and sequence numbers have holes. Also, indexes would still potentially point to non-existing/wrong documents.
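Roughly what the open-time check could look like, as a sketch only; `checkLastDocument`, `truncateAfter`, `reindexFrom` and the `primaryIndex` properties are made-up names standing in for the steps above:

```js
// Sketch of the recovery steps on opening a storage (hypothetical method names).
function recoverAfterCrash(storage) {
    let lowestTornSequence = Infinity;
    let highestValidSequence = 0;

    for (const partition of storage.partitions) {
        // a torn write = the last document is not terminated by a document separator
        const { lastValidSequence, tornWriteSequence } = partition.checkLastDocument();
        highestValidSequence = Math.max(highestValidSequence, lastValidSequence);
        if (tornWriteSequence !== null) {
            lowestTornSequence = Math.min(lowestTornSequence, tornWriteSequence);
        }
    }

    if (lowestTornSequence !== Infinity) {
        // truncate the whole storage after the lowest torn write sequence number
        storage.truncateAfter(lowestTornSequence - 1);
        highestValidSequence = lowestTornSequence - 1;
    }

    if (storage.primaryIndex.lastSequenceNumber < highestValidSequence) {
        // reindex everything written after the last indexed document, scanning the
        // partitions backwards to find where to resume
        storage.reindexFrom(storage.primaryIndex.lastSequenceNumber + 1);
    }
}
```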