arctic
Initial work on transactions for chunkstore
- Please do not merge; this is a work in progress. I do want to get some feedback now as I finish out the testing of this change.
- On every write, a transaction document is created. When the write completes, the transaction document is removed. Reads and writes check for this document and, if it's present, raise an error.
- Supplies a recovery function that can remove the data from a failed transaction. Chunkstore has no way of knowing whether a write is in progress or has failed, so it cannot recover automatically.
- Still testing the recovery method and adding unit tests.
Is my understanding correct that this happens because you are basically writing the chunk (data + metadata) sequentially, due to bulk_write not being atomic, and any interrupt causes a bad intermediate state because the metadata is already committed?
- Can better SIGINT handling, or even a try / finally, help with this to start with?
- Also, personally I find using a context manager cleaner than explicit start/end calls, if possible.
- Can you add the cleanup as part of some generic _fsck-like op that fixes all of these cases?
I haven't spent much time with chunkstore so apologies in advance for stupid questions
Yes, chunkstore creates the chunks and then writes them one at a time in a bulk operation, so any interruption of that write causes a corrupted state. A power outage or SIGKILL would be a case where a finally or the like would not be sufficient. There is also the case where some sort of multiprocessing code might try to read/write the same chunk, which would also cause issues. I can certainly change the code to use a context manager, but I want to make sure the code works as expected before that (this is WIP). Also, trying to auto-recover/fix is not really possible due to the nature of the issue: you can definitely hit an "invalid read" or write scenario where you wouldn't want to try to fix the issue automatically (i.e. a concurrent reader and writer).
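To illustrate why recovery has to be invoked manually rather than automatically, here is a rough sketch of what such a recovery function might look like. Plain Python lists/sets stand in for the Mongo collections, and the `in_transaction` flag and function name are assumptions for illustration only; the operator decides when it is safe to call it (i.e. when no writer could still be in progress):

```python
def recover_failed_write(chunks, open_transactions, symbol):
    """Manually-invoked recovery: delete the chunks written under a failed
    transaction for `symbol`, then clear its marker.

    `chunks` stands in for the chunk collection (a list of dicts) and
    `open_transactions` for the marker collection (a set of symbols).
    Returns the number of orphaned chunks removed.
    """
    if symbol not in open_transactions:
        return 0  # no failed transaction to clean up

    # Chunks written under an open transaction carry a flag; remove them.
    orphaned = [c for c in chunks
                if c["symbol"] == symbol and c.get("in_transaction")]
    for chunk in orphaned:
        chunks.remove(chunk)

    # Clear the marker so the symbol is readable/writable again.
    open_transactions.discard(symbol)
    return len(orphaned)
```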
@bmoscon Any updates on this?
@bmoscon We would love to use this in production in our application, but this issue is holding us back. Any update on when we can expect a fix, please?
@harryd31 fix is here, give it a try, let me know if it works or doesn't work
@bmoscon Thank you, I will give it a try and let you know.