
Initial work on transactions for chunkstore

Open bmoscon opened this issue 5 years ago • 6 comments

  • Please do not merge, this is a work in progress - I do want to get some feedback now as I finish out the testing of this change
  1. On every write, creates a transaction document. When the write completes, removes the transaction document. Reads and writes check for this document and, if it's present, raise an error.

  2. Supplies a recovery function that can remove the data from a failed transaction. Chunkstore has no way of knowing whether a write is in progress or has failed, so it cannot recover automatically.

Still testing the recovery method and adding unit tests
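
To make the two points above concrete, here is a minimal sketch of the marker-document idea. It is not the code in this change: `ToyChunkStore`, `TransactionInProgress`, the `fail_after` parameter, and the in-memory dicts standing in for the MongoDB collections are all assumptions made for illustration.

```python
# Hypothetical in-memory stand-in for chunkstore; none of this is arctic's API.
class TransactionInProgress(Exception):
    """Raised when a symbol still has a transaction document."""


class ToyChunkStore:
    def __init__(self):
        self.chunks = {}        # (symbol, chunk_id) -> (txn_id, data)
        self.transactions = {}  # symbol -> txn_id of the write in flight
        self._next_txn = 0

    def _check(self, symbol):
        # Reads and writes both refuse to run while a marker exists.
        if symbol in self.transactions:
            raise TransactionInProgress(symbol)

    def write(self, symbol, chunks, fail_after=None):
        """Write chunks one at a time; fail_after simulates a crash mid-write."""
        self._check(symbol)
        self._next_txn += 1
        txn_id = self._next_txn
        self.transactions[symbol] = txn_id            # create the marker
        for i, (chunk_id, data) in enumerate(chunks.items()):
            if fail_after is not None and i >= fail_after:
                raise RuntimeError("simulated crash")  # marker is left behind
            self.chunks[(symbol, chunk_id)] = (txn_id, data)
        del self.transactions[symbol]                 # remove it on success

    def read(self, symbol):
        self._check(symbol)
        return {cid: data
                for (sym, cid), (_, data) in self.chunks.items()
                if sym == symbol}

    def recover(self, symbol):
        """Manual recovery: drop the chunks written by the failed transaction.

        Simplification: a chunk the failed write had overwritten is deleted,
        not restored to its previous contents.
        """
        txn_id = self.transactions.pop(symbol)
        stale = [key for key, (t, _) in self.chunks.items()
                 if key[0] == symbol and t == txn_id]
        for key in stale:
            del self.chunks[key]


store = ToyChunkStore()
store.write("AAPL", {"2019-05": [1, 2]})
try:
    store.write("AAPL", {"2019-06": [3], "2019-07": [4]}, fail_after=1)
except RuntimeError:
    pass  # "AAPL" now has a marker; reads and writes raise until recovery
store.recover("AAPL")
print(store.read("AAPL"))
```

Recovery is an explicit call here because, as in point 2, the store cannot tell a crashed writer from one that is still running.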

bmoscon avatar May 19 '19 19:05 bmoscon

Is my understanding correct that this happens because you are basically writing the chunk (data + metadata) sequentially, since bulk_write is not atomic, and any interrupt leaves a bad intermediate state due to already-committed metadata?

  • Could better SIGINT handling, or even a try / finally, help with this to start with?
  • Also, personally I find a context manager cleaner than explicit start/end calls, if possible.
  • Can you add the cleanup as part of some generic _fsck-like op that fixes all of these cases?

I haven't spent much time with chunkstore so apologies in advance for stupid questions

shashank88 avatar May 20 '19 21:05 shashank88

Yes, chunkstore creates the chunks and then writes them one at a time in a bulk operation, so any interruption of that write leaves a corrupted state. A power outage or SIGKILL would be a case where a finally or the like would not be sufficient. There is also the case where some sort of multiprocessing code might try to read/write the same chunk, which would also cause issues. I can certainly change the code to use a context manager, but I want to make sure the code works as expected before that (this is WIP). Also, trying to auto-recover/fix is not really possible due to the nature of the issue. You can definitely hit an "invalid read" or write scenario where you wouldn't want to try to fix the issue (i.e. a concurrent reader and writer).
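
For reference, a context-manager version could look roughly like the sketch below. This is only an illustration under assumptions: `transaction`, `TransactionInProgress`, and the plain `set` standing in for the transaction collection are hypothetical names, not what the WIP code uses. The marker is deliberately not cleared in a `finally`, so an exception leaves it in place just as a SIGKILL or power outage would.

```python
from contextlib import contextmanager


class TransactionInProgress(Exception):
    """Raised when a marker for the symbol already exists (hypothetical)."""


@contextmanager
def transaction(markers, symbol):
    """markers is a set standing in for the transaction collection."""
    if symbol in markers:
        raise TransactionInProgress(symbol)
    markers.add(symbol)
    yield
    # Reached only when the with-body finished without raising.
    # No try/finally on purpose: after a failure the marker must remain
    # so that later reads and writes see the partial write.
    markers.discard(symbol)


markers = set()

with transaction(markers, "AAPL"):
    pass  # write the chunks here
print("AAPL" in markers)  # marker removed after a clean write

try:
    with transaction(markers, "AAPL"):
        raise KeyboardInterrupt  # e.g. SIGINT in the middle of the write
except KeyboardInterrupt:
    pass
print("AAPL" in markers)  # marker still present after the interrupted write
```

A concurrent reader or writer that hits the marker gets `TransactionInProgress`, and deciding whether to wait or run recovery is still left to the caller.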

bmoscon avatar May 20 '19 21:05 bmoscon

@bmoscon Any updates on this?

scoriiu avatar Oct 09 '19 06:10 scoriiu

@bmoscon We would love to use this in production in our application; however, this issue is holding us back... Any update on when we can expect a fix, please?

harryd31 avatar Oct 13 '19 15:10 harryd31

@harryd31 The fix is here. Give it a try and let me know if it works or doesn't work.

bmoscon avatar Oct 18 '19 00:10 bmoscon

@bmoscon Thank you, I will give it a try and let you know.

harryd31 avatar Oct 21 '19 19:10 harryd31