dice icon indicating copy to clipboard operation
dice copied to clipboard

BGREWRITEAOF is not failure safe

Open arpitbbhayani opened this issue 3 years ago • 17 comments

Describe the bug BGREWRITEAOF starts overwriting the existing file; hence in case of a failure, while creating the AOF file, the old flush is lost while the new one is partially updated.

To Reproduce

  1. add random sleep in BGREWRITEAOF after each command is flushed on the disk
  2. crash the process while BGREWRITE is happening
  3. see the AOF file content

Expected behavior Until the new AOF file is not ready, we should not alter the old file.

arpitbbhayani avatar Oct 16 '22 07:10 arpitbbhayani

Yes, this is a major issue when I tested. A most stable policy would be to write the new flushes to a temporary file and then synchronize them with the main thread's AOF file. I'll work on this.

rohanverma94 avatar Oct 16 '22 07:10 rohanverma94

What if we write to a temporary file and then do a rename?

I think it should be okay if we do not maintain a sync through thread. AOG persistence is anyway just a backup not the primary storage.

Let's evaluate the complexities once.

arpitbbhayani avatar Oct 16 '22 09:10 arpitbbhayani

Yes, creating a temporary and then renaming is also a good approach. To reduce complexities and to keep the concurrency model simple enough we can delegate the write to fsync() and fdatasync() system calls.

Regarding AOF persistence, fsync is for durability guarantees and is the least costly among other options to implement the same.

Yes, the primary function is in-memory operations. Redis is not durable by default and if someone uses it as a durable store with AOF and with frequent fsync we would loose most of Redis's advantages. A Redis with frequent fsync is a "Kafka in hiding" :smile:

rohanverma94 avatar Oct 16 '22 10:10 rohanverma94

Hello, can i take this?

amityahav avatar Aug 02 '24 10:08 amityahav

Hello, can i take this?

Assigned

JyotinderSingh avatar Aug 02 '24 16:08 JyotinderSingh

@arpitbbhayani , hey , i see that currently dice does not support actual aof logging right after each mutation operation but it solely for "snapshoting" the current state, i mean, there's is no actual rewrite of some runtime increasing aof file but rewrite of previous rewrites which are already optimized as the smallest aof possible for that moment. anyways, do you plan to support immediate logging in the future?

amityahav avatar Aug 02 '24 21:08 amityahav

btw did you reproduce it on mac or linux? cuz on mac forking seems not to work properly

amityahav avatar Aug 03 '24 12:08 amityahav

@amityahav I have not worked on this for a very long time. So I lack context :) Also, I think we do lack logging.

arpitbbhayani avatar Aug 03 '24 15:08 arpitbbhayani

@amityahav, are there any findings from your side on this?

AshwinKul28 avatar Aug 16 '24 06:08 AshwinKul28

@AshwinKul28 Hey , started thinking about it but didn't have the chance to start working on it. I hope to start this weekend

amityahav avatar Aug 16 '24 10:08 amityahav

@AshwinKul28 Hey, did you manage to run this command on a Mac M arch? syscall.Syscall(syscall.SYS_FORK, 0, 0, 0) does not seem to work for me. how ever, executing syscall.Syscall(syscall.SYS_CLONE, 0, 0, 0) on ARM64 linux works well

amityahav avatar Aug 17 '24 14:08 amityahav

https://github.com/DiceDB/dice/pull/398 @arpitbbhayani this should be fixed with this

https://github.com/DiceDB/dice/issues/397

ashwaniYDV avatar Aug 24 '24 18:08 ashwaniYDV

@ashwaniYDV Hey, I've been assigned to that and as you can see there's already a draft for it. What I've done extends your PR. I'm about to also ensure no 2 flushing processes can occur simultaneously. As well as cleanup when there's a failure

amityahav avatar Aug 24 '24 19:08 amityahav

Hey @AshwinKul28 , @arpitbbhayani Its seems that it is not trivial to fork a GO process just like it is done in C. I tried to implement Fork and Wait both in darwin and linux archs:

  • for darwin i have managed to get this to work, i was able to rewrite the AOF file and wait for the child process to terminate in the parent side. it was done using syscall.Syscall(syscall.SYS_FORK, 0, 0, 0) and waiting using: syscall.Wait4(int(childPID), &ws, 0, nil)
  • however, for linux the fork syscall is syscall.Syscall(syscall.SYS_CLONE, 0, 0, 0) and using the same syscall.Wait4(int(childPID), &ws, 0, nil) (i didnt find any alternative for the wait call). Wait syscall always returns an error no such processes to wait. the parent process successfully forks itself and returns the child's PID. when trying to wait on the returned PID the error mentioned is returned. for some reason it is unable to detect the child proc.

I would very much appreciate if you have some exprience, insights on this, since im out of ideas here and the Wait syscall is required to properly handle the BGREWRITE command.

Thanks

amityahav avatar Sep 07 '24 15:09 amityahav

Hello @amityahav,

There has been no activity on this issue for the past 5 days. It would be awesome if you keep posting updates to this issue so that we know you are actively working on it.

We are really eager to close this issue at the earliest, hence if we continue to see the inactivity, we will have to reassign the issue to someone else. We are doing this to ensure that the project maintains its momentum and others are not blocked on this work.

Just drop a comment with the current status of the work or share any issues you are facing. We can always chip in to help you out.

Thanks again.

arpitbbhayani avatar Sep 13 '24 06:09 arpitbbhayani

@arpitbbhayani bro I'm literally posting questions above to you and you are not answering to any of them. I need help on this, please read them!

amityahav avatar Sep 13 '24 06:09 amityahav

@amityahav I tried looking into it, but even I do not have any immediate idea about the fix. Also, I do not think the solution will be platform-specific, but I could be wrong.

If we are unsure of the solution or the approach, we can park this issue for now and pick this up, when this really comes back to bite us. But thank you so much for working on this. The insights you left in the comments did give some clarity.

So, moving this to icebox for now. Apologies and thank you.

arpitbbhayani avatar Sep 13 '24 10:09 arpitbbhayani