syncmaildir icon indicating copy to clipboard operation
syncmaildir copied to clipboard

smd can destroy the entire mail spool

Open anarcat opened this issue 3 years ago • 2 comments

I have seen at least three cases of smd attempting to destroy my entire mail spool. In the three cases, I have mostly recovered thanks to backups, notmuch, and the distributed nature of SMD/email. But it's nerve-wracking and, well, really, really bad.

I still don't quite understand what happens. I can't find it in the logs anymore, but I remember seeing something about smd tripping over an empty directory inside the .smd directory. I wrote about his on my blog here:

https://anarc.at/blog/2021-03-22-email-crash/ https://anarc.at/blog/2021-06-29-another-mail-crash/

I think the fundamental problem is that smd-pull and smd-push's locking mechanisms might have a bug. I was able to start both processes at once by furiously hitting "reload" in my email setup (essentially telling systemd to repeately run smd-pull.service), which eventually started at the same time as smd-push. Here's a log from systemd:

jun 28 16:30:04 curie systemd[2845]: Starting pull emails with syncmaildir...
jun 28 16:30:09 curie systemd[2845]: smd-pull.service: Succeeded.
jun 28 16:30:09 curie systemd[2845]: Finished pull emails with syncmaildir.
jun 28 16:30:09 curie systemd[2845]: Starting push emails with syncmaildir...
jun 28 16:30:09 curie systemd[2845]: Starting pull emails with syncmaildir...
jun 28 16:30:09 curie systemd[2845]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
jun 28 16:30:09 curie systemd[2845]: smd-pull.service: Failed with result 'exit-code'.
jun 28 16:30:09 curie systemd[2845]: Failed to start pull emails with syncmaildir.
jun 28 16:30:22 curie systemd[2845]: Starting pull emails with syncmaildir...
jun 28 16:30:22 curie smd-pull[391140]: Already running.
jun 28 16:30:22 curie smd-pull[391140]: If this is not the case, remove /home/anarcat/.smd/lock by hand.
jun 28 16:30:22 curie smd-pull[391140]: any: smd-pushpull@localhost: TAGS: error::context(locking) probable-cause(another-instance-is-running) human-intervention(necessary) suggested-actions(run(kill 390135) run(rm /home/anarcat/.smd/lock))
jun 28 16:30:22 curie systemd[2845]: smd-pull.service: Main process exited, code=exited, status=1/FAILURE
jun 28 16:30:22 curie systemd[2845]: smd-pull.service: Failed with result 'exit-code'.
jun 28 16:30:22 curie systemd[2845]: Failed to start pull emails with syncmaildir.
jun 28 16:31:14 curie systemd[2845]: smd-push.service: Main process exited, code=killed, status=15/TERM
jun 28 16:31:14 curie systemd[2845]: smd-push.service: Failed with result 'signal'.
jun 28 16:31:14 curie systemd[2845]: Stopped push emails with syncmaildir.
jun 28 16:31:14 curie systemd[2845]: smd-push.service: Consumed 6.600s CPU time.
jun 28 16:31:22 curie systemd[2845]: Starting push emails with syncmaildir...
jun 28 16:31:23 curie smd-push[391563]: Found lockfile of a dead instance. Ignored.
jun 28 16:31:30 curie smd-push[392598]: register: smd-client@smd-server-register: TAGS: stats::new-mails(0), del-mails(0), bytes-received(0), xdelta-received(216)
jun 28 16:31:30 curie systemd[2845]: smd-push.service: Succeeded.
jun 28 16:31:30 curie systemd[2845]: Finished push emails with syncmaildir.
jun 28 16:31:30 curie systemd[2845]: smd-push.service: Consumed 4.727s CPU time.
jun 28 16:31:38 curie systemd[2845]: Starting pull emails with syncmaildir...
jun 28 16:32:22 curie smd-pull[393092]: default: smd-client@localhost: TAGS: stats::new-mails(0), del-mails(51458), bytes-received(0), xdelta-received(247)
jun 28 16:32:22 curie systemd[2845]: smd-pull.service: Main process exited, code=killed, status=15/TERM
jun 28 16:32:22 curie systemd[2845]: smd-pull.service: Failed with result 'signal'.
jun 28 16:32:22 curie systemd[2845]: Stopped pull emails with syncmaildir.
jun 28 16:32:22 curie systemd[2845]: smd-pull.service: Consumed 12.632s CPU time.

The first thing to notice is that, at 16:30:09, two jobs are started in parallel. This is a bug in the systemd configuration, which shouldn't allow that in the first place. But considering that smd has its own locking primitives, I consider it's also a bug there. At 16:30:22, you see me triggering another run, and then smd-pull notices its other smd-pull process running and aborts. But it's still running because it's actively destroying all mail on the remote server!!

At 16:31:14 you can see I notice the problem and kill the process. And then at 16:32:22 you can see the timer i forgot to disable fired and destroyed thousands of emails locally.

Out fo 300k emails, I lost 220k in this case.

I'll send a PR fixing the systemd configuration, but I can't help but think there should be safeguards in SMD preventing it from destroying everything like this.

anarcat avatar Jun 29 '21 15:06 anarcat

Maybe my implementation of a lock file https://github.com/gares/syncmaildir/blob/388f3ebd351b3cb046b7d3d9c0cce12acdf2e18c/smd-common#L444 is buggy. I don't know if there is a sound lock implementation in shell script.

gares avatar Jun 30 '21 12:06 gares

that does look "racey": if two programs run this, one could check for the lockfile existence ([ -f $LOCKFILE ]) right before the other creates it (echo $$ > $LOCKFILE) not find it, and then overwrite the other program's lockfile...

typically, you would call flock to do atomic locking, but it doesn't work in all situations either (e.g. NFS). locking is hard, particularly from shell...

https://manpages.debian.org/flock

anarcat avatar Jun 30 '21 13:06 anarcat