
Snapshot save error

Open uber42 opened this issue 3 years ago • 7 comments

panic: /home/user/repos/storage-69397425/3/raft/dev/00000000000000000001/snapshot-part-1/snapshot-1-3 doesn't exist when creating /home/user/repos/storage-69397425/3/raft/dev/00000000000000000001/snapshot-part-1/snapshot-1-3/snapshot-00000000000003E9-3.generating

goroutine 350 [running]:
github.com/lni/dragonboat/v3/internal/fileutil.Mkdir({0xc004358280, 0x97}, {0x1639918, 0x1d79520})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/internal/fileutil/utils.go:122 +0x2dc
github.com/lni/dragonboat/v3/internal/server.(*SSEnv).createDir(0xc01f9486f0, {0xc004358280, 0x97})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/internal/server/snapshotenv.go:251 +0x86
github.com/lni/dragonboat/v3/internal/server.(*SSEnv).CreateTempDir(0xc01f9486f0)
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/internal/server/snapshotenv.go:200 +0x45
github.com/lni/dragonboat/v3.(*snapshotter).Save(_, {_, _}, {0x3, 0x3e9, 0x169, 0x3e9, {0x0, 0x0, {0x0, ...}, ...}, ...})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/snapshotter.go:104 +0x125
github.com/lni/dragonboat/v3/internal/rsm.(*StateMachine).doSave(_, {0x3, 0x3e9, 0x169, 0x3e9, {0x0, 0x0, {0x0, 0x0}, 0x0, ...}, ...})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/internal/rsm/statemachine.go:802 +0x193
github.com/lni/dragonboat/v3/internal/rsm.(*StateMachine).concurrentSave(_, {_, _, {_, _}, _, _})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/internal/rsm/statemachine.go:758 +0x358
github.com/lni/dragonboat/v3/internal/rsm.(*StateMachine).Save(_, {_, _, {_, _}, _, _})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/internal/rsm/statemachine.go:509 +0x2a5
github.com/lni/dragonboat/v3.(*node).doSave(0xc000420800, {0x0, 0x0, {0x0, 0x0}, 0x0, 0x0})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/node.go:705 +0x2d6
github.com/lni/dragonboat/v3.(*node).save(0xc000420800, {0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0, 0x0, 0x0, ...})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/node.go:684 +0x7b
github.com/lni/dragonboat/v3.(*ssWorker).save(0xc0003a9f60, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0, 0x0, 0x0, ...}, ...})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/engine.go:296 +0x78
github.com/lni/dragonboat/v3.(*ssWorker).handle(0xc0003a9f60, {{0x0, 0x0, 0x0, 0x0, 0x0, 0x1, 0x0, 0x0, 0x0, ...}, ...})
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/engine.go:279 +0xba
github.com/lni/dragonboat/v3.(*ssWorker).workerMain(0xc0003a9f60)
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/engine.go:265 +0x1bb
github.com/lni/dragonboat/v3.newSSWorker.func1()
        /home/user/go/pkg/mod/github.com/lni/dragonboat/v3@v3.3.1/engine.go:251 +0x25
github.com/lni/goutils/syncutil.(*Stopper).runWorker.func1()
        /home/user/go/pkg/mod/github.com/lni/[email protected]/syncutil/stopper.go:79 +0x173
created by github.com/lni/goutils/syncutil.(*Stopper).runWorker
        /home/user/go/pkg/mod/github.com/lni/[email protected]/syncutil/stopper.go:74 +0x133

Dragonboat version

v3.3.1

Steps to reproduce the behavior

I have not been able to reproduce it again.

uber42 avatar Feb 07 '22 16:02 uber42

Hi @uber42, thanks for reporting this issue.

Could you please confirm which filesystem was used? Is it a local file system or a networked file system such as NFS?

lni avatar Feb 11 '22 07:02 lni

Hi, I use ext4.

uber42 avatar Feb 12 '22 19:02 uber42

@uber42 thanks for the info.

As you can see from the error log:

/home/user/repos/storage-69397425/3/raft/dev/00000000000000000001/snapshot-part-1/snapshot-1-3 doesn't exist when creating /home/user/repos/storage-69397425/3/raft/dev/00000000000000000001/snapshot-part-1/snapshot-1-3/snapshot-00000000000003E9-3.generating

The dir "/home/user/repos/storage-69397425/3/raft/dev/00000000000000000001/snapshot-part-1/snapshot-1-3" is missing at the moment a new snapshot is about to be created inside it.

This dir is created when the node is started in NodeHost.startCluster(). I don't think there is any code that would delete the dir.

Is there any chance that it was deleted by some of your code?
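
For readers following along, the failure mode can be sketched in a few lines of Go. This is only an illustration, not Dragonboat's `fileutil.Mkdir`: the helper name `mkdirChecked` is hypothetical, and it returns an error where the real code panics. It refuses to create a directory whose parent is missing and reports the problem in the same wording as the panic message above.

```go
package main

import (
	"errors"
	"fmt"
	"io/fs"
	"os"
	"path/filepath"
)

// mkdirChecked is a hypothetical helper, not Dragonboat's fileutil.Mkdir.
// It creates dir only when its parent already exists and returns an error
// (rather than panicking) when the parent is missing.
func mkdirChecked(dir string) error {
	parent := filepath.Dir(dir)
	if _, err := os.Stat(parent); err != nil {
		if errors.Is(err, fs.ErrNotExist) {
			return fmt.Errorf("%s doesn't exist when creating %s", parent, dir)
		}
		return err
	}
	return os.Mkdir(dir, 0o755)
}

func main() {
	root, err := os.MkdirTemp("", "snapshot-demo")
	if err != nil {
		panic(err)
	}
	defer os.RemoveAll(root)

	// Directory names borrowed from the report, placed under a temp root.
	nodeDir := filepath.Join(root, "snapshot-1-3")
	tmpDir := filepath.Join(nodeDir, "snapshot-00000000000003E9-3.generating")

	// The node's snapshot dir is missing, as in the report: creation fails.
	fmt.Println(mkdirChecked(tmpDir))

	// Once the node's snapshot dir exists, the temp dir can be created.
	if err := os.Mkdir(nodeDir, 0o755); err != nil {
		panic(err)
	}
	fmt.Println(mkdirChecked(tmpDir)) // prints <nil>
}
```

The point of the sketch is that this error can only appear if the parent directory was removed, or never created, between node start and the snapshot attempt.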

lni avatar Feb 13 '22 14:02 lni

The root raft directory cannot be deleted by our code. This result was obtained while testing our project with various fault injections, including network partitions between nodes. Perhaps a leader change can trigger this behavior. Unfortunately, the logs were lost :(

uber42 avatar Feb 13 '22 20:02 uber42

@uber42 thanks for the info.

I have the feeling that this issue is highly unlikely to be caused by Dragonboat's code. If you check the source code, a node's snapshot dir is never deleted; Dragonboat only deletes what's in the directory. Large-scale fault injection tests have been a part of Dragonboat's development process for years, and it was fine in all those tests.

Could you please try to re-run your tests and provide the full log when you can reproduce the issue? I really want to help you get to the bottom of this. Thanks.
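
To make the distinction concrete, here is a minimal sketch of "delete what's in the directory, never the directory itself". It is not Dragonboat's cleanup code; `removeContents` and `demo` are hypothetical names used only for this illustration.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// removeContents is a hypothetical helper: it deletes every entry inside dir
// but leaves dir itself in place.
func removeContents(dir string) error {
	entries, err := os.ReadDir(dir)
	if err != nil {
		return err
	}
	for _, e := range entries {
		if err := os.RemoveAll(filepath.Join(dir, e.Name())); err != nil {
			return err
		}
	}
	return nil
}

// demo fills a temp dir with a file and a subdirectory, cleans it, and
// reports whether the dir still exists and how many entries remain.
func demo() (bool, int, error) {
	dir, err := os.MkdirTemp("", "snapshot-1-3-")
	if err != nil {
		return false, 0, err
	}
	defer os.RemoveAll(dir)

	if err := os.WriteFile(filepath.Join(dir, "old-snapshot"), []byte("x"), 0o644); err != nil {
		return false, 0, err
	}
	if err := os.Mkdir(filepath.Join(dir, "old.generating"), 0o755); err != nil {
		return false, 0, err
	}
	if err := removeContents(dir); err != nil {
		return false, 0, err
	}

	// ReadDir succeeding here shows the directory itself survived the cleanup.
	entries, err := os.ReadDir(dir)
	if err != nil {
		return false, 0, err
	}
	return true, len(entries), nil
}

func main() {
	exists, remaining, err := demo()
	if err != nil {
		panic(err)
	}
	fmt.Println(exists, remaining) // prints: true 0
}
```

With cleanup written this way, the directory can only go missing if something outside the cleanup path removes it.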

lni avatar Feb 14 '22 10:02 lni

I will try to reproduce it, but so far this is an isolated case across a very large number of tests.

uber42 avatar Feb 14 '22 11:02 uber42

@uber42 did you manage to reproduce this?

lni avatar Apr 05 '22 13:04 lni