Bug(duplication): some nodes never GC plog again after a data center failure
Bug Report
Please answer these questions before submitting your issue. Thanks!
- What did you do? The data center hosting the master cluster of our duplication setup had an accident, and most of the nodes in that room shut down within a short time. After all the nodes came back online, we found that some partitions of the duplication table never GC their private log (plog) again.
- What did you expect to see? All partitions GC their plog correctly.
- What did you see instead? The following error in stdout:

90146:E2024-05-14 15:59:52.512 (1715673592512665104 67086) replica.default8.040005fe0319646c: nfs_server_impl.cpp:221:on_get_file_size(): {nfs_service} get stat of file /home/work/ssd2/pegasus/alsgsrv-monetization-master/replica/reps/8.53.pegasus/plog/log.18129.608864535790 failed, err = No such file or directory
We can see that this replica is requesting an old plog file.
Because the partition cannot clear its plog as normal, the disk keeps filling up, so we have to clear old plog files manually from time to time (see the sketch below).
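For context, here is a minimal sketch of that kind of manual cleanup. It is not a Pegasus tool: the plog path is taken from the log above only as an example, and the age threshold and deletion policy are assumptions for illustration.

```python
#!/usr/bin/env python3
# Rough sketch of a manual plog cleanup while GC is stuck.
# The directory path, age threshold, and deletion policy below are
# assumptions for illustration, not anything provided by Pegasus.
import os
import time

# Example replica plog directory (taken from the error log above).
PLOG_DIR = "/home/work/ssd2/pegasus/alsgsrv-monetization-master/replica/reps/8.53.pegasus/plog"
MAX_AGE_DAYS = 7  # assumed retention window; tune per cluster


def list_stale_plogs(plog_dir: str, max_age_days: int):
    """Return (path, size) for log.* files older than the given age."""
    cutoff = time.time() - max_age_days * 86400
    stale = []
    for name in sorted(os.listdir(plog_dir)):
        if not name.startswith("log."):
            continue
        path = os.path.join(plog_dir, name)
        st = os.stat(path)
        if st.st_mtime < cutoff:
            stale.append((path, st.st_size))
    return stale


if __name__ == "__main__":
    total = 0
    for path, size in list_stale_plogs(PLOG_DIR, MAX_AGE_DAYS):
        total += size
        print(f"stale plog: {path} ({size} bytes)")
        # os.remove(path)  # only delete after confirming the replica no longer needs it
    print(f"total reclaimable: {total / (1 << 30):.1f} GiB")
```

The actual removal is left commented out, since deleting plog files that a replica still needs could break it.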
- What version of Pegasus are you using? Pegasus v2.4
Does the file /home/work/ssd2/pegasus/alsgsrv-monetization-master/replica/reps/8.53.pegasus/plog/log.18129.608864535790 actually exist or not?
When the coredump happened, the file did not actually exist.