forest
GC fixes and improvements.
Issue summary
This issue is a placeholder for a collection of things that need to be checked and improved regarding the GC.
- The GC seems to get stuck on graph traversal, both ordered (single-threaded) and unordered. This might be related to running the GC in the background while the node is running. Occasional CPU spikes suggest that it is doing work, but it takes days and never reaches the cleanup step.
- The GC needs to be properly profiled on a running node. This could be done locally; the important thing is to see what happens when it runs in the background.
Other information and links
As the GC is an important part of a long-running node, we should ensure it is properly monitored. Our own infrastructure node should therefore employ a check (ideally based on a metric) or a set of checks, e.g., at least 1 garbage collection performed in 2 days (to be safe), at least N records cleaned, or something along those lines.
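A minimal sketch of such a check, assuming the node exposes the time since the last completed GC run and the number of records it cleaned. `GcStats`, `GcHealth`, and `check_gc_health` are hypothetical names used for illustration only; they are not part of Forest.

```rust
use std::time::Duration;

/// Hypothetical snapshot of GC metrics (not an existing Forest type).
#[derive(Debug, Clone, Copy)]
pub struct GcStats {
    /// Time elapsed since the last completed GC run; `None` if no run has completed.
    pub since_last_run: Option<Duration>,
    /// Number of records removed by the last completed run.
    pub records_cleaned: u64,
}

/// Outcome of the monitoring check.
#[derive(Debug, PartialEq, Eq)]
pub enum GcHealth {
    Healthy,
    NeverRan,
    Stale,
    TooFewRecordsCleaned,
}

/// Flags the GC as unhealthy if it never ran, if the last run is older than
/// `max_age`, or if it cleaned fewer than `min_records` records.
pub fn check_gc_health(stats: GcStats, max_age: Duration, min_records: u64) -> GcHealth {
    match stats.since_last_run {
        None => GcHealth::NeverRan,
        Some(age) if age > max_age => GcHealth::Stale,
        Some(_) if stats.records_cleaned < min_records => GcHealth::TooFewRecordsCleaned,
        Some(_) => GcHealth::Healthy,
    }
}
```

With `max_age` set to `Duration::from_secs(2 * 24 * 60 * 60)` and `min_records` set to the chosen N, an alert could fire on any result other than `GcHealth::Healthy`.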
We need to verify that this works correctly, then fix it if it doesn't or close it if it does: https://github.com/ChainSafe/forest/issues/3645
This issue is irrelevant to the new GC. There is also no reliable or easy way to verify it, as the GC process takes a long time between its various steps and ultimately removes unreachable data that is older than 2*chain_finality.
There is a different issue, though: the GC does not work properly at the moment. I would keep this issue open until we have a working mark-and-sweep GC.
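The removal rule can be sketched as follows. This is a simplification that assumes each record can be associated with an epoch, which is not necessarily how the GC tracks age; `gc_cutoff_epoch` and `is_gc_candidate` are hypothetical helpers, and `chain_finality` is passed in as a parameter rather than taken from the network configuration.

```rust
/// Epoch number; a signed integer is assumed here for illustration.
pub type ChainEpoch = i64;

/// Epoch below which unreachable data counts as old enough to remove:
/// head - 2 * chain_finality, clamped at zero.
pub fn gc_cutoff_epoch(head: ChainEpoch, chain_finality: ChainEpoch) -> ChainEpoch {
    head.saturating_sub(chain_finality.saturating_mul(2)).max(0)
}

/// A record is a removal candidate only if it is unreachable AND older than the cutoff.
pub fn is_gc_candidate(
    record_epoch: ChainEpoch,
    reachable: bool,
    head: ChainEpoch,
    chain_finality: ChainEpoch,
) -> bool {
    !reachable && record_epoch < gc_cutoff_epoch(head, chain_finality)
}
```

Because both conditions must hold, recently written data is never removed, even if it is not yet reachable from the current head.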
So, should we close #3645?
Let's keep it open for now and close it when the new GC works.
The current status is:
- The GC works on calibnet on a 2 vCPU node with 16 GB RAM.
- The GC seems to break the node syncing on mainnet with the above setup:
2024-06-24T14:50:10.282997Z ERROR forest_filecoin::chain_sync::tipset_syncer: Sync messages check state failed for tipset range
2024-06-24T14:50:10.283110Z ERROR forest_filecoin::chain_sync::tipset_syncer: Syncing tipset range [4023789, 4031140] failed: Validation error: Validation error: Parent state root did not match computed state: bafy2bzacec2cp3bfmv4ddkixmaxssbwxrz2ux6snu3f5sya6osqzeahzjt456 (header), bafy2bzaceafg3f5ndycqg6zpvtnaw23mkskzie2h5cm3lv5teti35koy2qlhi (computed)
- There is a PR up that executes all the removals after iterating over the columns. The above bug is observed with the new code; previously, the GC used to make the node get stuck.
It looks like the issue has to do with database concurrency. To reproduce this, it would help to pause node syncing for the duration of the GC writes (deletes) and see whether that isolates the issue.
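The "remove after iterating" approach mentioned in the last item can be sketched as a two-phase sweep. This is an illustration only: a `BTreeMap` stands in for the database, and `sweep` is a hypothetical function, not the code from the PR.

```rust
use std::collections::{BTreeMap, HashSet};

/// Two-phase sweep over an in-memory stand-in for the database.
/// Phase 1 only reads: it iterates all keys and collects the unreachable ones.
/// Phase 2 only writes: it deletes the collected keys after iteration has finished.
/// Returns the number of removed records.
pub fn sweep(db: &mut BTreeMap<String, Vec<u8>>, reachable: &HashSet<String>) -> usize {
    // Phase 1: collect keys to remove without mutating the store.
    let to_remove: Vec<String> = db
        .keys()
        .filter(|key| !reachable.contains(*key))
        .cloned()
        .collect();

    // Phase 2: perform all removals in one go.
    for key in &to_remove {
        db.remove(key);
    }
    to_remove.len()
}
```

Separating the two phases keeps deletes from interleaving with the iteration itself, at the cost of holding the list of keys to remove in memory.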
We concluded that the most likely cause of the data consistency issues is collisions between truncated hashes.
Therefore, a version that falls back to CIDs is going to be tested on this branch: lemmih/gc-testing. It's memory-consuming, but better than a non-functioning GC.
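To illustrate why truncated hashes can cause this, here is a small sketch that is not taken from Forest. It uses the standard library's `DefaultHasher` and an intentionally tiny 8-bit truncation so that a collision is found quickly; the hash function, the key format, and all function names are assumptions made for the example.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::{Hash, Hasher};

/// Hashes `key` and keeps only the lowest `bits` bits (expects 1..=63).
pub fn truncated_hash(key: &str, bits: u32) -> u64 {
    let mut hasher = DefaultHasher::new();
    key.hash(&mut hasher);
    hasher.finish() & ((1u64 << bits) - 1)
}

/// Searches keys "key-0", "key-1", ... until two distinct keys share a truncated hash.
/// With `bits` bits this terminates after at most 2^bits + 1 keys (pigeonhole principle).
pub fn find_collision(bits: u32) -> (String, String) {
    let mut seen: HashMap<u64, String> = HashMap::new();
    let mut i: u64 = 0;
    loop {
        let key = format!("key-{}", i);
        let hash = truncated_hash(&key, bits);
        if let Some(previous) = seen.get(&hash) {
            return (previous.clone(), key);
        }
        seen.insert(hash, key);
        i += 1;
    }
}

/// Records only `inserted`, then asks whether `probe` appears to be present.
/// Returns (answer from a truncated-hash set, answer from a full-key set).
pub fn membership(inserted: &str, probe: &str, bits: u32) -> (bool, bool) {
    let mut truncated: HashSet<u64> = HashSet::new();
    truncated.insert(truncated_hash(inserted, bits));
    let mut full: HashSet<String> = HashSet::new();
    full.insert(inserted.to_string());
    (truncated.contains(&truncated_hash(probe, bits)), full.contains(probe))
}
```

For a colliding pair returned by `find_collision(8)`, `membership` reports the second key as present in the truncated set even though only the first was recorded, while the full-key set does not. Tracking full keys removes this ambiguity but needs more memory per entry, which is the trade-off described above.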
Once that's confirmed, it would be nice to figure out the next steps in terms of memory optimisation.
Fixed via #4425