cardano-private-testnet-setup

Solution to garbage collection/memory issues

Open mateusap1 opened this issue 3 years ago • 5 comments

After following the tutorial and being able to run everything successfully (thanks for this by the way), I tried closing the "automate.sh" shell to see if I would be able to get it running again later.

When I tried to execute ./run/all.sh, though, I started seeing a lot of errors and I noticed I was running into the same issue mentioned here, more specifically in this section:

Troubleshooting: If you run cardano-cli query tip and the blocks are not advancing or the syncProgress percent is decreasing, it may mean the processes running the nodes are running into garbage collection/memory issues. The author is still researching the cause of this issue. In any event, the best remedy is killing the run node processes, deleting the private-testnet folder and starting over. This garbage collection issue normally happens early in the update process.
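The check described in the quoted troubleshooting note can be sketched as a small shell helper. This is a minimal sketch, assuming the tip JSON carries a numeric "block" field; the function names tip_block and tip_advanced and the MAGIC variable are hypothetical and not part of the repository.

```shell
#!/bin/sh
# Hypothetical helpers, not part of cardano-private-testnet-setup.

# Read tip JSON on stdin and print the value of its "block" field.
# Assumes the field looks like: "block": 120
tip_block() {
  sed -n 's/.*"block"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Succeed only when the second block number is strictly ahead of the first.
tip_advanced() {
  [ "$2" -gt "$1" ]
}

# Example usage (commented out; needs a running node, CARDANO_NODE_SOCKET_PATH,
# and MAGIC set to your testnet magic, which is an assumption here):
#   before=$(cardano-cli query tip --testnet-magic "$MAGIC" | tip_block)
#   sleep 30
#   after=$(cardano-cli query tip --testnet-magic "$MAGIC" | tip_block)
#   tip_advanced "$before" "$after" || echo "tip is not advancing"
```

Comparing two samples taken some seconds apart avoids judging the chain from a single reading, since one query alone cannot show whether blocks are still being produced.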

After researching a little bit more, I started to think the problem could have something to do with the fact I was suddenly running all the nodes at the same time. So I tried, instead of ./run/all.sh, first ./run/node-bft1.sh and then ./run/node-bft2.sh, etc. To my surprise, it actually worked!
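The one-node-at-a-time start described above could look roughly like the following. It is only a sketch: start_staggered and the delay argument are hypothetical, and the only script names taken from this thread are ./run/node-bft1.sh and ./run/node-bft2.sh.

```shell
#!/bin/sh
# Hypothetical helper, not part of the repository.
# Usage: start_staggered DELAY_SECONDS SCRIPT...
# Launches each script in the background, pausing DELAY_SECONDS after each launch.
start_staggered() {
  delay="$1"
  shift
  for script in "$@"; do
    if [ ! -x "$script" ]; then
      # Skip anything that is missing or not executable instead of aborting.
      echo "skipping $script (not executable)" >&2
      continue
    fi
    "$script" &
    echo "started $script (pid $!)"
    sleep "$delay"
  done
}

# Example usage (commented out so the block has no side effects):
#   start_staggered 5 ./run/node-bft1.sh ./run/node-bft2.sh
```

The delay value is a guess; nothing in this thread establishes how long a node needs before the next one can be started safely.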

So I'm submitting this issue because it might be a good idea to update the "troubleshooting" section to warn users to start one node at a time when no nodes are currently running.

mateusap1 avatar Feb 09 '22 21:02 mateusap1

It would also be a good idea for someone else to try reproducing it.

mateusap1 avatar Feb 09 '22 21:02 mateusap1

I tried it again, this time shutting down all the nodes and waiting about 30 minutes before restarting them. It gave me the same issues again. What I think this means is that we always need a node creating new blocks, otherwise the restart will fail. The times I shut down the nodes and ran them again immediately, it worked, but when I waited long enough before starting them, it didn't. So my first suggestion is probably wrong.

mateusap1 avatar Feb 09 '22 22:02 mateusap1

Thanks for the report, @mateusap1. There are many variables to this story unfortunately, some of which are beyond our control. Please provide basic information about your system and the software versions installed.

The current effort revolves around distributing a docker/podman image and making it the de facto recommended way of installation; this way most of the variability will go away. I will let you know once it's deployed. It shouldn't take too long, as the thing is basically already working on my localhost and the only thing missing is a CD pipeline to build the images.

grzegorznowak avatar Feb 10 '22 05:02 grzegorznowak

Hi @mateusap1, thanks for experimenting with things. Others, including myself, have had issues restarting the nodes and picking up where they left off when they were last shut down. I guess I've always run the testnet as a fresh session, so this hasn't bothered me. Also, now that @grzegorznowak is working on adding some install aids, the setup work should be fully automated, which should at least make fresh sessions easy to launch.

On mainnet/testnet, a restarted node can start from persisted chain state, roll back to some known good point, then sync up to the current tip and run from there. It seems like the local testnet should operate the same way, so I'm not sure what is different. I suppose it will take a deep dive into the code to figure things out.

woofpool avatar Feb 10 '22 12:02 woofpool

One observation I'm having now, after seeing how the newly added integration test suite works, is that there's generally no issue with restarting the nodes, provided their processes were shut off completely. The updated version will do that automatically, so it may make rerunning the automate.sh script more consistent.
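For anyone restarting by hand in the meantime, one way to confirm that the old processes are really gone before relaunching is sketched below. This is not how the integration test suite does it; wait_until_gone is a hypothetical helper, and it assumes the node processes belong to the current user (kill -0 also fails on processes you lack permission to signal).

```shell
#!/bin/sh
# Hypothetical helper, not part of the repository.
# Usage: wait_until_gone TIMEOUT_SECONDS PID...
# Returns 0 once none of the PIDs is alive, 1 if any is still alive after the timeout.
wait_until_gone() {
  timeout="$1"
  shift
  elapsed=0
  while [ "$elapsed" -le "$timeout" ]; do
    alive=0
    for pid in "$@"; do
      # kill -0 sends no signal; it only tests whether the process can be signalled.
      if kill -0 "$pid" 2>/dev/null; then
        alive=1
      fi
    done
    if [ "$alive" -eq 0 ]; then
      return 0
    fi
    sleep 1
    elapsed=$((elapsed + 1))
  done
  return 1
}

# Example usage (commented out; assumes the node binary runs as "cardano-node"):
#   wait_until_gone 30 $(pgrep -x cardano-node) || echo "nodes still running"
```

Polling by PID keeps the helper independent of how the nodes were started; the pgrep call in the usage line is only one way to collect the PIDs.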

It might also be worth mentioning that as soon as #11 is merged into main, running the script with 1 as an extra parameter (automate.sh 1) will not flush the underlying blockchain files.

grzegorznowak avatar Feb 12 '22 07:02 grzegorznowak