rippled
rippled locks up when clients use pathfinding (Version: 1.9.1)
When clients connect to the node and use pathfinding, the node soon locks up.
Steps to Reproduce
Run the node and connect some clients via WebSocket using path_find (note: they need to be requesting the path_find from this node). https://xrpl-pathfinding.netlify.app is a simple way to do this.
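For anyone trying to reproduce this without the demo site, here is a minimal sketch of the kind of request involved. The endpoint URL and account addresses are placeholder assumptions, not values from this report; the payload shape follows the public path_find API.

```python
import json

# Assumption: default local WebSocket port; substitute your node's endpoint.
RIPPLED_WS_URL = "ws://localhost:6006"

def make_path_find_create(source: str, dest: str, drops: str) -> str:
    """Build a path_find 'create' request. This subscribes the connection
    to streamed path updates until the request is closed or the
    WebSocket disconnects -- the long-lived subscription is what the
    reporters were exercising."""
    return json.dumps({
        "id": 1,
        "command": "path_find",
        "subcommand": "create",
        "source_account": source,
        "destination_account": dest,
        "destination_amount": drops,  # XRP amounts are strings of drops
    })

# With a real client you would send this over a WebSocket to
# RIPPLED_WS_URL and keep reading the streamed updates.
msg = make_path_find_create(
    "rEXAMPLEsourceAccount111111111111",   # placeholder address
    "rEXAMPLEdestinationAccount1111111",   # placeholder address
    "1000",
)
```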
Expected Result
Node does not lock up.
Actual Result
The node fails with:

After that, the node fails to restart, and `rippled server_info` reports an error.
The only way I have found to fix this is to remove the DB, then restart the node.
Environment
```
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 20.04.4 LTS
Release:        20.04
Codename:       focal
```
rippled 1.9.1 installed from apt.
Looping in @ximinez
Still working on determining the cause here, but if you have fast_load=1 another workaround that doesn't require deleting your DB is to set fast_load=0.
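For reference, a sketch of where the workaround would go in rippled.cfg. To my understanding the fast_load option is read from the [node_db] stanza, but the store type and path below are placeholder assumptions; adapt them to your own config.

```ini
[node_db]
type=NuDB
path=/var/lib/rippled/db/nudb
# Workaround until the crash is fixed: disable fast loading so the
# server can come back up without deleting the database.
fast_load=0
```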
Interesting! Thank you, just deployed that to the pathfinding nodes. Let's see :) That saves some of the hassle.
Yes, I had the same flag enabled here; will be giving that a go as well.
Just to add a disclaimer, I wouldn't use this workaround on any full-history servers or servers that need a lot of history, because I don't know if there are any other side effects.
Yup, I'm running pathfinding on a separate submission node, keeping the more critical nodes separate and only reachable via other validators.
I've run into this again this morning, so I've just flipped to fast_load=0. It brought the node back up as described above :) Thanks for that tip @ximinez
I'm not sure when the next rippled release is scheduled, but I would suggest at minimum adding a comment about pathfinding and fast_load to the default rippled.cfg for anyone else. Hmm, strange: I just looked and don't find any mention of fast_load in the default cfg. Maybe this todo can go away when that default documentation is added to the cfg.
> Just to add a disclaimer, I wouldn't use this workaround on any full-history servers or servers that need a lot of history, because I don't know if there are any other side effects.
Of course :) I realized that. Thanks for the heads up. This indeed keeps the pathfinding nodes restarting after a crash without the node store error. The crash isn't resolved, but at least the servers are back at it in a headless way.
Is this an out of memory (OOM) problem? Or is there something else going on here?
It could well be. I do know my swap fills up. Swap vs. actual memory: I have 126 GB available in this box; the swap is usually 100% full after a while, but there are still many gigs free in physical RAM.
@lathanbritz Same here. No matter how much mem you give it (I tried 1TB 😂) it ends up eating it all and crashes.
You just give it more mem = more time. But it will be OOM killed.
^^ This one has a few hours left.
^^ This one is almost there.
For this reason we have a bunch of them for XRPLCluster: we take them down automatically, restart rippled, have them sync back up, and add them back to the pool. They take turns this way. Crappy, but somewhat functional.
Hi @WietseWind ,
Do you mind sharing some of the requests you have been running? I will try to reproduce it in the lab. Thanks.
Sophia
@sophiax851
The only calls directed to the machines are pathfinding calls, the subscription ones, over WebSocket.
Like what's started here: https://xrpl-pathfinding.netlify.app/ https://xrpl.org/path_find.html
@WietseWind
Since you can reproduce it reliably, I was hoping to get some sample request payloads, or sample addresses and tokens for the request, to try them out in the lab. We could only reproduce it by manually setting up a very complex data model with interwoven trustlines on a single token, but that model eats up memory too quickly to allow any analysis before the host hits OOM. Since you have a realistic case and the growth is gradual, it might be easier to debug. It's ok if it's difficult to share; we can recreate the synthetic data.
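This is not the lab's actual model, but a sketch of how one might generate a synthetic "interwoven trustlines on a single token" topology for testing. Every name and the density knob below are illustrative assumptions, not anything from the original report.

```python
import itertools
import random

def make_trustline_graph(n_accounts: int, density: float, seed: int = 42):
    """Return (accounts, edges) where edges is a set of directed
    trustlines (truster, issuer) over a single token, with roughly
    `density` of all ordered account pairs connected. Dense, interwoven
    graphs give the pathfinder combinatorially many candidate paths,
    which is the load pattern being discussed."""
    rng = random.Random(seed)
    accounts = [f"acct{i}" for i in range(n_accounts)]
    edges = {
        (a, b)
        for a, b in itertools.permutations(accounts, 2)
        if rng.random() < density
    }
    return accounts, edges

# 20 accounts with half of all possible trustlines present.
accounts, edges = make_trustline_graph(20, 0.5)
```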
@sophiax851
MAINNET
```json
{
  "id": "example",
  "command": "path_find",
  "subcommand": "create",
  "source_account": "rThREeXrp54XTQueDowPV1RxmkEAGUmg8",
  "destination_account": "rThREeXrp54XTQueDowPV1RxmkEAGUmg8",
  "destination_amount": "1000"
}
```
should be sufficient to reproduce this. https://github.com/WietseWind/Vue-Pathfinding-Demo/blob/249670cabf51e569baaeba80c62478ebffb66440/src/components/PathFinder.vue#L201
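Since a path_find "create" keeps streaming updates for the life of the connection, well-behaved clients pair it with a "close". A small sketch of the two message shapes, assuming the documented create/close subcommands; the id values and helper name are mine:

```python
import json

def path_find_msg(subcommand: str, **fields) -> dict:
    """Build a path_find request: 'create' starts streaming path
    updates on this connection, 'close' stops the currently open
    pathfinding request."""
    msg = {"id": subcommand, "command": "path_find", "subcommand": subcommand}
    msg.update(fields)
    return msg

create = path_find_msg(
    "create",
    source_account="rThREeXrp54XTQueDowPV1RxmkEAGUmg8",
    destination_account="rThREeXrp54XTQueDowPV1RxmkEAGUmg8",
    destination_amount="1000",
)
close = path_find_msg("close")

# Over a real WebSocket you would send json.dumps(create), consume the
# streamed updates, and send json.dumps(close) when done, rather than
# leaving the subscription open indefinitely.
```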
@lathanbritz Thanks for the sample. We've found the source of the memory growth and are working on a solution. Will keep you updated.
Woahhhh!!! THANK YOU! That's AWESOME! :D
https://github.com/XRPLF/rippled/pull/4822 contains a fix that we're planning to release in 2.0.1 (within a month of 2.0.0). If anyone is able to build that branch and run it on some of their path finding nodes, I'd love to get feedback about how well it works.
We conducted multiple rounds of testing and memory profiling to ensure that all objects created during the path_find request were cleared after the RPC request's exit and the termination of the WebSocket connection. Prior to the fix, using our test case, there was approximately 10GB of heap growth after just 21 path_find calls, resulting in about 12GB of RAM growth. With the fix in place, none of the previously accumulated objects show up in the memory snapshots taken during the test. So we believe the issue has been fixed, but it'd be good for @WietseWind and @lathanbritz to also try it out and confirm, since you have dedicated servers running this with a broader range of requests.
More precisely, the memory is cleared after each round of path updates, so even with long-lasting ongoing requests, rippled should not accumulate memory.