Rippled locks up when clients use pathing (Version: 1.9.1)

Open shortthefomo opened this issue 3 years ago • 7 comments

When clients connect to the node and use pathing, the node soon locks up.

Steps to Reproduce

Run a node and connect some clients via WebSocket that issue path_find requests (note that they need to be requesting path_find from this node). https://xrpl-pathfinding.netlify.app is a simple way to do this.

Expected Result

Node does not lock up.

Actual Result

Node fails with:

[screenshot: Screen Shot 2022-07-07 at 18 38 19]

After that, the node fails to restart and rippled server_info reports an error. The only way I have found to fix this is to remove the DB and then restart the node.

Environment

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal

rippled 1.9.1 installed from apt.

shortthefomo avatar Jul 07 '22 23:07 shortthefomo

Looping in @ximinez

WietseWind avatar Jul 08 '22 07:07 WietseWind

Still working on determining the cause here, but if you have fast_load=1 another workaround that doesn't require deleting your DB is to set fast_load=0.
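A minimal sketch of applying this workaround to the text of a config file, in Python. It assumes only that the setting appears as an uncommented `fast_load=1` line somewhere in rippled.cfg; the function name is hypothetical, and this thread does not say which section the setting belongs to.

```python
import re

# Matches an uncommented "fast_load=1" line, tolerating spaces or tabs
# around the key, the "=" and the value. Commented lines are left alone.
_FAST_LOAD_ON = re.compile(r"^([ \t]*fast_load[ \t]*=[ \t]*)1[ \t]*$", re.MULTILINE)


def disable_fast_load(cfg_text: str) -> str:
    """Return cfg_text with every uncommented fast_load=1 set to fast_load=0."""
    return _FAST_LOAD_ON.sub(r"\g<1>0", cfg_text)


if __name__ == "__main__":
    sample = "fast_load=1\n# fast_load=1\n"
    print(disable_fast_load(sample), end="")
```

Back up the config file before rewriting it, and restart rippled afterwards so the change takes effect.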

ximinez avatar Jul 08 '22 22:07 ximinez

> Still working on determining the cause here, but if you have fast_load=1 another workaround that doesn't require deleting your DB is to set fast_load=0.

Interesting! Thank you, just deployed that to the pathfinding nodes. Let's see :) That saves some of the hassle.

WietseWind avatar Jul 09 '22 00:07 WietseWind

Yes, I had the same flag enabled here. I'll be giving that a go as well.

shortthefomo avatar Jul 11 '22 22:07 shortthefomo

Just to add a disclaimer, I wouldn't use this workaround on any full-history servers or servers that need a lot of history, because I don't know if there are any other side effects.

ximinez avatar Jul 12 '22 16:07 ximinez

Yup, I'm running pathing on a separate submission node, keeping the more critical nodes separate and only reachable via other validators.

I ran into this again this morning, so I've just flipped to fast_load=0. It has brought the node back up as described above :) Thanks for that tip, @ximinez.

I'm not sure when the next rippled release is scheduled, but I would suggest at minimum adding this note about pathing and fast_load to the default rippled.cfg for anyone else. Hmm, strange: I just looked for it and can't find any mention of fast_load in the default cfg. Maybe this todo can go in when that default documentation is added to the cfg.

shortthefomo avatar Jul 13 '22 14:07 shortthefomo

> Just to add a disclaimer, I wouldn't use this workaround on any full-history servers or servers that need a lot of history, because I don't know if there are any other side effects.

Of course :) I realized that. Thanks for the heads up. This indeed keeps the pathfinding nodes restarting when crashing without the node store error. Thanks. Crash not resolved, but at least the servers are back at it in a headless way.

WietseWind avatar Jul 13 '22 15:07 WietseWind

Is this an out of memory (OOM) problem? Or is there something else going on here?

intelliot avatar Sep 08 '23 05:09 intelliot

It could well be. I do know that my swap fills up rather than actual memory. I have 126 GB available in this box; swap is usually at 100% after a while, but there are still many gigs free in physical RAM.

[image]

shortthefomo avatar Oct 21 '23 02:10 shortthefomo

@lathanbritz Same here. No matter how much mem you give it (I tried 1TB 😂) it ends up eating it all and crashes.

You just give it more mem = more time. But it will be OOM killed.

image

^^ This one has a few hours left.

image

^^ This one is almost there.

For this reason we have a bunch of them for XRPLCluster and we take them down automatically, restart rippled and have them sync back up, add them back to the pool. They take turns this way. Crappy but somewhat functional.
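A rough sketch of the take-turns idea in Python, not the actual XRPLCluster tooling. The `Node` type, the 90% threshold and the minimum pool size are all hypothetical; the function only decides which node to recycle next and leaves the restart and pool changes to the caller.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    mem_used_fraction: float  # 0.0 to 1.0, share of memory currently in use
    in_pool: bool = True      # whether the node is currently serving clients


def pick_node_to_restart(
    nodes: List[Node], threshold: float = 0.9, min_in_pool: int = 2
) -> Optional[Node]:
    """Pick the fullest serving node at or above threshold.

    Returns None when no node qualifies, or when taking one out would
    leave fewer than min_in_pool nodes serving.
    """
    serving = [n for n in nodes if n.in_pool]
    if len(serving) <= min_in_pool:
        return None
    candidates = [n for n in serving if n.mem_used_fraction >= threshold]
    return max(candidates, key=lambda n: n.mem_used_fraction, default=None)
```

Restarting one node at a time, fullest first, keeps the rest of the pool serving while the restarted node syncs back up.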

WietseWind avatar Oct 21 '23 21:10 WietseWind

@sophiax851

The only calls directed to the machines are pathfinding calls, the subscription-ones, over Websocket.

Like what's started here: https://xrpl-pathfinding.netlify.app/ https://xrpl.org/path_find.html

WietseWind avatar Oct 23 '23 11:10 WietseWind

@WietseWind

Since you can reproduce it reliably, I was hoping to get some sample request payloads, or sample addresses and tokens for the request, to try them out in the lab. We could only reproduce it by manually setting up a very complex data model with interwoven trustlines on a single token, but this model eats up memory too quickly for us to conduct any analysis before the host hits the OOM. Since you have a realistic case and the growth is gradual, it might be easier to debug. It's OK if it's difficult to share; we can recreate the synthetic data.

sophiax851 avatar Oct 23 '23 16:10 sophiax851

@sophiax851

MAINNET

{
    "id": "example",
    "command": "path_find",
    "subcommand": "create",
    "source_account": "rThREeXrp54XTQueDowPV1RxmkEAGUmg8",
    "destination_account": "rThREeXrp54XTQueDowPV1RxmkEAGUmg8",
    "destination_amount": "1000"
}

should be sufficient to produce this. https://github.com/WietseWind/Vue-Pathfinding-Demo/blob/249670cabf51e569baaeba80c62478ebffb66440/src/components/PathFinder.vue#L201
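For anyone scripting this instead of using the demo page, here is a small Python sketch that builds the same request plus the matching `close` request. The helper names are hypothetical, and sending the JSON over a WebSocket connection is left to whichever client library you use.

```python
import json


def path_find_create(source_account, destination_account,
                     destination_amount, request_id="example"):
    """Build a path_find 'create' request as a JSON string.

    A string destination_amount such as "1000" is an amount of XRP in drops.
    """
    return json.dumps({
        "id": request_id,
        "command": "path_find",
        "subcommand": "create",
        "source_account": source_account,
        "destination_account": destination_account,
        "destination_amount": destination_amount,
    })


def path_find_close(request_id="example"):
    """Build the path_find 'close' request that ends the open path_find."""
    return json.dumps({
        "id": request_id,
        "command": "path_find",
        "subcommand": "close",
    })


if __name__ == "__main__":
    account = "rThREeXrp54XTQueDowPV1RxmkEAGUmg8"
    print(path_find_create(account, account, "1000"))
    print(path_find_close())
```

Sending the `create` payload starts the stream of path updates on that connection, which is the workload described in this issue.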

shortthefomo avatar Oct 23 '23 18:10 shortthefomo

@lathanbritz Thanks for the sample. We've found the source of the memory growth and are working on a solution. Will keep you updated.

sophiax851 avatar Nov 01 '23 20:11 sophiax851

> @lathanbritz Thanks for the sample. We've found the source of the memory growth and are working on a solution. Will keep you updated.

Woahhhh!!! THANK YOU! That's AWESOME! :D

WietseWind avatar Nov 01 '23 21:11 WietseWind

https://github.com/XRPLF/rippled/pull/4822 contains a fix that we're planning to release in 2.0.1 (within a month of 2.0.0). If anyone is able to build that branch and run it on some of their path finding nodes, I'd love to get feedback about how well it works.

ximinez avatar Nov 29 '23 00:11 ximinez

We conducted multiple rounds of testing and memory profiling to ensure that all objects created during the path_find request were cleared after the RPC request exited and the WebSocket connection was terminated. Prior to the fix, using our test case, there was approximately 10GB of heap growth after just 21 path_find calls, resulting in about 12GB of RAM growth. With the fix in place, none of the previously accumulated objects show up in the memory snapshots taken during the test. So we believe the issue has been fixed, but it would be good for @WietseWind and @lathanbritz to also try it out and confirm, since you have dedicated servers running this with a broader range of requests.
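To put those pre-fix numbers in perspective, a back-of-the-envelope calculation in Python. It assumes growth is linear per call, which the test did not establish, and the 126 GB figure is borrowed from a different machine mentioned earlier in this thread, so treat the result as an illustration only.

```python
def calls_until_exhausted(growth_gb, calls, available_gb):
    """Return (growth per call in GB, whole calls that fit in available_gb).

    Assumes every call leaks the same amount, which is a simplification.
    """
    per_call = growth_gb / calls
    return per_call, int(available_gb // per_call)


if __name__ == "__main__":
    # 10 GB of heap growth over 21 calls, on a box with 126 GB available.
    per_call, n = calls_until_exhausted(10, 21, 126)
    print(f"{per_call:.2f} GB per call, about {n} calls to fill memory")
```

At roughly half a gigabyte per call, adding memory only delays exhaustion, which matches the reports above.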

sophiax851 avatar Nov 29 '23 00:11 sophiax851

Actually, to be more precise: the memory was cleared after each round of path updating, so even with long-lasting ongoing requests, rippled should not accumulate memory.

sophiax851 avatar Nov 29 '23 01:11 sophiax851