Rippled locks up when clients use pathing (Version: 1.9.1)

When clients connect to the node and use pathing, the node soon locks up.

Steps to Reproduce

Run a node and connect some clients via WebSocket using path_find (note that they need to be requesting path_find from this node). https://xrpl-pathfinding.netlify.app is a simple way to do this.

Expected Result

Node does not lock up.

Actual Result

Node fails with the error shown in the screenshot: [Screen Shot 2022-07-07 at 18 38 19]

The node then fails to restart, and rippled server_info reports an error. The only way I have found to fix this is to remove the DB and then restart the node.

Environment

No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 20.04.4 LTS
Release: 20.04
Codename: focal

rippled 1.9.1 installed from apt.

Jul 07 '22 23:07 shortthefomo

Looping in @ximinez

Jul 08 '22 07:07 WietseWind

Still working on determining the cause here, but if you have fast_load=1 another workaround that doesn't require deleting your DB is to set fast_load=0.

Jul 08 '22 22:07 ximinez
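A minimal sketch of the workaround described above, for anyone who wants to script the change across several nodes. It only rewrites an existing fast_load=1 line to fast_load=0 in the text of a config file. The helper name disable_fast_load and the sample excerpt (including the [node_db] stanza and its other keys) are assumptions made for illustration, not taken from this thread.

```python
import re

def disable_fast_load(cfg_text: str) -> str:
    """Return cfg_text with any `fast_load=1` line rewritten to `fast_load=0`."""
    # Match the key on its own line, tolerating spaces or tabs around '='.
    pattern = re.compile(r"^([ \t]*fast_load[ \t]*=[ \t]*)1[ \t]*$", re.MULTILINE)
    return pattern.sub(r"\g<1>0", cfg_text)

# Hypothetical excerpt; a real rippled.cfg contains many other settings.
sample = "[node_db]\ntype=NuDB\npath=/var/lib/rippled/db/nudb\nfast_load=1\n"
print(disable_fast_load(sample))
```

The function leaves every other line untouched, so running it on a config that already has fast_load=0 (or no fast_load line at all) returns the text unchanged. rippled would still need a restart to pick up the new value.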

Still working on determining the cause here, but if you have fast_load=1 another workaround that doesn't require deleting your DB is to set fast_load=0.

Interesting! Thank you, just deployed that to the pathfinding nodes. Let's see :) That saves some of the hassle.

Jul 09 '22 00:07 WietseWind

Yes, I had the same flag enabled here. I will be giving that a go as well.

Jul 11 '22 22:07 shortthefomo

Just to add a disclaimer, I wouldn't use this workaround on any full-history servers or servers that need a lot of history, because I don't know if there are any other side effects.

Jul 12 '22 16:07 ximinez

Yup, I'm running pathing on a separate submission node, keeping the more critical nodes separate and only reachable via other validators.

I ran into this again this morning, so I've just flipped to fast_load=0. It has brought the node back up as described above :) Thanks for that tip @ximinez

I'm not sure when the next rippled release is scheduled, but I would suggest at minimum adding this comment about pathing and fast_load to the default rippled.cfg for anyone else. Hmm, strange: I just looked for that and don't find any mention of fast_load in the default cfg. Maybe this todo can go in when that default documentation is added to the cfg.

Jul 13 '22 14:07 shortthefomo

Just to add a disclaimer, I wouldn't use this workaround on any full-history servers or servers that need a lot of history, because I don't know if there are any other side effects.

Of course :) I realized that. Thanks for the heads up. This indeed keeps the pathfinding nodes restarting when crashing without the node store error. Thanks. Crash not resolved, but at least the servers are back at it in a headless way.

Jul 13 '22 15:07 WietseWind

Is this an out of memory (OOM) problem? Or is there something else going on here?

Sep 08 '23 05:09 intelliot

It could well be. I do know that it is my swap that fills up rather than actual memory. I have 126 GB available in this box; the swap is usually at 100% after a while, but I still have many gigabytes free in physical RAM.

Oct 21 '23 02:10 shortthefomo

@lathanbritz Same here. No matter how much mem you give it (I tried 1TB 😂) it ends up eating it all and crashes.

You just give it more mem = more time. But it will be OOM killed.

^^ This one has a few hours left.

^^ This one is almost there.

For this reason we have a bunch of them for XRPLCluster and we take them down automatically, restart rippled and have them sync back up, add them back to the pool. They take turns this way. Crappy but somewhat functional.

Oct 21 '23 21:10 WietseWind
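The take-turns approach described above could be sketched roughly as follows. This is not the XRPLCluster implementation; the Node type, the 85% memory threshold, and the minimum pool size are all hypothetical values chosen for illustration. The sketch only decides which nodes to pull for a restart; how memory is measured and how a node is removed from or returned to the pool is left out.

```python
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    mem_used_fraction: float  # 0.0 to 1.0, as reported by some monitor (assumed)
    in_pool: bool = True

def pick_nodes_to_restart(nodes, threshold=0.85, min_in_pool=2):
    """Choose serving nodes at or over the memory threshold, fullest first,
    while always leaving at least `min_in_pool` nodes serving traffic."""
    serving = [n for n in nodes if n.in_pool]
    candidates = sorted(
        (n for n in serving if n.mem_used_fraction >= threshold),
        key=lambda n: n.mem_used_fraction,
        reverse=True,
    )
    # Never pull so many nodes that the pool drops below the minimum.
    allowed = max(0, len(serving) - min_in_pool)
    return candidates[:allowed]

nodes = [Node("a", 0.95), Node("b", 0.90), Node("c", 0.40)]
print([n.name for n in pick_nodes_to_restart(nodes)])
```

With three serving nodes and a minimum pool of two, only the fullest node is selected even though two are over the threshold; the second would be picked up on a later pass once the first has synced and rejoined.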

Hi @WietseWind ,

Do you mind sharing some of the requests you have been running? I will try to reproduce it in the lab. Thanks.

Sophia


Oct 22 '23 04:10 sophiax851

@sophiax851

The only calls directed to the machines are pathfinding calls, the subscription-ones, over Websocket.

Like what's started here: https://xrpl-pathfinding.netlify.app/ https://xrpl.org/path_find.html

Oct 23 '23 11:10 WietseWind

@WietseWind

Since you can reproduce it reliably, I was hoping to get some sample request payloads, or sample addresses and tokens for the requests, to try them out in the lab. We could only reproduce it by manually setting up a very complex data model with interwoven trustlines on a single token, but this model eats up memory too quickly to allow us to conduct any analysis before the host hits the OOM. Since you have a realistic case and the growth is gradual, it might be easier to debug. It's OK if it's difficult to share; we can recreate the synthetic data.

Oct 23 '23 16:10 sophiax851

@sophiax851

MAINNET

{
    "id": "example",
    "command": "path_find",
    "subcommand": "create",
    "source_account": "rThREeXrp54XTQueDowPV1RxmkEAGUmg8",
    "destination_account": "rThREeXrp54XTQueDowPV1RxmkEAGUmg8",
    "destination_amount": "1000"
}

should be sufficient to produce this. https://github.com/WietseWind/Vue-Pathfinding-Demo/blob/249670cabf51e569baaeba80c62478ebffb66440/src/components/PathFinder.vue#L201

Oct 23 '23 18:10 shortthefomo
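For anyone scripting the reproduction, the request above can be built programmatically. The sketch below only constructs the JSON messages; sending them as text frames over a WebSocket connection is left to whichever client library is in use. The helper names path_find_create and path_find_close and the "example-close" id are hypothetical. The close subcommand is the documented way to stop an open path_find subscription on the same connection.

```python
import json

def path_find_create(source, destination, amount, request_id="example"):
    """Build a path_find 'create' request like the one posted in this thread."""
    return {
        "id": request_id,
        "command": "path_find",
        "subcommand": "create",
        "source_account": source,
        "destination_account": destination,
        "destination_amount": amount,  # a string amount, as in the sample above
    }

def path_find_close(request_id="example-close"):
    """Build the request that stops the currently open path_find subscription."""
    return {"id": request_id, "command": "path_find", "subcommand": "close"}

ACCOUNT = "rThREeXrp54XTQueDowPV1RxmkEAGUmg8"  # account from the sample above
create_msg = json.dumps(path_find_create(ACCOUNT, ACCOUNT, "1000"))
close_msg = json.dumps(path_find_close())
print(create_msg)
print(close_msg)
```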

@lathanbritz Thanks for the sample. We've found the source of the memory growth and are working on a solution. Will keep you updated.

Nov 01 '23 20:11 sophiax851

@lathanbritz Thanks for the sample. We've found the source of the memory growth and are working on a solution. Will keep you updated.

Woahhhh!!! THANK YOU! That's AWESOME! :D

Nov 01 '23 21:11 WietseWind

https://github.com/XRPLF/rippled/pull/4822 contains a fix that we're planning to release in 2.0.1 (within a month of 2.0.0). If anyone is able to build that branch and run it on some of their path finding nodes, I'd love to get feedback about how well it works.

Nov 29 '23 00:11 ximinez

We conducted multiple rounds of testing and memory profiling to ensure that all objects created during the path_find request were cleared after the RPC request's exit and the termination of the WebSocket connection. Prior to the fix, using our test case, there was approximately 10GB of heap growth after just 21 path_find calls, resulting in about 12GB of RAM growth. With the fix in place, none of the previously accumulated objects are showing up in the memory snapshots taken during the test. So we believe the issue has been fixed, but it would be good for @WietseWind and @lathanbritz to also try it out and confirm, since you have dedicated servers running this with a broader range of requests.

Nov 29 '23 00:11 sophiax851

Actually, more precisely speaking, the memory was cleared after each round of the path updating, so even with long-lasting ongoing requests, rippled should not have memory accumulation.

Nov 29 '23 01:11 sophiax851
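For a sense of scale, the pre-fix figures quoted above (about 10GB of heap growth after 21 path_find calls) work out to a little under half a gigabyte per call. The sketch below does that arithmetic and extrapolates linearly to the 126GB box mentioned earlier in the thread. The linear extrapolation is an assumption, and the 10GB figure came from a synthetic lab test case, so real workloads will behave differently; the function name is hypothetical.

```python
# Figures quoted in the thread for the pre-fix test case.
heap_growth_gb = 10.0
calls = 21

# Average heap growth per path_find call, roughly 0.48 GB.
per_call_gb = heap_growth_gb / calls

def calls_until_exhausted(ram_gb, growth_per_call_gb):
    """Whole calls that fit before growth exceeds ram_gb,
    assuming (hypothetically) that growth is linear and nothing is freed."""
    return int(ram_gb // growth_per_call_gb)

print(round(per_call_gb, 3))
print(calls_until_exhausted(126, per_call_gb))
```

This is consistent with the observation earlier in the thread that adding memory only buys time: under this rate, doubling RAM merely doubles the number of calls before the process is OOM killed.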
