[BUG] Long-running conductor eventually jumps to consume 100% cpu and stops logging
Describe the bug
I need to test this in the latest 0.3.x and see if it still occurs, but I just wanted to tentatively flag it here anyway.
- Run the holochain conductor for ~1.75 hours (works great, CPU usage of the holochain processes fluctuates around 1-2%)
- CPU usage of holochain processes jumps up to 99-100% consistently
- Messages like these are repeated in the logs for several minutes:
Sep 5 19:31:12 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:12.094  WARN tx5_go_pion_sys: Failed to read from candidate udp4 relay 137.184.154.88:58982 related 0.0.0.0:32957: read udp 137.184.154.88:58982: use of closed network connection
Sep 5 19:31:12 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:12.096  INFO tx5_go_pion_sys: Setting new connection state: Failed
Sep 5 19:31:12 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:12.096  INFO tx5_go_pion_sys: failed to send packet: write udp 167.172.130.240:42026->192.222.206.57:53029: use of closed network connection
Sep 5 19:31:12 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:12.096  INFO tx5_go_pion_sys: ICE connection state changed: failed
Sep 5 19:31:12 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:12.097  INFO tx5_go_pion_sys: peer connection state changed: failed
Sep 5 19:31:16 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:16.437  INFO tx5_go_pion_sys: Setting new connection state: Disconnected
Sep 5 19:31:16 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:16.438  INFO tx5_go_pion_sys: ICE connection state changed: disconnected
Sep 5 19:31:16 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:16.438  INFO tx5_go_pion_sys: peer connection state changed: disconnected
Sep 5 19:31:24 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:24.950  INFO tx5_go_pion_sys: Setting new connection state: Disconnected
Sep 5 19:31:24 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:24.951  INFO tx5_go_pion_sys: ICE connection state changed: disconnected
Sep 5 19:31:24 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:24.951  INFO tx5_go_pion_sys: peer connection state changed: disconnected
Sep 5 19:31:32 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:32.430  WARN tx5_go_pion_sys: discard message from (udp4 prflx 72.229.143.50:60568 related :0), unknown TransactionID 0xf4e695f9aab29b15609af8d4
Sep 5 19:31:32 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:32.431  WARN tx5_go_pion_sys: discard message from (udp4 prflx 72.229.143.50:60568 related :0), unknown TransactionID 0x6eb23d7efa84915d6813383e
Sep 5 19:31:32 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:32.431  WARN tx5_go_pion_sys: discard message from (udp4 prflx 72.229.143.50:60568 related :0), unknown TransactionID 0x0c47031c7397a693af9fc777
Sep 5 19:31:32 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:32.432  WARN tx5_go_pion_sys: discard message from (udp4 prflx 72.229.143.50:60568 related :0), unknown TransactionID 0xfa2fa38934c6929bf7c889f6
Sep 5 19:31:32 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:32.953  INFO tx5_go_pion_sys: Setting new connection state: Connected
Sep 5 19:31:32 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:32.953  INFO tx5_go_pion_sys: ICE connection state changed: connected
Sep 5 19:31:32 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:32.954  INFO tx5_go_pion_sys: peer connection state changed: connected
Sep 5 19:31:40 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:40.448  WARN tx5_go_pion_sys: Failed to read from candidate udp4 relay 137.184.154.88:58157 related 0.0.0.0:40617: read udp 137.184.154.88:58157: use of closed network connection
Sep 5 19:31:40 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:40.448  INFO tx5_go_pion_sys: Setting new connection state: Failed
Sep 5 19:31:40 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:40.449  INFO tx5_go_pion_sys: failed to send packet: write udp 167.172.130.240:56218->191.103.86.39:48445: use of closed network connection
Sep 5 19:31:40 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:40.449  INFO tx5_go_pion_sys: ICE connection state changed: failed
Sep 5 19:31:40 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:40.449  INFO tx5_go_pion_sys: peer connection state changed: failed
Sep 5 19:31:46 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:46.958  INFO tx5_go_pion_sys: Setting new connection state: Disconnected
Sep 5 19:31:46 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:46.959  INFO tx5_go_pion_sys: ICE connection state changed: disconnected
Sep 5 19:31:46 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:31:46.959  INFO tx5_go_pion_sys: peer connection state changed: disconnected
Sep 5 19:32:12 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:32:12.967  WARN tx5_go_pion_sys: Failed to read from candidate udp4 relay 137.184.154.88:50816 related 0.0.0.0:35370: read udp 137.184.154.88:50816: use of closed network connection
Sep 5 19:32:12 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:32:12.968  INFO tx5_go_pion_sys: Setting new connection state: Failed
Sep 5 19:32:12 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:32:12.969  INFO tx5_go_pion_sys: failed to send packet: write udp 167.172.130.240:49210->72.229.143.50:60568: use of closed network connection
Sep 5 19:32:12 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:32:12.969  INFO tx5_go_pion_sys: ICE connection state changed: failed
Sep 5 19:32:12 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:32:12.970  INFO tx5_go_pion_sys: peer connection state changed: failed
Sep 5 19:33:43 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:33:43.726  INFO tx5_go_pion_sys: Setting new connection state: Disconnected
Sep 5 19:33:43 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:33:43.727  INFO tx5_go_pion_sys: ICE connection state changed: disconnected
Sep 5 19:33:43 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:33:43.727  INFO tx5_go_pion_sys: peer connection state changed: disconnected
Sep 5 19:34:09 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:34:09.735  WARN tx5_go_pion_sys: Failed to read from candidate udp4 relay 137.184.154.88:50534 related 0.0.0.0:53610: read udp 137.184.154.88:50534: use of closed network connection
Sep 5 19:34:09 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:34:09.736  INFO tx5_go_pion_sys: Setting new connection state: Failed
Sep 5 19:34:09 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:34:09.736  INFO tx5_go_pion_sys: failed to send packet: write udp 167.172.130.240:46808->174.208.165.91:4256: use of closed network connection
Sep 5 19:34:09 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:34:09.736  INFO tx5_go_pion_sys: ICE connection state changed: failed
Sep 5 19:34:09 ubuntu-s-1vcpu-512mb-10gb-nyc1-01 run-mewsfeed-service.sh[1697]: Sep 05 19:34:09.736  INFO tx5_go_pion_sys: peer connection state changed: failed
- After a few minutes of that, it stops generating any new log messages, but CPU usage remains at 100%
- Upon restarting the conductor, it returns to normal CPU usage and behavior
Expected behavior
Long-running conductors continue to behave normally indefinitely.
System information:
- OS: Ubuntu 22.04
- Browser : n/a
- Holochain and HDK Version: 0.2.1
Steps to reproduce
- Download mewsfeed happ: https://github.com/GeekGene/mewsfeed/releases/download/v0.10.0-rc.4/mewsfeed.happ
- Download holochain-runner: https://github.com/lightningrodlabs/holochain-runner/releases/download/v0.7.6/holochain-runner-x86_64-unknown-linux-gnu.tar.gz
- Run
echo "pass" | RUST_LOG=info ./holochain-runner $(pwd)/mewsfeed.happ --keystore-path $(pwd)/keystore $(pwd)/databases - Monitor CPU usage for a few hours, when you see it at 100%, check if holochain is still producing logs
Screenshots n/a
Additional context n/a
We've had similar issues during DWeb prep with the Emergence app under load and running HoloFuel for longer periods of time. Though in both of those cases CPU usage dropped to near 0% when the logs stopped.
My best guess is that these are starvation problems. We're not seeing much evidence of outright deadlocks, but there are critical resources that, if held longer than expected, could lock up the whole system. I've added quite a few timeouts intended to let problems resolve themselves, which has helped (a sketch of that pattern is included below). I've also worked on reducing the complexity of the code that uses semaphores around database access. There have been some other improvements intended to make these freezes less likely.
Clearly there's still more needed!
I'll have a go at reproducing this with the steps above, and maybe something will stand out as needing attention that we haven't looked at yet.
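For illustration, here is a minimal sketch of the timeout-around-a-semaphore mitigation mentioned above, assuming tokio. This is not the actual conductor code; the function name, the 10-second budget, and the pool size of 4 are made up for the example. The point is simply that a waiter gives up and surfaces an error instead of queueing forever behind a stuck task, which is the difference between one slow query and a system-wide stall.

```rust
// Hypothetical sketch, not the real conductor code: bound how long a task
// will wait for a semaphore-guarded database slot so a slow holder cannot
// stall everything queued behind it.
use std::sync::Arc;
use std::time::Duration;
use tokio::sync::Semaphore;
use tokio::time::timeout;

async fn with_db_permit<T, F, Fut>(sem: Arc<Semaphore>, work: F) -> Result<T, String>
where
    F: FnOnce() -> Fut,
    Fut: std::future::Future<Output = T>,
{
    // Give up after 10s instead of waiting indefinitely for a permit.
    let permit = timeout(Duration::from_secs(10), sem.acquire_owned())
        .await
        .map_err(|_| "timed out waiting for a database permit".to_string())?
        .map_err(|_| "semaphore closed".to_string())?;
    let out = work().await;
    drop(permit); // release the slot for the next waiter
    Ok(out)
}

#[tokio::main]
async fn main() {
    // Assume the minimum pool of 4 read connections, as on a small machine.
    let sem = Arc::new(Semaphore::new(4));
    let answer = with_db_permit(sem, || async { 21 * 2 }).await;
    println!("{answer:?}");
}
```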
> My best guess is that these are starvation problems. We're not seeing much evidence of outright deadlocks, but there are critical resources that, if held longer than expected, could lock up the whole system. I've added quite a few timeouts intended to let problems resolve themselves, which has helped. I've also worked on reducing the complexity of the code that uses semaphores around database access. There have been some other improvements intended to make these freezes less likely.
Interesting. Maybe I should also note that I'm on a pretty low-power VPS with just 1 GB of RAM, a 1 GB swap file, and a single slow CPU.
I observe the same on my 2018 Intel MacBook Pro after running Acorn consistently for around 2 hours. I can't confirm whether the logs are the same, but I could look into that.
I was running three VMs (Debian 12, Windows 10, and Windows 11, each with 8 GB RAM and 4 cores) with Acorn, and they were all using 100% CPU. I believe this is the same issue, but I can also test it if you let me know the best way of going about that, as well as how to record data.
This has now been reported by a couple of people within the Holochain org too. I haven't had any success yet reproducing it, but I'm still working on it. It's probably going to take a few goes to narrow down what is going wrong.
It certainly seems like this isn't an uncommon issue, though. It would be useful to hear:
- whether anybody runs into it on 0.3.x dev builds, or
- from anyone who has logs covering as long a period as possible from before this happens.
Even if I reproduce it, cross-referencing logs is a useful way to narrow the search space, assuming we're lucky enough that the logs give a hint about what is wrong.
I reproduced the issue on 0.2.1 while taking logs of the conductor and top output.
Conductor logs conductor-logs.log
Output of top | grep holochain with datetime prepended:
top-output.log
The interesting part is at 2023-09-18T00:47:20Z, when the top output starts to show CPU usage jumping to 90-100% and sitting there for the rest of the run (I left it running for another 2 hours). The holochain conductor logs for another 5 minutes and then stops.
I'll try to reproduce on the latest holochain 0.3.x-beta-dev.17 once it gets included in holonix.
Thanks @mattyg, that's really useful. There are a few issues in that log, some of which are normal, others that look like good hints.
- The database being that overloaded is a concern. Given this is a small machine, we can assume the minimum database connection pool size of 4, so seeing 1100%+ means we have 40+ read queries waiting for access to the DHT database. That alone will cause other things to back up while they wait on the database.
- It looks like the validation receipt workflow is having some trouble. Not finding peers to send to is fine in itself, but it retries 60 times on each run. That's a reasonably slow loop, so it probably isn't causing the CPU usage problem on its own, but it does look like it could use some optimization. The cost is then multiplied by the number of receipts to be sent, because they're pushed one by one and 'no peers' isn't treated as fatal for that workflow run (see the sketch after this list).
- Knowing we don't have the peers we need to work on validation, the errors we're seeing in sys validation are probably relevant too. If we don't have anybody to gossip with, we presumably won't receive the dependencies we need for validation to proceed, so the failure behavior of sys validation is worth digging into as well.
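To make the second bullet above concrete, here is a hypothetical sketch of the shape of that loop; the types and function names are illustrative stand-ins, not the real workflow code. It shows how pushing receipts one by one, with up to 60 retries each and 'no peers' treated as retryable rather than fatal, turns a single unreachable-network condition into receipts x 60 slow iterations.

```rust
// Illustrative stand-ins only; not the conductor's actual types or workflow.
struct Receipt;
struct Peer;

async fn find_peer() -> Option<Peer> {
    // In the failing scenario there is nobody to send to.
    None
}

async fn send_receipt(_peer: &Peer, _receipt: &Receipt) -> Result<(), ()> {
    Ok(())
}

async fn send_pending_receipts(receipts: Vec<Receipt>) {
    const MAX_RETRIES: usize = 60;
    for receipt in receipts {
        for _attempt in 0..MAX_RETRIES {
            match find_peer().await {
                Some(peer) => {
                    if send_receipt(&peer, &receipt).await.is_ok() {
                        break; // delivered, move on to the next receipt
                    }
                }
                // Treating "no peers" as retryable rather than fatal for the
                // whole run is what multiplies the cost across every receipt.
                None => continue,
            }
        }
    }
}

#[tokio::main]
async fn main() {
    // With no reachable peers, 100 receipts cost 100 * 60 = 6000 slow attempts.
    send_pending_receipts((0..100).map(|_| Receipt).collect()).await;
}
```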
It's very hard to be sure at this point, but this does look like something along the lines of gaps in error handling and retry logic driving the program into resource exhaustion. The main thing for me is that I can't reproduce this on my own machine; I'll switch to a VM and see if it's easier to reproduce there.
@TechSupport26 Sorry, I've only just seen that you offered to help. You probably don't need to do anything special to reproduce this. If you see it happen again, dropping the logs on this issue would be really helpful! Probably the only other useful piece of information at this point is the Holochain version, because the problem doesn't seem to be app-specific at all.
Great, glad we've got some leads. I restarted the conductor and took logs again (I was delayed by about a minute in starting the logging). This time it got into the bad state almost immediately:
Conductor: conductor-logs.log
Top: top-output.log
LMK if this is still helpful. Happy to take logs from additional reproductions if useful.
Thanks, that is useful, especially the timestamps alongside the top output! I think I have enough logs for now; time to make some changes and re-test.
I have been able to reproduce this a couple of times now.
Using the Launcher to run Mewsfeed+TalkingStickies on my desktop:
Also on a mini computer running just Holochain and Mewsfeed v0.11.1. I can't compress that log small enough to upload because it's 1 GB raw. I can grab some stats about it though, given that I did nothing beyond installing the happ into Holochain.
The DHT database is ~68 MB. I ran Holochain for 5 hours and 21 minutes, in which time it accumulated over 77 minutes of CPU time (per the TIME+ column below).
Initial top output at 14:06
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
845 conduct+ 20 0 38.5g 1.0g 69804 S 3.7 13.0 1:28.86 holochain
After running for a while at 19:27
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
845 conduct+ 20 0 104.6g 1.4g 71292 S 101.3 19.3 77:15.73 holochain
My observations show CPU use being fairly high when syncing with a peer, which is reasonable. Then it was quiet for a while (~2 hours). From there, CPU usage grew incrementally to the 100% mark and a little above.
Looking at it again now, a short while later at 19:59, the CPU usage is staying consistently high. Judging by the TIME field, it is not dropping below that 100% mark at all.
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
845 conduct+ 20 0 104.6g 1.5g 71260 S 102.0 19.3 110:38.69 holochain
EDIT: I really want to stress that I did not interact with the happ at all, because that rules out a whole lot of code. No data was created by me. It's enough for Holochain to try to sync with other nodes that are not reliably reachable.
Awesome! Glad you were able to reproduce it.
> EDIT: I really want to stress that I did not interact with the happ at all, because that rules out a whole lot of code. No data was created by me. It's enough for Holochain to try to sync with other nodes that are not reliably reachable.
Likewise -- all my logs were from a conductor running headless on a VPS, with no interaction by the agent (reading or writing).
I think I'm getting somewhere with this. I found a problem with the validation receipt workflow running very slowly. I'm still working on speeding that up but it's looking good -> https://github.com/holochain/holochain/pull/2848
The other problem appears to be a 'flooding' issue of sorts, where we can build up a lot of triggers for workflows to run when they don't actually need to. I believe this is what was building up over time and causing such high CPU usage, or at least one of the causes. I'm still discussing that one with the team, but I have seen good results running it, so I will work towards turning it into a fix too -> https://github.com/holochain/holochain/pull/2870
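To illustrate the 'flooding' idea, here is a hypothetical sketch (again, not the actual conductor code) of coalescing triggers with tokio's Notify: however many triggers arrive while a workflow is already scheduled, they collapse into at most one extra run, rather than building up an unbounded backlog of wake-ups.

```rust
// Hypothetical sketch of trigger coalescing; the WorkflowTrigger type is
// made up for this example.
use tokio::sync::Notify;

struct WorkflowTrigger {
    notify: Notify,
}

impl WorkflowTrigger {
    fn new() -> Self {
        Self { notify: Notify::new() }
    }

    // Many callers can fire this; repeated calls before the workflow wakes
    // up collapse into a single stored permit rather than queueing up.
    fn trigger(&self) {
        self.notify.notify_one();
    }

    async fn listen(&self) {
        self.notify.notified().await;
    }
}

#[tokio::main]
async fn main() {
    let trigger = WorkflowTrigger::new();
    // 10,000 triggers arriving while the workflow is busy...
    for _ in 0..10_000 {
        trigger.trigger();
    }
    // ...still wake the workflow loop only once more.
    trigger.listen().await;
    println!("workflow ran once, not 10,000 times");
}
```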
This work isn't quite ready to go, but I thought it would be good to give a progress update anyway.
Unfortunately I'm not as close as I hoped with this. The code I had identified isn't causing the problem, so I'm back to the drawing board on this one. I still think it's a workflow issue, and I'm working through the workflows looking for a problem with how they trigger each other. From the logs, we're seeing missing dependencies being identified in system validation, and I'm hoping there's something to find in how the publish/incoming/validation workflows behave when that happens and there isn't a peer available to fetch the missing op from.
On the bright side, I am finding and fixing bugs, just not this one yet :)
It's also worth mentioning that the PR linked above for validation receipts (#2848) has been merged and back-ported. That should significantly speed up that workflow when the peer to send to is unreachable, which will clean up the logs and save a whole lot of CPU time if nothing else. It'll be in the next 0.2.3 release candidate, which I'm hoping we can put out next week.
I want to express my appreciation, @ThetaSinner. I see what you are doing as crucially important. Thanks :hugs:
This item has been open for 30 days with no activity.
Adding the permanent label. We're through a lot of the workflows and have hit the sys/app validation workflows, which are complex. Once we get through those and the integration workflow, I plan to return to this and re-test. I still think that finding and fixing bugs in the workflows is likely to have an impact here. Even if it doesn't, the work is turning up plenty of bugs that are worth fixing.
In at least some of the logs I've seen where this has happened, there has been a lot of log output around the fetch pool in Kitsune. I don't believe it will fix the root cause of this problem, but I expect it to improve the symptoms -> #3227