dkg-engine

High number of Timeouts and EMFILE errors

Open botnumberseven opened this issue 4 years ago • 3 comments

Expected behavior

smooth sailing

Actual behavior

Several things make two of my nodes behave very differently from the others:

  1. A high number of "Timed out waiting for response" messages: there can be 1000-3000 within a couple of hours, and a dozen of them can share the same millisecond.
  2. A high number of "warn - connect emfile 66.94.99.223:5306 - local (undefined:undefined)" warnings, which I've never seen on normally functioning nodes. After these warnings I see EMFILE errors such as "error - Caught exception: Error: spawn /usr/bin/node EMFILE."
  3. At some point, after hundreds of timeouts and EMFILE errors, the node process terminates with exit code 1 and restarts, but the restart doesn't really fix the issue.
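To put numbers on the three symptoms above, the quoted messages can be tallied from a log file. This is only a rough sketch: the substrings are taken from the messages quoted in this issue, while `count_messages`, the category names, and the `node.log` path in the usage comment are hypothetical and not part of the node.

```python
from collections import Counter

# Substrings taken from the log messages quoted in this issue.
# Matching is case-sensitive, so the lowercase "connect emfile" warning
# and the uppercase "EMFILE" spawn error are counted separately.
PATTERNS = {
    "timeout": "Timed out waiting for response",
    "connect_emfile": "connect emfile",
    "spawn_emfile": "EMFILE",
}


def count_messages(lines):
    """Count how many lines contain each known message substring."""
    counts = Counter()
    for line in lines:
        for name, needle in PATTERNS.items():
            if needle in line:
                counts[name] += 1
    return counts


# Usage (the path is an assumption, adjust to wherever the logs are stored):
# with open("node.log", errors="replace") as fh:
#     print(count_messages(fh))
```

Running this over logs from an affected node and from a healthy one would show whether the EMFILE warnings are really absent on the healthy nodes.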

So it seems the node tries to open a lot of TCP connections / sockets and at some point the OS limits it. As far as I can see, Netdata reports TCP queue overflows and drops from the queue. Also

The node nevertheless continues to bid on jobs and receives "I haven't been chosen".
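EMFILE means the process has hit its per-process limit on open file descriptors, and sockets count towards that limit. One way to test the hypothesis above is to compare the number of descriptors the node process holds against its soft limit. A minimal sketch, assuming Linux with `/proc` mounted (as on Ubuntu 18.04); `fd_usage` and `_read_soft_limit` are hypothetical helper names, and the PID of the node process has to be supplied by the caller.

```python
import os


def _read_soft_limit(limits_path):
    """Parse the soft 'Max open files' value from a /proc/<pid>/limits file."""
    with open(limits_path) as fh:
        for line in fh:
            if line.startswith("Max open files"):
                # Fields: "Max", "open", "files", soft limit, hard limit, units
                value = line.split()[3]
                return None if value == "unlimited" else int(value)
    return None


def fd_usage(pid=None):
    """Return (open_fds, soft_limit) for a process on Linux.

    pid=None inspects the current process. Reading another process's
    entries requires running as the same user or as root.
    """
    proc_dir = f"/proc/{pid if pid is not None else 'self'}"
    # Each entry in /proc/<pid>/fd is one open descriptor. For the current
    # process the count includes the descriptor used to list the directory.
    open_fds = len(os.listdir(os.path.join(proc_dir, "fd")))
    soft_limit = _read_soft_limit(os.path.join(proc_dir, "limits"))
    return open_fds, soft_limit


# Usage (12345 is a placeholder PID, not taken from the issue):
# used, limit = fd_usage(12345)
# print(f"{used} descriptors open, soft limit {limit}")
```

If the count sits close to the limit while the warnings appear, that would support the idea that the OS limit is being reached.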

I tried removing some data (import_cache, kadence.dht, replication_cache, bootstraps.json, peercache, router.json) to emulate a restore from backup, but that didn't really help.

Steps to reproduce the problem

Not sure, since these 2 nodes are no different from the others, yet only they exhibit the issue. It's not clear to me what triggers this behaviour.

Specifications

  • Node version: 5.1.1
  • Platform: Ubuntu 18.04
  • Node wallet:
  • ERC725 identity: 0xB9712dbeD9769ED25500Eb2e123472a86f45e6F7 and 0x9bc66a5e01fbfcb3e804cc60ad80ddc84ee17024

Error logs

Example of timeout within the same millisecond image

Example of warn emfile image

Example of EMFILE error image

Example of when node exited with return code 1 image

TCP drops image

Disclaimer

Please be aware that the issue reported on a public repository allows everyone to see your node logs, node details, and contact details. If you have any sensitive information, feel free to share it by sending an email to [email protected].

botnumberseven avatar Sep 15 '21 19:09 botnumberseven

I'm seeing this as well...

calr0x avatar Sep 22 '21 12:09 calr0x

Hey @calr0x and @botnumberseven, thanks for this submission. We've also seen this error occasionally and have pinpointed it to the Kadence library. From the tests we've performed, this error doesn't affect node functionality (other than looking bad in the logs), as the library is able to handle it; it is, however, in the scrum pipeline. Also, as we are going to replace Kadence with a different Kademlia implementation in v6 (due to this, but also other issues discovered), this will soon be a thing of the past.

branarakic avatar Oct 06 '21 15:10 branarakic

@branarakic agreed, since the Kadence library will be completely replaced in v6, it's not worth the effort to fix in v5.

botnumberseven avatar Oct 06 '21 16:10 botnumberseven

This issue is no longer relevant because it was for v5 and the current version of OT-node is v6.

NZT48 avatar Dec 27 '22 12:12 NZT48