lighthouse
validator client opens too many file descriptors
Description
when loading validator keys, lighthouse seems to keep file descriptors open, which causes the validator client to fail at startup when it loads a lot of keys (2000 in my case)
Version
Lighthouse v2.5.1-df51a73
Present Behaviour
Aug 15 02:08:45.487 INFO Enabled validator voting_pubkey: 0xa19c769736eea1675a7f4c398bb474dc1908b2485dbb20704c8a2369a09192a0488c464a56f31a4cd0b0f609d0253bc6, signing_method: local_keystore
Aug 15 02:08:46.600 ERRO Failed to initialize validator validator: 0x952add18112a2161dd0ef25587b680327b946e8fcc4e4e45a61e4e075062222094ba2ed18e445f13448c5f721e873dcc, signing_method: local_keystore, error: Lockfile(UnableToOpenFile("/home/banteg/.lighthouse/prater/validators/0x952add18112a2161dd0ef25587b680327b946e8fcc4e4e45a61e4e075062222094ba2ed18e445f13448c5f721e873dcc/keystore-m_12381_3600_827_0_0-1660500451.json.lock", Os { code: 24, kind: Uncategorized, message: "Too many open files" }))
Aug 15 02:08:46.649 CRIT Failed to start validator client reason: Unable to initialize validators: Lockfile(UnableToOpenFile("/home/banteg/.lighthouse/prater/validators/0x952add18112a2161dd0ef25587b680327b946e8fcc4e4e45a61e4e075062222094ba2ed18e445f13448c5f721e873dcc/keystore-m_12381_3600_827_0_0-1660500451.json.lock", Os { code: 24, kind: Uncategorized, message: "Too many open files" }))
Aug 15 02:08:46.649 INFO Internal shutdown received reason: Failed to start validator client
Aug 15 02:08:46.649 INFO Shutting down.. reason: Failure("Failed to start validator client")
Failed to start validator client
goerli.validator.service: Main process exited, code=exited, status=1/FAILURE
goerli.validator.service: Failed with result 'exit-code'.
Expected Behaviour
the client should start up successfully with any number of validators under the default linux config (file descriptor limit, `ulimit -n`, of 1024).
Steps to resolve
close the file after reading a keystore
We do close the keystore files after reading them. I think the issue you're encountering is due to the .lock file that we open, and keep open, for each keystore. This lock file prevents accidental re-use of keystores by multiple validator clients.
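To illustrate why one held-open lock file per keystore runs into the limit, here is a minimal Python sketch (not Lighthouse code). It temporarily lowers the soft file descriptor limit, opens hypothetical `keystore-N.json.lock` files without closing them, and reports the errno of the first failure. The function name and file names are made up for illustration.

```python
import errno
import os
import resource
import tempfile


def count_lockfiles_until_exhausted(directory, soft_limit):
    """Open lock files and keep them open until the OS refuses.

    Returns (number_opened, errno_of_failure). The soft RLIMIT_NOFILE is
    lowered to `soft_limit` for the duration and restored afterwards.
    """
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    held = []
    failure = None
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    try:
        # At most `soft_limit` descriptors can exist, so one extra
        # iteration guarantees that an open() call eventually fails.
        for i in range(soft_limit + 1):
            path = os.path.join(directory, f"keystore-{i}.json.lock")
            try:
                held.append(os.open(path, os.O_CREAT | os.O_WRONLY, 0o600))
            except OSError as exc:
                failure = exc.errno
                break
    finally:
        # Release the descriptors first, then restore the original limit.
        for fd in held:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
    return len(held), failure


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        opened, err = count_lockfiles_until_exhausted(tmp, 64)
        print(opened, errno.errorcode.get(err))
```

The failure errno is `EMFILE`, the same "Too many open files" condition (OS code 24 on Linux) that appears in the log above.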
Increasing the file descriptor limit is the recommended workaround (as you know). Open to other ideas for addressing this as well though.
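As a rough illustration of what "increasing the limit" means at the process level, the Python sketch below reads the soft and hard `RLIMIT_NOFILE` values and raises the soft limit toward a requested value, capped by the hard limit. It only affects the Python process that runs it. For the validator client itself the limit would normally be set in the shell or service manager that launches it. The function name and the value 4096 are assumptions for illustration.

```python
import resource


def raise_nofile_soft_limit(wanted):
    """Raise the soft open-file limit to `wanted`, capped by the hard limit.

    Returns the soft limit in effect afterwards. The limit is never lowered.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    # An unprivileged process may raise its soft limit only up to the hard limit.
    if hard == resource.RLIM_INFINITY:
        target = wanted
    else:
        target = min(wanted, hard)
    # Leave the limit alone if it is already unlimited or high enough.
    if soft == resource.RLIM_INFINITY or soft >= target:
        return soft
    resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return target


if __name__ == "__main__":
    # Hypothetical sizing: 2000 lock files plus headroom for everything else.
    print(raise_nofile_soft_limit(4096))
```

If the hard limit itself is below the number of keystores, the soft limit cannot be raised far enough this way and the hard limit has to be changed by an administrator.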
my apologies, i didn't see these were the lock files, you are correct. i think this issue can only really manifest itself on testnets given how large of a stake you need, but it might be helpful to mention this in the docs in case someone else tries to run >1000 validators.
What's the default limit for file locks? In my case it's unlimited, while the file descriptor limit is 1024 as expected:
❯ ulimit -aS
-t: cpu time (seconds) unlimited
-f: file size (blocks) unlimited
-d: data seg size (kbytes) unlimited
-s: stack size (kbytes) 8192
-c: core file size (blocks) 0
-m: resident set size (kbytes) unlimited
-u: processes 256512
-n: file descriptors 1024
-l: locked-in-memory size (kbytes) 8223860
-v: address space (kbytes) unlimited
-x: file locks unlimited
-i: pending signals 256512
-q: bytes in POSIX msg queues 819200
-e: max nice 0
-r: max rt priority 0
-N 15: rt cpu time (microseconds) unlimited
Could something like advisory locks work? I gave it a quick try, checking with lslocks, and it seems you can go over 1024 locks (tested with 5000 so far).
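One way the advisory-lock idea could be sketched, purely as an assumption and not a description of what the commenter tested or what Lighthouse ended up doing, is to take POSIX byte-range locks on a single shared lock file: one byte per validator index, all through one file descriptor. The function name and the single-file layout are hypothetical.

```python
import fcntl
import os
import tempfile


def lock_validator_slots(lock_path, count):
    """Take an exclusive advisory lock on bytes 0..count-1 of one file.

    Each byte stands for one (hypothetical) validator index. Returns the
    single file descriptor that holds every lock. Raises OSError if
    another process already holds any of the slots.
    """
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o600)
    try:
        for index in range(count):
            # Non-blocking exclusive lock on 1 byte at offset `index`.
            # Byte ranges past the end of the file can still be locked.
            fcntl.lockf(fd, fcntl.LOCK_EX | fcntl.LOCK_NB, 1, index, os.SEEK_SET)
    except OSError:
        os.close(fd)
        raise
    return fd


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        fd = lock_validator_slots(os.path.join(tmp, "validators.lock"), 5000)
        print("holding 5000 slots on descriptor", fd)
        os.close(fd)  # closing the descriptor releases the locks
```

Two caveats apply to this kind of lock: it is advisory, so it only stops processes that also ask for the lock, and POSIX record locks belong to the process and are dropped as soon as that process closes any descriptor for the file.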
Completed in #4796 🎉