
validator client opens too many file descriptors

Open banteg opened this issue 3 years ago • 3 comments

Description

When loading validator keys, Lighthouse seems to keep file descriptors open, which causes the validator client to fail to start when it loads a lot of keys (2000 in my case).

Version

Lighthouse v2.5.1-df51a73

Present Behaviour

Aug 15 02:08:45.487 INFO Enabled validator                       voting_pubkey: 0xa19c769736eea1675a7f4c398bb474dc1908b2485dbb20704c8a2369a09192a0488c464a56f31a4cd0b0f609d0253bc6, signing_method: local_keystore
Aug 15 02:08:46.600 ERRO Failed to initialize validator          validator: 0x952add18112a2161dd0ef25587b680327b946e8fcc4e4e45a61e4e075062222094ba2ed18e445f13448c5f721e873dcc, signing_method: local_keystore, error: Lockfile(UnableToOpenFile("/home/banteg/.lighthouse/prater/validators/0x952add18112a2161dd0ef25587b680327b946e8fcc4e4e45a61e4e075062222094ba2ed18e445f13448c5f721e873dcc/keystore-m_12381_3600_827_0_0-1660500451.json.lock", Os { code: 24, kind: Uncategorized, message: "Too many open files" }))
Aug 15 02:08:46.649 CRIT Failed to start validator client        reason: Unable to initialize validators: Lockfile(UnableToOpenFile("/home/banteg/.lighthouse/prater/validators/0x952add18112a2161dd0ef25587b680327b946e8fcc4e4e45a61e4e075062222094ba2ed18e445f13448c5f721e873dcc/keystore-m_12381_3600_827_0_0-1660500451.json.lock", Os { code: 24, kind: Uncategorized, message: "Too many open files" }))
Aug 15 02:08:46.649 INFO Internal shutdown received              reason: Failed to start validator client
Aug 15 02:08:46.649 INFO Shutting down..                         reason: Failure("Failed to start validator client")
Failed to start validator client
goerli.validator.service: Main process exited, code=exited, status=1/FAILURE
goerli.validator.service: Failed with result 'exit-code'.
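The `code: 24` in the log is `EMFILE` on Linux, which means this process ran out of descriptors under its own soft `RLIMIT_NOFILE`, not that the system as a whole ran out. A minimal Python sketch that reproduces the same error by holding one open file per hypothetical `keystore-N.json.lock`. It temporarily lowers the soft limit so the demo stays small, counts descriptors via `/proc/self/fd` (Linux-only), and is an illustration, not Lighthouse code:

```python
import errno
import os
import resource
import tempfile


def reproduce_emfile(extra: int = 8) -> str:
    """Hold one open file per lock until the soft RLIMIT_NOFILE is hit.

    Lowers the soft limit to a few descriptors above current usage so the
    demo is quick, then restores the original limits. Returns the symbolic
    errno of the failing open() ("EMFILE" expected), or "" if none failed.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    in_use = len(os.listdir("/proc/self/fd"))  # Linux-only descriptor count
    held = []
    with tempfile.TemporaryDirectory() as tmpdir:
        resource.setrlimit(resource.RLIMIT_NOFILE, (in_use + extra, hard))
        try:
            for i in range(in_use + extra + 1):
                path = os.path.join(tmpdir, f"keystore-{i}.json.lock")
                held.append(open(path, "w"))  # each held lock = one descriptor
        except OSError as exc:
            # "Too many open files" surfaces as EMFILE.
            return errno.errorcode[exc.errno]
        finally:
            for f in held:
                f.close()
            resource.setrlimit(resource.RLIMIT_NOFILE, (soft, hard))
    return ""
```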

Expected Behaviour

The client should start successfully with any number of validators under the default Linux config (ulimit of 1024).

Steps to resolve

Close the file after reading a keystore.
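A minimal Python sketch of this suggestion, assuming a plain JSON keystore file: a `with` block closes the descriptor as soon as parsing finishes, so the number of open descriptors does not grow with the number of keystores read. `open_fd_count` and `read_keystore` are illustrative names, and counting via `/proc/self/fd` is Linux-specific.

```python
import json
import os


def open_fd_count() -> int:
    """Descriptors currently open in this process (Linux-only, via /proc)."""
    return len(os.listdir("/proc/self/fd"))


def read_keystore(path: str) -> dict:
    """Parse a keystore file and release its descriptor immediately."""
    # Leaving the `with` block closes the file even if json.load() raises,
    # so reading N keystores does not leave N descriptors behind.
    with open(path, encoding="utf-8") as f:
        return json.load(f)
```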

banteg, Aug 15 '22 02:08

We do close the keystore files after reading them. I think the issue you're encountering is due to the .lock file that we open for each keystore. The lock file prevents accidental re-use of a keystore by multiple validator clients.
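A sketch of why the lock files, rather than the keystores, exhaust the limit. It assumes the lock is an advisory `flock(2)` lock on a `<keystore>.lock` file; that is an assumption for illustration, since Lighthouse's actual lockfile code is written in Rust and may differ. The lock only exists while its descriptor stays open, so holding one per keystore costs one descriptor per validator for the life of the process, and a second attempt on the same keystore is refused. `KeystoreLock` is a hypothetical name.

```python
import fcntl
import os


class KeystoreLock:
    """Hypothetical per-keystore lock: `<keystore>.lock` held with flock(2).

    The lock lives only as long as `self.fd` stays open, so a process that
    holds N keystore locks keeps N descriptors open.
    """

    def __init__(self, keystore_path: str) -> None:
        self.path = keystore_path + ".lock"
        self.fd = os.open(self.path, os.O_RDWR | os.O_CREAT, 0o600)
        try:
            # Non-blocking exclusive lock: raises BlockingIOError if another
            # holder (another client, or another open of this file) has it.
            fcntl.flock(self.fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except OSError:
            os.close(self.fd)
            raise

    def release(self) -> None:
        fcntl.flock(self.fd, fcntl.LOCK_UN)
        os.close(self.fd)  # the descriptor is only freed here
```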

Increasing the file descriptor limit is the recommended workaround (as you know). We're open to other ideas for addressing this, though.
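The limit can be raised in the shell with `ulimit -n`, or, for a systemd service such as `goerli.validator.service`, with the `LimitNOFILE=` directive in the unit file. A process can also raise its own soft limit up to its hard limit; a minimal Python sketch of that using the standard `resource` module (the function name is hypothetical, and this is not a claim about what Lighthouse does):

```python
import resource


def raise_nofile_soft_limit(wanted: int) -> int:
    """Raise the soft RLIMIT_NOFILE toward `wanted`; never lowers it.

    Without privileges the soft limit can go up to, but not past, the hard
    limit, so the request is capped at the hard limit. Returns the new soft
    limit.
    """
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    target = wanted if hard == resource.RLIM_INFINITY else min(wanted, hard)
    if soft != resource.RLIM_INFINITY and soft < target:
        resource.setrlimit(resource.RLIMIT_NOFILE, (target, hard))
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]
```

For example, `raise_nofile_soft_limit(4096)` before loading 2000 keystores; if the hard limit itself is too low, it has to be raised outside the process (unit file or system limits config).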

michaelsproul, Aug 15 '22 02:08

My apologies, I didn't see these were the lock files; you are correct. I think this issue can only really manifest itself on testnets, given how large a stake you need, but it might be helpful to mention it in the docs in case someone else tries to run more than 1000 validators.
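A back-of-the-envelope check that such a docs note could describe: with one lock-file descriptor per validator, 2000 validators cannot fit under a soft limit of 1024 no matter what else the process has open. The `reserved` headroom for sockets, database files, and logs below is a guessed placeholder, not a Lighthouse figure, and both function names are hypothetical.

```python
def required_nofile(num_validators: int, reserved: int = 64) -> int:
    """Descriptors needed: one lock file per validator plus `reserved`
    for everything else (a guessed placeholder, not a Lighthouse figure)."""
    return num_validators + reserved


def fits_fd_budget(num_validators: int, soft_limit: int, reserved: int = 64) -> bool:
    """True if the soft RLIMIT_NOFILE leaves room for every validator's lock."""
    return required_nofile(num_validators, reserved) <= soft_limit
```

With the numbers from this report, `required_nofile(2000)` is 2064, well above the default 1024.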

banteg, Aug 15 '22 02:08

What's the default limit for file locks? In my case it's unlimited, while the file descriptor limit is 1024 as expected:

❯ ulimit -aS
-t: cpu time (seconds)              unlimited
-f: file size (blocks)              unlimited
-d: data seg size (kbytes)          unlimited
-s: stack size (kbytes)             8192
-c: core file size (blocks)         0
-m: resident set size (kbytes)      unlimited
-u: processes                       256512
-n: file descriptors                1024
-l: locked-in-memory size (kbytes)  8223860
-v: address space (kbytes)          unlimited
-x: file locks                      unlimited
-i: pending signals                 256512
-q: bytes in POSIX msg queues       819200
-e: max nice                        0
-r: max rt priority                 0
-N 15: rt cpu time (microseconds)   unlimited
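On the file-lock limit: per the Linux `getrlimit(2)` man page, `RLIMIT_LOCKS` was only enforced in Linux 2.4.0 through 2.4.24, so "file locks unlimited" does not help here, and each held lock file still occupies a descriptor counted against `RLIMIT_NOFILE` (1024 above). A small Python sketch that reads both soft limits; `RLIMIT_LOCKS` may not be exposed by Python's `resource` module on every platform, hence the `getattr` fallback.

```python
import resource
from typing import Optional


def soft_limit(name: str) -> Optional[int]:
    """Soft limit for resource.<name>, or None if this platform lacks it."""
    which = getattr(resource, name, None)
    if which is None:
        return None
    return resource.getrlimit(which)[0]


limits = {
    "file descriptors": soft_limit("RLIMIT_NOFILE"),  # the limit EMFILE hits
    "file locks": soft_limit("RLIMIT_LOCKS"),  # not enforced on modern Linux
}
```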

Could something like advisory locks work? I gave it a quick try, checking with lslocks, and it seems you can go over 1024 locks (tested with 5000 so far).
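For advisory locks, both `flock(2)` and `fcntl(2)` locks are attached to an open file, so a lock cannot outlive its descriptor: holding 5000 locks in one process needs 5000 open descriptors, which the soft limit must allow. A minimal Python sketch showing that closing the only descriptor drops an `flock` lock (the function name is hypothetical):

```python
import fcntl
import os


def lock_survives_close(path: str) -> bool:
    """Take an advisory flock(2) lock, close its descriptor, then report
    whether the lock is still held (it should not be)."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
    os.close(fd)  # closing the only descriptor releases the flock lock
    probe = os.open(path, os.O_RDWR)
    try:
        fcntl.flock(probe, fcntl.LOCK_EX | fcntl.LOCK_NB)
        return False  # re-acquired, so the earlier lock was gone
    except BlockingIOError:
        return True
    finally:
        os.close(probe)  # also releases the probe's lock
```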

alex88, Sep 11 '22 16:09

Completed in #4796 🎉

jimmygchen, May 01 '24 01:05