Limit number of workers to prevent system OOM

Open lathiat opened this issue 2 years ago • 4 comments

It seems there is no limit to the number of concurrent workers for LXCFS requests.

When lxcfs requests are being served slowly for some reason (a deadlock, high load, or some other cause) and many such requests keep coming in, lxcfs can consume thousands of threads and tens to hundreds of GB of memory and crash the entire system, as seen while working on #471 and #579.

I suggest that we need a limit, even if a fairly high one, to prevent this from happening. This should include non-debug level logging of when the limit is hit.
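To make the suggestion concrete, here is a minimal sketch of a cap with logging. It is not lxcfs code, and the names MAX_INFLIGHT, request_begin and request_end are hypothetical. It caps in-flight requests rather than threads, on the assumption that rejecting a request early lets its thread return quickly, and it logs once each time the cap is hit.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdio.h>

/* Hypothetical cap for illustration; not an existing lxcfs option. */
#define MAX_INFLIGHT 4

static atomic_int inflight = 0;          /* requests currently being served */
static atomic_bool limit_logged = false; /* avoids flooding the log */

/* Reserve a slot for a request. Returns false when the cap is reached,
 * in which case the caller should fail the request early. */
static bool request_begin(void)
{
    int cur = atomic_fetch_add(&inflight, 1);
    if (cur >= MAX_INFLIGHT) {
        atomic_fetch_sub(&inflight, 1);
        /* Log at a non-debug level, but only once per saturation episode. */
        if (!atomic_exchange(&limit_logged, true))
            fprintf(stderr, "request limit (%d) reached, rejecting new requests\n",
                    MAX_INFLIGHT);
        return false;
    }
    return true;
}

/* Release the slot taken by a successful request_begin(). */
static void request_end(void)
{
    atomic_fetch_sub(&inflight, 1);
    /* A slot is free again, so re-arm the log message. */
    atomic_store(&limit_logged, false);
}
```

A request handler would call request_begin() on entry, return an error such as -EBUSY if it fails, and call request_end() on every exit path otherwise.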

lathiat avatar Jan 17 '23 04:01 lathiat

Perhaps there should also be a timeout for worker threads in lxcfs, after which lxcfs returns EIO to the application making the FUSE call. Would that prevent the libfuse+kernel deadlock even if we do end up with lots of stuck lxcfs threads?
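Roughly what I have in mind, as a sketch only and not how lxcfs is structured today: the slow work runs in a helper thread, and the caller waits on a condition variable with a deadline and gives up with -EIO. The names run_with_timeout, quick_job and slow_job are made up for illustration. Note that the stuck helper thread is not cancelled, so this unblocks the caller without reducing the thread count, and arg has to stay valid for as long as the helper might still run.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdlib.h>
#include <time.h>

struct job {
    pthread_mutex_t lock;
    pthread_cond_t done_cond;
    bool done;
    int result;
    atomic_int refs;      /* one reference for the caller, one for the helper */
    int (*fn)(void *);
    void *arg;
};

/* Drop one reference; whoever drops the last one frees the job. */
static void job_put(struct job *j)
{
    if (atomic_fetch_sub(&j->refs, 1) == 1) {
        pthread_cond_destroy(&j->done_cond);
        pthread_mutex_destroy(&j->lock);
        free(j);
    }
}

static void *job_main(void *p)
{
    struct job *j = p;
    int r = j->fn(j->arg);

    pthread_mutex_lock(&j->lock);
    j->result = r;
    j->done = true;
    pthread_cond_signal(&j->done_cond);
    pthread_mutex_unlock(&j->lock);
    job_put(j);
    return NULL;
}

/* Run fn(arg) in a helper thread and wait at most timeout_ms for it.
 * Returns fn's result, or -EIO if the deadline passes first. */
static int run_with_timeout(int (*fn)(void *), void *arg, int timeout_ms)
{
    struct job *j = calloc(1, sizeof(*j));
    if (!j)
        return -ENOMEM;
    pthread_mutex_init(&j->lock, NULL);
    pthread_cond_init(&j->done_cond, NULL);
    atomic_init(&j->refs, 2);
    j->fn = fn;
    j->arg = arg;

    pthread_t tid;
    if (pthread_create(&tid, NULL, job_main, j) != 0) {
        atomic_store(&j->refs, 1);   /* the helper never took its reference */
        job_put(j);
        return -EIO;
    }
    pthread_detach(tid);

    /* pthread_cond_timedwait takes an absolute CLOCK_REALTIME deadline. */
    struct timespec deadline;
    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_sec += timeout_ms / 1000;
    deadline.tv_nsec += (long)(timeout_ms % 1000) * 1000000L;
    if (deadline.tv_nsec >= 1000000000L) {
        deadline.tv_sec += 1;
        deadline.tv_nsec -= 1000000000L;
    }

    pthread_mutex_lock(&j->lock);
    int rc = 0;
    while (!j->done && rc == 0)
        rc = pthread_cond_timedwait(&j->done_cond, &j->lock, &deadline);
    int ret = j->done ? j->result : -EIO;
    pthread_mutex_unlock(&j->lock);
    job_put(j);
    return ret;
}

/* Example jobs: one returns immediately, one simulates a stuck read. */
static int quick_job(void *arg) { (void)arg; return 7; }

static int slow_job(void *arg)
{
    (void)arg;
    struct timespec ts = { .tv_sec = 1, .tv_nsec = 0 };
    nanosleep(&ts, NULL);
    return 0;
}
```

With these, run_with_timeout(quick_job, NULL, 1000) returns the job's own result, while run_with_timeout(slow_job, NULL, 50) gives up and returns -EIO.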

nkshirsagar avatar Jan 17 '23 05:01 nkshirsagar

It seems I was mistaken: the OOM was primarily due to the consuming applications behaving badly when their reads were stuck, and I wrongly assumed that most of the RAM was being consumed by lxcfs. So a limit would probably still be ideal, but lxcfs was not the main culprit.

A timeout may be sensible; however, depending on where the deadlock is, it may not be possible to act on it.

lathiat avatar Jan 17 '23 06:01 lathiat

@mihalicyn can lxcfs time out if a worker thread does not return within a specified time, and return EIO or similar to the caller?

nkshirsagar avatar Jan 17 '23 06:01 nkshirsagar

@lathiat @nkshirsagar yep, that's a good idea. I'll think about that, of course.

Update: libfuse versions >= 3.12.0 have a max_threads parameter: https://github.com/libfuse/libfuse/commit/af5710e7a3ad42e1b64ee8882fd72b22ffe271ac

The snap environment uses Ubuntu Focal, so there we have libfuse 3.9.0, which predates that parameter.

mihalicyn avatar Jan 17 '23 14:01 mihalicyn