[BUG] Stump should limit the parallel jobs, somehow
Bear in mind I'm new to this project, having just installed it a few hours ago. Please forgive any misunderstandings.
Description of the Bug
I installed Stump via docker-compose on my modest 4-cores/threads Intel i5 mini-home-server (with 20GB of RAM). Then I started adding several of my book directories to Stump. These contained around 10K to 20K files in total, among them several Humble Bundle book bundles, DRM-stripped ebooks, and other miscellaneous PDF, CBZ, and similar files.
After adding those directories, each one as a separate library (around 20 libraries), Stump started scanning them.
Minutes later, I noticed it stopped responding, and even the other services hosted on the same mini-home-server stopped responding as well.
After waiting for several more minutes, I finally got an SSH shell and could check htop. By this time, whatever Stump process had been running had already stopped. Probably aborted. But I noticed the CPU load average was around 200~250. That's too much!
Expected Behavior
It should scan everything, but it should never overload the machine.
I would expect Stump to have a pool of workers that would gradually consume the list of pending items to be scanned. This pool of workers would have a fixed size, and should never overload the machine.
I'm sorry this isn't a very deep or technical bug report, but I haven't investigated much further (yet).
To Reproduce
Steps to reproduce the behavior:
- Get as many book bundles as you can (e.g., collected over several years from Humble Bundle).
- Store them on a NAS (Network-Attached Storage).
- Mount the NAS directory on your server.
- Configure docker-compose to use this mounted NAS directory as `/data` inside the container. (And mark it as read-only, just out of caution.)
- Start Stump via `docker compose`.
- Add the directory(ies) as library(ies) in Stump.
- Wait. Observe `htop`, observe the load average, observe other server health metrics.
Debugging Information
Environment:
- OS: Debian Linux
- Device: Intel Core i5 + 20GB of RAM + 1Gbps Ethernet cable connection
Build Details:
- Version: aaronleopold/stump, tag = latest, image id = 347f99f0f814
- Docker: yes
Upon further inspection, I believe there could be a memory leak somewhere, or some mismanagement of the threads. There are too many open threads.
dmesg shows:

```
Out of memory: Killed process 434962 (stump) total-vm:39640724kB, anon-rss:14125300kB, file-rss:0kB, shmem-rss:0kB, UID:1000 pgtables:29640kB oom_score_adj:0
```
htop shows that `sh /entrypoint.sh` keeps taking more and more memory over time.
`docker compose stats` shows: (screenshot omitted)

`htop` shows: (screenshot omitted)

And while writing this comment, the usage according to `docker compose stats` increased to:

```
CONTAINER ID   NAME    CPU %    MEM USAGE / LIMIT     MEM %    NET I/O          BLOCK I/O        PIDS
84e8da84a53d   stump   93.32%   13.21GiB / 19.41GiB   68.04%   98.7kB / 212kB   110MB / 90.8MB   105
```

Meanwhile, the Stump web browser UI shows: (screenshot omitted)
Hey! Thanks for the detailed report! I'll preface my response by saying there are some knobs you can adjust to try to help with this, mostly the `STUMP_MAX_SCANNER_CONCURRENCY` configuration option. Also, I just want to address:

> I'm sorry this isn't a very deep or technical bug report, but I haven't investigated much further (yet).

This was a super helpful and well-written report! No sorries needed on my end.
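For reference, that knob can be set right in the compose file; a sketch (the value 8 here is just an example to try, not a recommendation for your hardware):

```yaml
services:
  stump:
    environment:
      # Cap how many files the scanner processes in parallel (example value)
      - STUMP_MAX_SCANNER_CONCURRENCY=8
```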
Offhand, I can't say where an issue might necessarily be taking place, but for context here's a bit of information about how Stump handles concurrency and jobs:
- Stump should not run jobs concurrently — by which I mean not threads, but application-level jobs (e.g., a scan of a specific library). The queuing of jobs happens here
- Stump jobs generally follow a pattern of functional tasks that run sequentially (not concurrently), with subtasks that may run in parallel
- There are a few controllers which spawn their own threads for lifecycle management, like the job controller
Your primary issue is with the scanner, which follows that pattern of functional tasks (e.g., walk library, walk series, create media, etc.) run sequentially. Walking the library is pretty trivial and I wouldn't necessarily look there for the culprit, but for context, here is `walk_library`, which basically sets up the rest of the scanning process. The part of the scanner I'd look at first for issues is the bit that loads files into memory (to some extent) to be processed. It kicks off up to `STUMP_MAX_SCANNER_CONCURRENCY` futures that process files in parallel, and each unit of work (a file) is processed on its own blocking thread.
It could very well be a flawed approach in how I built the scanner. I am not super savvy with IO operations within an asynchronous context; this entire project was created as a means of me learning Rust, so I may have made some fundamental mistakes in that realm. Hopefully, though, this gives a bit more context on how it works under the hood to tease out where the issue might be. If you have any thoughts about the implementation details (not sure how technical you are), please let me know!
The majority of my focus will remain on https://github.com/stumpapp/stump/discussions/634, but I obviously care about any performance issues so will do what I can to address this too. A couple follow-ups I can think of with all that context out of the way:
- If it is a problem with the loading/processing-of-files step of the scanner lifecycle, I would imagine a secondary scan wouldn't necessarily present much of an issue, since it wouldn't have to process any books beyond those which changed on disk
- I'd be curious if tweaking that knob I mentioned above helps at all
I'll note that while using Docker you don't need to expose 100% of your compute resources to every single container. You can limit them easily in your docker compose file.
https://docs.docker.com/reference/compose-file/deploy/#resources
So doing:

```yaml
deploy:
  resources:
    limits:
      cpus: "1.0"
      memory: 300M
```

would limit the container to 1 CPU core and 300 MB of RAM. This way, for applications that don't always have the ability to limit themselves, you can at least stop them from consuming all the available resources of the host and locking things up.
On further testing…
My Humble Bundle book collection is split into multiple directories, one per file type. I noticed that:
- A directory containing ~~332~~ 166 CBZ files (17.5GB total) was indexed with no problems at all.
- A directory containing ~~808~~ 404 EPUB files (18.1GB total) was indexed with no problems at all.
- A directory containing ~~977~~ 487 PDF files (44.6GB total) caused trouble. Shortly after the indexing started, I noticed the memory usage going up and up and up.
(EDIT: I had originally miscounted the amount of files.)
> https://docs.docker.com/reference/compose-file/deploy/#resources

I'm unsure what the difference is between the `deploy` top-level element and the per-service `mem_limit` option. Would both work? Is there any difference?
https://docs.docker.com/reference/compose-file/services/#mem_limit
> This way, for applications that don't always have the ability to limit themselves, you can at least stop them from consuming all the available resources of the host and locking things up.

Sounds like a good idea to include this in the example docker-compose file from this repository.
I wouldn't start including host-specific things like memory limits in the default compose file. Everyone's setup is unique, and by including it in the example the project effectively takes ownership of any problems it might introduce.
I've added `STUMP_MAX_SCANNER_CONCURRENCY=16` and tried indexing the directory with ~~977~~ 487 PDF files again. Same issue, or at least a similar-enough one. Look at a screenshot from lazydocker: (screenshot omitted)

Not sure why the memory usage went down a few times. It could be because it hit the 6GB memory limit I imposed in the docker-compose file. At least it seems it was progressing, and without slowing down my whole system. And it seems it managed to index everything correctly this time.
I think it could be worth at least mentioning how docker users could go about imposing memory limits in the docs, I'm all for accessibility of information. I'd agree in leaving out of the default/example compose file, though. I'll try and add a note about it somewhere in there before the end of the weekend.
Wrt your further testing, when you say a directory was "indexed with no problems at all", was this a secondary scan (i.e., all the books were already indexed, processed, and added to the database without changes on disk)? I'm mostly trying to gauge how suspicious I should be of the entire scanning process, or whether we can narrow it down to something being problematic in the PDF processing.
I'll admit that PDF is my least-used format; outside of a couple of DnD books and some one-offs from independent publishers I've purchased, I don't have any in my libraries. So reproducing your issues, if they are tied to PDF files, would be tricky.
I wouldn't say it's a secondary scan. I deleted the library and re-created it pointing to that directory. So, in my understanding, it was a first scan.
I noticed in the logs it mentioned (at least) once that it ran out of memory.
Regardless, everything got added to the library. Maybe after a secondary follow-up scan just to be extra sure everything was there.
I'll try to contact you by e-mail so I can help you reproduce this bug.