[BUG] Steady memory increase during thumbnail generation
**Description of the Bug**
The more thumbnails are generated, the more memory the application needs.
**Expected Behavior**
The application's memory usage should not depend on the number of thumbnails that have to be generated.
**To Reproduce**
Steps to reproduce the behavior:
- Run the application in a container with a hard memory limit.
- Add about 100 books - this amount seems to require around 1GB of RAM.
- Regenerate the thumbnails in the library with different memory limits.
**Debugging Information**
`journalctl -r`
**Environment:**
- OS: Arch Linux
**Build Details:**
- Version: 0.0.10
- Docker: yes
**Additional context**
Run as a Nomad job. Data (books) are mounted read-only.
> The more thumbnails are generated, the more memory the application needs
Are you actually observing a memory leak, e.g., does the memory continue to steadily increase after the job completes, or are you just observing high resource utilization during the lifecycle of the job?
Stump will, by default, generate 50 thumbnails at a time. This means it will load 50 full-res images into memory, convert and resize them in memory, then dump them to disk. You can configure this value to be lower so it doesn't process more than a preferred amount at once: https://www.stumpapp.dev/guides/configuration/server-options#stump_max_thumbnail_concurrency
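Roughly, the idea is something like this (a simplified sketch, not the actual Stump code, using the `futures` crate's `buffer_unordered`; `fake_load` and `fake_resize` are stand-ins): at most `max_concurrency` source images are decoded at once, and each image's buffers drop as soon as its thumbnail is written to disk.

```rust
// Simplified sketch of bounded-concurrency thumbnail generation (not Stump's
// actual implementation). Requires the `futures` and `tokio` crates.
use futures::{stream, StreamExt};

// Stand-in for decoding a full-resolution page into memory (~10 MiB here).
async fn fake_load(_path: &str) -> Vec<u8> {
    vec![0u8; 10 * 1024 * 1024]
}

// Stand-in for resizing; returns a much smaller buffer.
fn fake_resize(full_res: &[u8]) -> Vec<u8> {
    full_res.iter().step_by(100).copied().collect()
}

async fn generate_thumbnails(paths: Vec<String>, max_concurrency: usize) {
    stream::iter(paths)
        .map(|path| async move {
            let full_res = fake_load(&path).await;
            let thumb = fake_resize(&full_res);
            let _ = tokio::fs::write(format!("{path}.thumb"), &thumb).await;
            // `full_res` and `thumb` go out of scope here, so their buffers
            // should be freed before the stream admits the next item.
        })
        // Never more than `max_concurrency` items in flight at once.
        .buffer_unordered(max_concurrency)
        .collect::<Vec<_>>()
        .await;
}

#[tokio::main]
async fn main() {
    let paths: Vec<String> = (0..150).map(|i| format!("book_{i}")).collect();
    generate_thumbnails(paths, 3).await;
}
```

If memory still climbs across the whole run even with a low concurrency value, the held memory is coming from somewhere other than the in-flight images.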
Hey @aaronleopold, thank you for the quick response. Yes, it is not a leak but a steady increase in memory.
I just set `STUMP_MAX_THUMBNAIL_CONCURRENCY=10`, deleted all thumbnails, and started regenerating. Around thumbnail 144 it was OOM killed. The memory limit is 512MB of RAM.
With `STUMP_MAX_THUMBNAIL_CONCURRENCY=3` the 512MB were enough. Maybe 50 is a bit too high as a default?
Yeah of course! No problem
> I just set `STUMP_MAX_THUMBNAIL_CONCURRENCY=10`, deleted all thumbnails, and started regenerating. Around thumbnail 144 it was OOM killed. The memory limit is 512MB of RAM.
So this definitely points to something else going on, but I'm not 100% sure what. You're saying it OOM'd around batch 14 (144 images / 10 per batch), which implies some memory is being held between each batch and aligns with:
> it is not a leak but a steady increase in memory
Are you able to see the memory go down after the job completes? Or does docker not release the memory?
Also, could you share a little more about what the content is? Is it largely a specific format or a mix (and either way, what the general makeup is)? E.g., PDFs, EPUBs, CBZs, etc.
> With `STUMP_MAX_THUMBNAIL_CONCURRENCY=3` the 512MB were enough. Maybe 50 is a bit too high as a default?
I am open to adjusting the default to something lower for sure
If the images that have already been resized and saved to disk are not being freed before the next batch of thumbnails is generated, then I would classify that as a memory leak, but a memory leak that happens during the course of the job, no?
If the job handles 10 thumbnails concurrently, then it should only have 10 full-sized thumbnails in memory at a time.
> If the images that have already been resized and saved to disk are not being freed before the next batch of thumbnails is generated, then I would classify that as a memory leak, but a memory leak that happens during the course of the job, no?
I think yes, technically, but I mostly asked about the distinction originally to better understand the scope of the problem. I wasn't intending to bikeshed the semantics of the issue or invalidate the report.
At the end of the day, there is still a problem here either way.
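For what it's worth, one purely hypothetical pattern (not something I've confirmed in the code) that would produce this profile is output from earlier batches being kept alive for the lifetime of the job, e.g.:

```rust
// Hypothetical illustration only, not Stump's code: even with bounded
// concurrency, retaining each batch's output for the whole job makes memory
// grow linearly with the number of batches.
fn main() {
    // Buffers that live for the entire "job" instead of being dropped.
    let mut retained: Vec<Vec<u8>> = Vec::new();

    for batch in 0..15 {
        // Each "batch" produces 10 thumbnails of ~1 MiB each.
        let thumbnails: Vec<Vec<u8>> = (0..10).map(|_| vec![0u8; 1024 * 1024]).collect();

        // Bug pattern: keeping the buffers around after they are "written to
        // disk" instead of letting them drop at the end of the batch.
        retained.extend(thumbnails);

        let total_mib: usize = retained.iter().map(|t| t.len()).sum::<usize>() / (1024 * 1024);
        println!("after batch {batch}: ~{total_mib} MiB retained");
    }
}
```

Again, that is just an illustration of the shape of the problem, not a claim about where it actually is.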
> Are you able to see the memory go down after the job completes? Or does docker not release the memory?
I do not have good observability. I have to do some additional work to get more details.
All I can see is the OOM kill event. This usually happens after the application is notified that it is under memory pressure.
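A rough way to watch this from inside the container without extra tooling, assuming cgroup v2 (the path differs under cgroup v1), is to poll the cgroup's own memory accounting, e.g.:

```rust
// Minimal poller for the container's memory usage as the kernel accounts it.
// Assumes cgroup v2; under cgroup v1 read
// /sys/fs/cgroup/memory/memory.usage_in_bytes instead.
use std::{fs, thread, time::Duration};

fn main() {
    loop {
        match fs::read_to_string("/sys/fs/cgroup/memory.current") {
            Ok(raw) => {
                let bytes: u64 = raw.trim().parse().unwrap_or(0);
                println!("cgroup memory: {} MiB", bytes / (1024 * 1024));
            }
            Err(err) => eprintln!("could not read memory.current: {err}"),
        }
        thread::sleep(Duration::from_secs(5));
    }
}
```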
> Also, could you share a little more about what the content is? Is it largely a specific format or a mix (and either way, what the general makeup is)? E.g., PDFs, EPUBs, CBZs, etc.
The content is a mix of PDFs and EPUBs. There are 1-2 large PDFs (~100MB) in there. The EPUBs are small (~10MB max).
There is not a big performance difference between `STUMP_MAX_THUMBNAIL_CONCURRENCY=3` and `STUMP_MAX_THUMBNAIL_CONCURRENCY=10`. The machine is an EPYC 7002 server, and the book collection is on an HDD, not an SSD.
Measurements with `STUMP_MAX_THUMBNAIL_CONCURRENCY=3`:
Collection of about 150 books. Memory limit for the container set at 512MB.
| PID | USER | PR | NI | VIRT (KiB) | RES (KiB) | SHR (KiB) | S | %CPU | %MEM | TIME+ | COMMAND | Comment |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 292992 | user | 20 | 0 | 5157488 | 213032 | 37260 | S | 0.0 | 0.2 | 0:09.22 | stump | before thumbnail generation |
| 292992 | user | 20 | 0 | 5157488 | 290408 | 37952 | S | 0.0 | 0.2 | 0:20.38 | stump | after thumbnail generation |
| 292992 | user | 20 | 0 | 7008912 | 298088 | 37952 | S | 0.0 | 0.2 | 0:21.70 | stump | after some clicking around without opening a book |
| 292992 | user | 20 | 0 | 7008912 | 299112 | 37952 | S | 0.0 | 0.2 | 0:21.83 | stump | after some more clicking around without opening a book |
Maybe the web server assumes that, as long as there is enough RAM on the system, it can just keep caching things?
> There are 1-2 large PDFs (~100MB) in there
This could potentially be related to https://github.com/stumpapp/stump/issues/668. I'm not overly confident, but they also had memory issues with PDF processing (which this would fall under).
I tried to look at the relevant code during my lunch to suss out what might be causing the memory issues, and while I don't have anything concrete, I did draft something that might improve the situation. I did it on my working branch for the large backend migration, so I'll aim to port it to the non-migration versions to try and load test a bit.
@teodorkostov I pushed a new image with some tweaks if you're able to try it out. It is based on the nightly image, so if you want to revert back to the current `latest`, I would back up your database or just spin up a separate container.
If anyone is able to help verify whether this improves the situation, please report back any findings here. Otherwise, I will shift my focus back to the big migration since I have limited capacity to juggle both at the moment.