
[BUG] Steady memory increase during thumbnail generation

Open teodorkostov opened this issue 6 months ago • 10 comments

Description of the Bug

The more thumbnails are generated, the more memory the application needs.

Expected Behavior

The application's memory usage should not depend on the number of thumbnails that have to be generated.

To Reproduce

Steps to reproduce the behavior:

  1. Run the application in a container with a hard memory limit.
  2. Add about 100 books - this amount seems to require around 1 GB of RAM.
  3. Regenerate the thumbnails in the library with different memory limits.

Screenshots

If applicable, add screenshots to help explain your problem.

Debugging Information

journalctl -r

Environment:

  • OS: Arch Linux

Build Details:

  • Version: 0.0.10
  • Docker: yes

Additional context

Run as a Nomad job. The data (books) is mounted read-only.

teodorkostov avatar Jul 10 '25 14:07 teodorkostov

The more thumbnails are generated, the more memory the application needs

Are you actually observing a memory leak, e.g., the memory continues to steadily increase after the job completes, or are you just observing high resource utilization during the lifecycle of the job?

Stump will, by default, generate 50 thumbnails at a time. This means it will load 50 full-res images into memory, convert and resize them in memory, then dump to the disk. You can configure this value to be less, so it doesn't process more than a preferred amount at once: https://www.stumpapp.dev/guides/configuration/server-options#stump_max_thumbnail_concurrency
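
To make that concrete, here is a rough sketch of what semaphore-bounded thumbnail generation can look like. This is not Stump's actual code: the function name, the 400x400 target size, the PNG output, and the assumption that each book's cover has already been extracted to an image file are all illustrative, and it assumes the tokio and image crates.

```rust
use std::{path::PathBuf, sync::Arc};
use tokio::sync::Semaphore;

/// Generate thumbnails for a list of already-extracted cover images,
/// decoding at most `max_concurrency` of them at any one time.
async fn generate_thumbnails(covers: Vec<PathBuf>, out_dir: PathBuf, max_concurrency: usize) {
    let semaphore = Arc::new(Semaphore::new(max_concurrency));
    let mut handles = Vec::with_capacity(covers.len());

    for src in covers {
        // Wait for a free slot before touching the next image.
        let permit = semaphore.clone().acquire_owned().await.expect("semaphore closed");
        let name = src.file_name().expect("cover path has a file name").to_owned();
        let out = out_dir.join(name).with_extension("png");
        handles.push(tokio::task::spawn_blocking(move || {
            let full = image::open(&src).expect("failed to decode cover"); // full-res image held in memory
            let thumb = full.thumbnail(400, 400); // second, smaller buffer
            thumb.save(&out).expect("failed to write thumbnail"); // dumped to disk
            drop(permit); // slot freed; `full` and `thumb` are dropped when the closure returns
        }));
    }

    for handle in handles {
        let _ = handle.await;
    }
}
```

With this shape, peak memory is roughly `max_concurrency` full-resolution images plus their resized copies, which is why lowering the concurrency lowers the ceiling.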

aaronleopold avatar Jul 10 '25 14:07 aaronleopold

Hey @aaronleopold, thank you for the quick response. Yes, it is not a leak but a steady increase in memory.

I just set STUMP_MAX_THUMBNAIL_CONCURRENCY=10, deleted all thumbnails, and started regenerating. Around thumbnail 144 the process was OOM-killed. The memory limit is 512 MB of RAM.

With STUMP_MAX_THUMBNAIL_CONCURRENCY=3 the 512 MB were enough. Maybe 50 is a bit too high as a default?

teodorkostov avatar Jul 10 '25 15:07 teodorkostov

Yeah of course! No problem

I just set STUMP_MAX_THUMBNAIL_CONCURRENCY=10, deleted all thumbnails, and started regenerating. Around thumbnail 144 the process was OOM-killed. The memory limit is 512 MB of RAM.

So this definitely points to something else going on, though I'm not 100% sure what. You're saying it OOM'd around batch 14 (144 images / 10), which implies some memory is being held across batches and aligns with:

it is not a leak but a steady increase in memory

Are you able to see the memory go down after the job completes? Or does docker not release the memory?

Also, could you share a little more about what the content is? Is it largely a specific format or a mix (and either way, what the general makeup is)? E.g., PDFs, EPUBs, CBZs, etc.

With STUMP_MAX_THUMBNAIL_CONCURRENCY=3 the 512 MB were enough. Maybe 50 is a bit too high as a default?

I am open to adjusting the default to something lower for sure

aaronleopold avatar Jul 10 '25 16:07 aaronleopold

If the images that have already been resized and saved to disk are not being freed before the next batch of thumbnails is generated, then I would classify that as a memory leak, just one that happens during the lifetime of the job, no?

If the job handles 10 thumbnails concurrently, then it should only have 10 full-sized images (and their thumbnails) in memory at a time.
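
To make the distinction concrete, here is a hypothetical comparison (not taken from Stump's code) of a loop whose image buffers die with each iteration versus one that accumulates them for the lifetime of the job; only the second grows with the size of the library. It assumes the image crate, and the function names, paths, and sizes are made up.

```rust
use std::path::PathBuf;
use image::DynamicImage;

// Bounded: `full` and `thumb` are dropped at the end of every iteration,
// so peak memory is roughly one source image plus one thumbnail.
fn write_thumbnails_scoped(covers: &[PathBuf]) {
    for path in covers {
        let full: DynamicImage = image::open(path).expect("decode failed");
        let thumb = full.thumbnail(400, 400);
        thumb.save(path.with_extension("thumb.png")).expect("write failed");
        // nothing from this iteration survives into the next one
    }
}

// Unbounded: every generated thumbnail stays alive until the caller drops
// the Vec, so memory grows with the number of books processed, batch after batch.
fn collect_thumbnails_accumulating(covers: &[PathBuf]) -> Vec<DynamicImage> {
    covers
        .iter()
        .map(|path| image::open(path).expect("decode failed").thumbnail(400, 400))
        .collect()
}
```

If anything shaped like the second pattern (or a per-book cache that is never evicted) sits inside the job, the concurrency setting bounds how fast memory grows per batch but not how far it grows overall, which would match an OOM around batch 14.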

clseibold avatar Jul 10 '25 19:07 clseibold

If the images that have already been resized and saved to disk are not being freed before the next batch of thumbnails is generated, then I would classify that as a memory leak, just one that happens during the lifetime of the job, no?

I think yes, technically, but I mostly asked about the distinction originally to better understand the scope of the problem. I wasn't intending to bikeshed the semantics of the issue or invalidate the report.

At the end of the day, there is still a problem here either way

aaronleopold avatar Jul 10 '25 19:07 aaronleopold

Are you able to see the memory go down after the job completes? Or does docker not release the memory?

I do not have good observability. I have to do some additional work to get more details.

All I can see is the OOM kill event. This usually happens after the application is notified that it is under memory pressure.

Also, could you share a little more about what the content is? Is it largely a specific format or a mix (and either way, what the general makeup is)? E.g., PDFs, EPUBs, CBZs, etc.

The content is a mix of PDFs and EPUBs. There are 1-2 large PDFs (~100MB) in there. The EPUBs are small (~10MB max).

There is not a big performance difference between STUMP_MAX_THUMBNAIL_CONCURRENCY=3 and STUMP_MAX_THUMBNAIL_CONCURRENCY=10. The machine is an EPYC 7002 server. The book collection is on an HDD, not an SSD.

teodorkostov avatar Jul 11 '25 10:07 teodorkostov

STUMP_MAX_THUMBNAIL_CONCURRENCY=3

Collection of about 150 books. The memory limit for the container is set at 512 MB.

PID     USER  PR  NI  VIRT     RES     SHR    S  %CPU  %MEM  TIME+    COMMAND  comment
292992  user  20   0  5157488  213032  37260  S   0.0   0.2  0:09.22  stump    before thumbnail generation
292992  user  20   0  5157488  290408  37952  S   0.0   0.2  0:20.38  stump    after thumbnail generation
292992  user  20   0  7008912  298088  37952  S   0.0   0.2  0:21.70  stump    after some clicking around without opening a book
292992  user  20   0  7008912  299112  37952  S   0.0   0.2  0:21.83  stump    after some more clicking around without opening a book

Maybe the web server assumes that, as long as there is enough RAM on the system, it can just keep caching things?

teodorkostov avatar Jul 11 '25 11:07 teodorkostov

There are 1-2 large PDFs (~100MB) in there

This could potentially be related to https://github.com/stumpapp/stump/issues/668. I'm not overly confident, but that report also involves memory issues with PDF processing (which this would fall under).

I tried to look at the relevant code during my lunch to suss out what might be causing the memory issues, and while I don't have anything concrete, I did draft something that might improve the situation. I did it on my working branch for the large backend migration, so I'll aim to port it to the non-migration versions to try and load test a bit.
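
For anyone who wants to help with that load testing (or with the observability gap mentioned earlier), a tiny Linux-only helper along these lines can log a process's resident memory while a regeneration runs. It is not part of Stump; it is just a sketch that reads VmRSS from /proc.

```rust
use std::{env, fs, thread, time::Duration};

/// Read VmRSS (resident set size, in KiB) for a process from /proc.
/// `pid` can be a numeric PID or "self" for the current process. Linux-only.
fn resident_kib(pid: &str) -> Option<u64> {
    let status = fs::read_to_string(format!("/proc/{pid}/status")).ok()?;
    status
        .lines()
        .find(|line| line.starts_with("VmRSS:"))
        .and_then(|line| line.split_whitespace().nth(1))
        .and_then(|kib| kib.parse().ok())
}

fn main() {
    // Pass the Stump server's PID as the first argument and let this run
    // while thumbnails regenerate; a steady climb that never comes back
    // down after the job finishes is the thing to look for.
    let pid = env::args().nth(1).unwrap_or_else(|| "self".to_string());
    while let Some(kib) = resident_kib(&pid) {
        println!("VmRSS: {kib} KiB");
        thread::sleep(Duration::from_secs(1));
    }
}
```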

aaronleopold avatar Jul 12 '25 01:07 aaronleopold

@teodorkostov I pushed a new image with some tweaks if you're able to try it out. It is based on the nightly image, so if you want to revert back to the current latest, I would back up your database or just spin up another, separate container.

aaronleopold avatar Jul 17 '25 21:07 aaronleopold

If anyone is able to help verify whether this improves the situation, please report back any findings here. Otherwise, I will shift my focus back to the big migration, since I have limited capacity to juggle both at the moment.

aaronleopold avatar Jul 24 '25 17:07 aaronleopold