
hugo stalls when hitting hard memory limit

TiGR opened this issue

Not sure how to reproduce it; maybe by reducing HUGO_MEMORYLIMIT and testing on some huge website. In the case of our huge website, we can normally run hugo build with an 8 GB memory limit, but if I run hugo server with the same limit, Hugo hits the limit at some point, all CPU usage drops to 0%, and Hugo just sits there without doing a thing (no new files are created either).
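As far as I can tell, HUGO_MEMORYLIMIT maps onto Go's soft memory limit, which would explain the behaviour: a soft limit only makes the runtime collect garbage more aggressively as it is approached; it never makes the process fail. A minimal sketch of that mechanism, assuming this is roughly what Hugo does internally (not its actual code; the default value is made up):

package main

import (
	"os"
	"runtime/debug"
	"strconv"
)

func main() {
	// HUGO_MEMORYLIMIT is given in gigabytes (see the logs below).
	gb, err := strconv.ParseFloat(os.Getenv("HUGO_MEMORYLIMIT"), 64)
	if err != nil || gb <= 0 {
		gb = 8 // hypothetical default, for the sketch only
	}
	// This sets a *soft* limit: the runtime GCs harder as the limit is
	// approached, but the process is never killed, so exceeding the
	// limit degrades throughput instead of failing outright.
	debug.SetMemoryLimit(int64(gb * 1024 * 1024 * 1024))

	// ... build the site ...
}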

Running hugo server --logLevel info

Start building sites … 
hugo v0.124.0-DEV+extended linux/amd64 BuildDate=unknown

INFO  static: syncing static files to / duration 24.426106ms
INFO  build:  step process substep collect files 37000 files_total 37000 duration 3.013660535s
INFO  build:  step process substep collect files 102422 files_total 139422 duration 2.721319973s
INFO  build:  step process duration 5.735132833s
INFO  build:  step assemble duration 4.368413267s
INFO  build:  step render substep pages site en outputFormat html duration 1m14.605222904s
INFO  build:  step render substep pages site en outputFormat csv duration 439.775463ms
INFO  build:  step render substep pages site en outputFormat json duration 1.694182069s
INFO  build:  step render substep pages site en outputFormat rss duration 856.068872ms
INFO  dynacache: adjusted partitions' max size evicted 7858 numGC 64 limit 7.68 GB alloc 8.00 GB totalAlloc 83.67 GB
INFO  dynacache: adjusted partitions' max size evicted 7237 numGC 65 limit 7.68 GB alloc 8.17 GB totalAlloc 87.76 GB
INFO  dynacache: adjusted partitions' max size evicted 6099 numGC 70 limit 7.68 GB alloc 8.09 GB totalAlloc 107.07 GB
INFO  dynacache: adjusted partitions' max size evicted 7398 numGC 71 limit 7.68 GB alloc 8.23 GB totalAlloc 107.39 GB
INFO  dynacache: adjusted partitions' max size evicted 2218 numGC 71 limit 7.68 GB alloc 7.96 GB totalAlloc 111.00 GB
INFO  build:  step render substep pages site es outputFormat html duration 1m17.620461596s
INFO  build:  step render substep pages site es outputFormat csv duration 376.800492ms
INFO  build:  step render substep pages site es outputFormat json duration 1.446087386s
INFO  build:  step render substep pages site es outputFormat rss duration 582.231744ms
INFO  dynacache: adjusted partitions' max size evicted 8436 numGC 75 limit 7.68 GB alloc 8.16 GB totalAlloc 125.20 GB
INFO  dynacache: adjusted partitions' max size evicted 4562 numGC 76 limit 7.68 GB alloc 7.76 GB totalAlloc 128.92 GB
INFO  dynacache: adjusted partitions' max size evicted 2837 numGC 77 limit 7.68 GB alloc 8.18 GB totalAlloc 133.44 GB
INFO  dynacache: adjusted partitions' max size evicted 3769 numGC 78 limit 7.68 GB alloc 8.42 GB totalAlloc 137.72 GB
INFO  dynacache: adjusted partitions' max size evicted 7235 numGC 79 limit 7.68 GB alloc 8.72 GB totalAlloc 142.06 GB
INFO  dynacache: adjusted partitions' max size evicted 7577 numGC 80 limit 7.68 GB alloc 8.08 GB totalAlloc 145.92 GB
INFO  dynacache: adjusted partitions' max size evicted 8250 numGC 81 limit 7.68 GB alloc 7.89 GB totalAlloc 149.91 GB
INFO  dynacache: adjusted partitions' max size evicted 7581 numGC 82 limit 7.68 GB alloc 7.77 GB totalAlloc 153.98 GB
INFO  dynacache: adjusted partitions' max size evicted 3157 numGC 84 limit 7.68 GB alloc 8.19 GB totalAlloc 162.63 GB
INFO  dynacache: adjusted partitions' max size evicted 1492 numGC 85 limit 7.68 GB alloc 7.99 GB totalAlloc 166.28 GB
INFO  dynacache: adjusted partitions' max size evicted 5796 numGC 86 limit 7.68 GB alloc 7.97 GB totalAlloc 167.16 GB
INFO  dynacache: adjusted partitions' max size evicted 2363 numGC 86 limit 7.68 GB alloc 8.10 GB totalAlloc 170.69 GB

And here it just hangs. I'd expect it to fail somehow if it has run out of memory.
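The dynacache lines above hint at the mechanism: after a GC cycle, the cache compares the live heap allocation against the limit and evicts entries if it is over. A rough sketch of that pattern, as an illustration only (not Hugo's dynacache implementation):

package main

import (
	"fmt"
	"runtime"
)

type cache struct {
	maxSize int
	items   map[string]any
}

// adjustToMemoryLimit shrinks the cache's allowed size and evicts entries
// whenever the live heap exceeds the configured limit.
func (c *cache) adjustToMemoryLimit(limitBytes uint64) int {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	if m.Alloc <= limitBytes {
		return 0
	}
	// Over the limit: halve the allowed size and evict down to it.
	c.maxSize /= 2
	evicted := 0
	for k := range c.items {
		if len(c.items) <= c.maxSize {
			break
		}
		delete(c.items, k)
		evicted++
	}
	fmt.Printf("evicted %d numGC %d alloc %.2f GB\n",
		evicted, m.NumGC, float64(m.Alloc)/(1<<30))
	return evicted
}

func main() {
	c := &cache{maxSize: 4, items: map[string]any{"a": 1, "b": 2, "c": 3}}
	c.adjustToMemoryLimit(1 << 30) // pretend the limit is 1 GB
}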

Also, if I reduce the limit, regular hugo build also stalls:

$ HUGO_MEMORYLIMIT=3 ~/go/bin/hugo --logLevel info
Start building sites … 
hugo v0.124.0-DEV+extended linux/amd64 BuildDate=unknown

INFO  static: syncing static files to / duration 146.63746ms
INFO  build:  step process substep collect files 37000 files_total 37000 duration 3.00951775s
INFO  build:  step process substep collect files 102422 files_total 139422 duration 2.852218241s
INFO  build:  step process duration 5.861893274s
INFO  build:  step assemble duration 4.53627309s
INFO  dynacache: adjusted partitions' max size evicted 2282 numGC 26 limit 3.00 GB alloc 4.09 GB totalAlloc 7.56 GB
INFO  dynacache: adjusted partitions' max size evicted 4109 numGC 27 limit 3.00 GB alloc 3.62 GB totalAlloc 8.89 GB
INFO  dynacache: adjusted partitions' max size evicted 3740 numGC 27 limit 3.00 GB alloc 4.76 GB totalAlloc 10.04 GB
INFO  dynacache: adjusted partitions' max size evicted 2188 numGC 30 limit 3.00 GB alloc 4.38 GB totalAlloc 16.54 GB

And here it stalled.

TiGR avatar Feb 23 '24 15:02 TiGR

I suspect this is a deadlock with the cache mutex (or something), which isn't the easiest thing in the world to debug, but I guess I should be able to trigger this. I have tested this pretty heavily using a 1 GB limit on some very big sites, but I guess the devil is in the details.

A temporary workaround, I guess, would be to set the memory limit to something very high.
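For example, using the same environment variable that appears in the logs above (100 GB being an arbitrarily high value):

HUGO_MEMORYLIMIT=100 hugo server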

bep avatar Feb 23 '24 15:02 bep

I have pushed 564bae06f6513cbca80bd54411f9a66ec2115995 -- which I don't think will fix your problem.

I have spent some time building some pretty beefy sites with HUGO_MEMORYLIMIT=1, but I haven't been able to force this issue.

Note that I don't doubt that this is a real issue that needs to be fixed, but I need to look closer to reproduce it.

I notice from the log that the amount of memory you allocate is on a completely different level than the sites I'm testing on. Can you elaborate a little on what all the memory is used for? Image processing? Do you have an image cache?

bep avatar Feb 23 '24 16:02 bep

It is really hard for me to determine what Hugo uses memory for. Yes, we have 2,500 images, and around 12,000 pages (multiplied by 10 translations). Most of the images have previews generated, many of them are additionally resized and written out as both JPEG and WebP, and the base-language build alone generates 25,000 images.

Do you have an image cache?

Do you mean resources/_gen? If so, then yes.

TiGR avatar Feb 23 '24 18:02 TiGR

I have pushed 564bae0 -- which I don't think will fix your problem.

It does not; Hugo still stalls on master.

TiGR avatar Feb 24 '24 07:02 TiGR

We have started seeing stalls in our cloud builds (we run Hugo 0.123.2 there at this point). We use the same memory limits we had for 0.122, that is, HUGO_MEMORYLIMIT is 22. During the build, Hugo does not exceed 11 GB, but we see random, unpredictable stalls. I can't reproduce anything like this on my machine, but we have this problem in the cloud.

Example build:

time=2024-02-27T11:24:30Z | msg="Start building sites … "
time=2024-02-27T11:24:30Z | msg="hugo v0.123.2-929b91fe75cb0d041f22b4707700dfc117115ad4+extended linux/amd64 BuildDate=2024-02-22T15:27:15Z VendorInfo=gohugoio"
time=2024-02-27T11:24:30Z | msg="INFO  static: syncing static files to /"
time=2024-02-27T11:24:30Z | msg=" duration 141.026102ms"
time=2024-02-27T11:24:32Z | msg="INFO  dynacache: adjusted partitions' max size evicted 4537 numGC 16 limit 22.00 GB alloc 0.17 GB totalAlloc 0.37 GB"
time=2024-02-27T11:24:33Z | msg="INFO  build:  step process substep collect files 41000 files_total 41000 duration 3.014767052s"
time=2024-02-27T11:24:36Z | msg="INFO  build:  step process substep collect files 38000 files_total 79000 duration 3.004260141s"
time=2024-02-27T11:24:39Z | msg="INFO  build:  step process substep collect files 33000 files_total 112000 duration 3.028676451s"
time=2024-02-27T11:24:42Z | msg="INFO  build:  step process substep collect files 27870 files_total 139870 duration 2.447407666s"
time=2024-02-27T11:24:42Z | msg="INFO  build:  step process duration 11.495399175s"
time=2024-02-27T11:24:49Z | msg="INFO  build:  step assemble duration 7.466807118s"
time=2024-02-27T11:27:08Z | msg="INFO  build:  step render substep pages site en outputFormat html duration 2m18.177601878s"
time=2024-02-27T11:27:08Z | msg="INFO  build:  step render substep pages site en outputFormat csv duration 333.367446ms"
time=2024-02-27T11:27:12Z | msg="INFO  build:  step render substep pages site en outputFormat json duration 3.168703346s"
time=2024-02-27T11:27:13Z | msg="INFO  build:  step render substep pages site en outputFormat rss duration 1.284847777s"
time=2024-02-27T11:29:16Z | msg="INFO  build:  step render substep pages site es outputFormat html duration 2m3.000824066s"
time=2024-02-27T11:29:17Z | msg="INFO  build:  step render substep pages site es outputFormat csv duration 313.333325ms"
time=2024-02-27T11:29:19Z | msg="INFO  build:  step render substep pages site es outputFormat json duration 1.968936201s"
time=2024-02-27T11:29:20Z | msg="INFO  build:  step render substep pages site es outputFormat rss duration 818.80005ms"
time=2024-02-27T11:31:18Z | msg="INFO  build:  step render substep pages site de outputFormat html duration 1m58.147270281s"
time=2024-02-27T11:31:19Z | msg="INFO  build:  step render substep pages site de outputFormat csv duration 333.32989ms"
time=2024-02-27T11:31:21Z | msg="INFO  build:  step render substep pages site de outputFormat json duration 2.148809437s"
time=2024-02-27T11:31:22Z | msg="INFO  build:  step render substep pages site de outputFormat rss duration 781.219217ms"
time=2024-02-27T11:33:18Z | msg="INFO  build:  step render substep pages site ru outputFormat html duration 1m55.313464896s"
time=2024-02-27T11:33:18Z | msg="INFO  build:  step render substep pages site ru outputFormat csv duration 319.735755ms"
time=2024-02-27T11:33:20Z | msg="INFO  build:  step render substep pages site ru outputFormat json duration 1.988445374s"
time=2024-02-27T11:33:21Z | msg="INFO  build:  step render substep pages site ru outputFormat rss duration 778.596956ms"

And here it stalled, even though there were no INFO messages about dynacache evictions. So it seems that stalls happen not only under extreme memory pressure, but randomly whenever GC runs.

TiGR avatar Feb 27 '24 15:02 TiGR

So it seems that stalls happen not only under extreme memory pressure, but randomly whenever GC runs.

So, it's most likely a mutex deadlock, but it's not obvious to me from inspecting the code, and I need to reproduce it to understand and fix it. But I will try to spend some time on this today.
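For illustration, the classic shape of such a deadlock: a cache whose get-or-create path holds its mutex while running the create callback, which then re-enters the same cache. All names here are hypothetical; this is a sketch of the failure mode, not Hugo's code:

package main

import "sync"

type memoCache struct {
	mu sync.Mutex
	m  map[string]any
}

func (c *memoCache) GetOrCreate(key string, create func() any) any {
	c.mu.Lock()
	defer c.mu.Unlock() // held while create runs: the bug
	if v, ok := c.m[key]; ok {
		return v
	}
	v := create()
	c.m[key] = v
	return v
}

func main() {
	c := &memoCache{m: map[string]any{}}
	// Deadlocks: the inner GetOrCreate blocks on the mutex that the
	// outer call already holds. In a real build with many goroutines,
	// this shows up as the whole process stalling at 0% CPU.
	c.GetOrCreate("outer", func() any {
		return c.GetOrCreate("inner", func() any { return 1 })
	})
}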

bep avatar Feb 29 '24 10:02 bep

@TiGR what would be useful is if you could do a kill -SIGABRT <pid> on a stuck process and post the stack trace (which I suspect will be long) in a comment here or something.
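The Go runtime prints goroutine stacks when the process receives SIGABRT, which is what makes this useful for spotting a deadlock. For reference, the same dump can also be captured in-process with runtime.Stack; a minimal sketch:

package main

import (
	"os"
	"runtime"
)

func dumpGoroutines() {
	buf := make([]byte, 1<<20)    // 1 MB buffer for the trace
	n := runtime.Stack(buf, true) // true = all goroutines, not just this one
	os.Stderr.Write(buf[:n])
}

func main() {
	dumpGoroutines()
}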

bep avatar Feb 29 '24 15:02 bep

OK, I think I have reproduced this building a beefy Hugo site on an Ubuntu VM with only 1 GB of memory (and the memory limit set to 256 MB).

I will start by making it possible to set HUGO_MEMORYLIMIT to something other than gigabytes, so I can hopefully reproduce this on my main box (which would make it much easier to debug).

bep avatar Feb 29 '24 16:02 bep

Strike that, it actually completed; it just took very long (because of image processing, I guess; the second run with the cache was faster).

bep avatar Feb 29 '24 17:02 bep

Another reliable way to reproduce stalls is to build a big website with HUGO_NUMWORKERMULTIPLIER=32.
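For context, a sketch of what a worker-multiplier setting like this plausibly scales: the size of the pool of goroutines doing parallel work. Whether Hugo multiplies it by the CPU count is an assumption here; the point is that many more workers means many more pages in flight at once, hence more allocation and more lock contention:

package main

import (
	"fmt"
	"runtime"
	"sync"
)

func main() {
	multiplier := 32 // the value from the comment above
	// Assumption: the multiplier scales the per-CPU worker count.
	workers := multiplier * runtime.NumCPU()

	jobs := make(chan int)
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for page := range jobs {
				_ = page // render the page here
			}
		}()
	}
	for p := 0; p < 100; p++ {
		jobs <- p
	}
	close(jobs)
	wg.Wait()
	fmt.Println("rendered with", workers, "workers")
}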

TiGR avatar Feb 29 '24 20:02 TiGR

@TiGR what would be useful is if you could do a kill -SIGABRT <pid> on a stuck process and post the stack trace (which I suspect will be long) in a comment here or something.

hugo-stall-stack-trace.txt

TiGR avatar Mar 01 '24 06:03 TiGR

The issue is still there in Hugo v0.124.

TiGR avatar Mar 17 '24 07:03 TiGR

@TiGR sorry, I have been a little busy with other things. I see the deadlock in your stack trace and even if I don't have a failing test for it, the fix looks obvious, so I'm going to apply a fix for that later today. I'm boldly closing this issue from that PR, but if it is still a problem, please shout.

bep avatar Mar 20 '24 10:03 bep

This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.

github-actions[bot] avatar Apr 11 '24 01:04 github-actions[bot]