hugo
hugo copied to clipboard
hugo stalls when hitting hard memory limit
Not sure how to reproduce it, maybe by reducing HUGO_MEMORYLIMIT and taking some huge website to test. In case of our huge website we can normally run hugo build with 8Gb memory limit, but if I run hugo server on the same limit, hugo hits the limit at some point and then all cpu usage drops to 0% and hugo just sits there, without doing a thing (no new files are created too).
Running hugo server --logLevel info
Start building sites …
hugo v0.124.0-DEV+extended linux/amd64 BuildDate=unknown
INFO static: syncing static files to / duration 24.426106ms
INFO build: step process substep collect files 37000 files_total 37000 duration 3.013660535s
INFO build: step process substep collect files 102422 files_total 139422 duration 2.721319973s
INFO build: step process duration 5.735132833s
INFO build: step assemble duration 4.368413267s
INFO build: step render substep pages site en outputFormat html duration 1m14.605222904s
INFO build: step render substep pages site en outputFormat csv duration 439.775463ms
INFO build: step render substep pages site en outputFormat json duration 1.694182069s
INFO build: step render substep pages site en outputFormat rss duration 856.068872ms
INFO dynacache: adjusted partitions' max size evicted 7858 numGC 64 limit 7.68 GB alloc 8.00 GB totalAlloc 83.67 GB
INFO dynacache: adjusted partitions' max size evicted 7237 numGC 65 limit 7.68 GB alloc 8.17 GB totalAlloc 87.76 GB
INFO dynacache: adjusted partitions' max size evicted 6099 numGC 70 limit 7.68 GB alloc 8.09 GB totalAlloc 107.07 GB
INFO dynacache: adjusted partitions' max size evicted 7398 numGC 71 limit 7.68 GB alloc 8.23 GB totalAlloc 107.39 GB
INFO dynacache: adjusted partitions' max size evicted 2218 numGC 71 limit 7.68 GB alloc 7.96 GB totalAlloc 111.00 GB
INFO build: step render substep pages site es outputFormat html duration 1m17.620461596s
INFO build: step render substep pages site es outputFormat csv duration 376.800492ms
INFO build: step render substep pages site es outputFormat json duration 1.446087386s
INFO build: step render substep pages site es outputFormat rss duration 582.231744ms
INFO dynacache: adjusted partitions' max size evicted 8436 numGC 75 limit 7.68 GB alloc 8.16 GB totalAlloc 125.20 GB
INFO dynacache: adjusted partitions' max size evicted 4562 numGC 76 limit 7.68 GB alloc 7.76 GB totalAlloc 128.92 GB
INFO dynacache: adjusted partitions' max size evicted 2837 numGC 77 limit 7.68 GB alloc 8.18 GB totalAlloc 133.44 GB
INFO dynacache: adjusted partitions' max size evicted 3769 numGC 78 limit 7.68 GB alloc 8.42 GB totalAlloc 137.72 GB
INFO dynacache: adjusted partitions' max size evicted 7235 numGC 79 limit 7.68 GB alloc 8.72 GB totalAlloc 142.06 GB
INFO dynacache: adjusted partitions' max size evicted 7577 numGC 80 limit 7.68 GB alloc 8.08 GB totalAlloc 145.92 GB
INFO dynacache: adjusted partitions' max size evicted 8250 numGC 81 limit 7.68 GB alloc 7.89 GB totalAlloc 149.91 GB
INFO dynacache: adjusted partitions' max size evicted 7581 numGC 82 limit 7.68 GB alloc 7.77 GB totalAlloc 153.98 GB
INFO dynacache: adjusted partitions' max size evicted 3157 numGC 84 limit 7.68 GB alloc 8.19 GB totalAlloc 162.63 GB
INFO dynacache: adjusted partitions' max size evicted 1492 numGC 85 limit 7.68 GB alloc 7.99 GB totalAlloc 166.28 GB
INFO dynacache: adjusted partitions' max size evicted 5796 numGC 86 limit 7.68 GB alloc 7.97 GB totalAlloc 167.16 GB
INFO dynacache: adjusted partitions' max size evicted 2363 numGC 86 limit 7.68 GB alloc 8.10 GB totalAlloc 170.69 GB
And here it just hangs. I'd expect it to fail somehow if it has run out of memory.
Also, if I reduce the limit, regular hugo build also stalls:
$ HUGO_MEMORYLIMIT=3 ~/go/bin/hugo --logLevel info
Start building sites …
hugo v0.124.0-DEV+extended linux/amd64 BuildDate=unknown
INFO static: syncing static files to / duration 146.63746ms
INFO build: step process substep collect files 37000 files_total 37000 duration 3.00951775s
INFO build: step process substep collect files 102422 files_total 139422 duration 2.852218241s
INFO build: step process duration 5.861893274s
INFO build: step assemble duration 4.53627309s
INFO dynacache: adjusted partitions' max size evicted 2282 numGC 26 limit 3.00 GB alloc 4.09 GB totalAlloc 7.56 GB
INFO dynacache: adjusted partitions' max size evicted 4109 numGC 27 limit 3.00 GB alloc 3.62 GB totalAlloc 8.89 GB
INFO dynacache: adjusted partitions' max size evicted 3740 numGC 27 limit 3.00 GB alloc 4.76 GB totalAlloc 10.04 GB
INFO dynacache: adjusted partitions' max size evicted 2188 numGC 30 limit 3.00 GB alloc 4.38 GB totalAlloc 16.54 GB
And here it stalled.
I suspect this is a deadlock with the cache mutex (or something), which isn't the easiest thing in the world to debug, but I guess I should be able to trigger this. I have tested this pretty heavily using 1GB limit on some very big sites, but I guess the devil is in the detail.
A temporary workaround, I guess, would be to set the memorylimit to something very high.
I have pushed 564bae06f6513cbca80bd54411f9a66ec2115995 -- which I don't think will fix your problem.
I have spent some time building some pretty beefy sites with HUGO_MEMORYLIMIT=1
, but I haven't been able to force this issue.
Note that I don't doubt that this is a real issue that needs to be fixed, but I need to look closer to reproduce it.
I notice from the log that the amount of memory you allocate is on a complete different level than the sites I'm testing on. Can you elaborate a little what all the memory is used for? Image processing? Do you have a image cache?
It is really hard for me to determine what hugo uses memory for. Yes, we have 2,5 thousands of images, we have around 12 thousands of pages (multiplied by 10 translations). Most of images have previews generated, many of them are additionally resized and dumped as both jpeg and webp, base language build generates 25 thousands images.
Do you have a image cache?
Do you mean resources/_gen
? If so, then yes.
I have pushed 564bae0 -- which I don't think will fix your problem.
It does not, hugo still stalls on master.
We start seeing stalls in our cloud builds (we have hugo 0.123.2 there at this point). We have the same memory limits we had for 122, that is HUGO_MEMORYLIMIT is 22. Hugo during the build do not exceed 11Gb. But we see random unpredictable stalls. I can't reproduce anything like this on my machine, but we have this problem in cloud.
Example build:
time=2024-02-27T11:24:30Z | msg="Start building sites … "
time=2024-02-27T11:24:30Z | msg="hugo v0.123.2-929b91fe75cb0d041f22b4707700dfc117115ad4+extended linux/amd64 BuildDate=2024-02-22T15:27:15Z VendorInfo=gohugoio"
time=2024-02-27T11:24:30Z | msg="INFO static: syncing static files to /"
time=2024-02-27T11:24:30Z | msg=" duration 141.026102ms"
time=2024-02-27T11:24:32Z | msg="INFO dynacache: adjusted partitions' max size evicted 4537 numGC 16 limit 22.00 GB alloc 0.17 GB totalAlloc 0.37 GB"
time=2024-02-27T11:24:33Z | msg="INFO build: step process substep collect files 41000 files_total 41000 duration 3.014767052s"
time=2024-02-27T11:24:36Z | msg="INFO build: step process substep collect files 38000 files_total 79000 duration 3.004260141s"
time=2024-02-27T11:24:39Z | msg="INFO build: step process substep collect files 33000 files_total 112000 duration 3.028676451s"
time=2024-02-27T11:24:42Z | msg="INFO build: step process substep collect files 27870 files_total 139870 duration 2.447407666sINFO build: step process duration 11.495399175s"
time=2024-02-27T11:24:49Z | msg="INFO build: step assemble duration 7.466807118s"
time=2024-02-27T11:27:08Z | msg="INFO build: step render substep pages site en outputFormat html duration 2m18.177601878s"
time=2024-02-27T11:27:08Z | msg="INFO build: step render substep pages site en outputFormat csv duration 333.367446ms"
time=2024-02-27T11:27:12Z | msg="INFO build: step render substep pages site en outputFormat json duration 3.168703346s"
time=2024-02-27T11:27:13Z | msg="INFO build: step render substep pages site en outputFormat rss duration 1.284847777s"
time=2024-02-27T11:29:16Z | msg="INFO build: step render substep pages site es outputFormat html duration 2m3.000824066s"
time=2024-02-27T11:29:17Z | msg="INFO build: step render substep pages site es outputFormat csv duration 313.333325ms"
time=2024-02-27T11:29:19Z | msg="INFO build: step render substep pages site es outputFormat json duration 1.968936201s"
time=2024-02-27T11:29:20Z | msg="INFO build: step render substep pages site es outputFormat rss duration 818.80005ms"
time=2024-02-27T11:31:18Z | msg="INFO build: step render substep pages site de outputFormat html duration 1m58.147270281s"
time=2024-02-27T11:31:19Z | msg="INFO build: step render substep pages site de outputFormat csv duration 333.32989ms"
time=2024-02-27T11:31:21Z | msg="INFO build: step render substep pages site de outputFormat json duration 2.148809437s"
time=2024-02-27T11:31:22Z | msg="INFO build: step render substep pages site de outputFormat rss duration 781.219217ms"
time=2024-02-27T11:33:18Z | msg="INFO build: step render substep pages site ru outputFormat html duration 1m55.313464896s"
time=2024-02-27T11:33:18Z | msg="INFO build: step render substep pages site ru outputFormat csv duration 319.735755ms"
time=2024-02-27T11:33:20Z | msg="INFO build: step render substep pages site ru outputFormat json duration 1.988445374s"
time=2024-02-27T11:33:21Z | msg="INFO build: step render substep pages site ru outputFormat rss duration 778.596956ms"
And here it stalled, even though there were no INFO messages with dynacache GC runs. So, it seems that stalls are happening not only in extreme memory conditions, but randomly anytime GC is run.
So, it seems that stalls are happening not only in extreme memory conditions, but randomly anytime GC is run.
So, it's most likely a mutex deadlock, but it's not to me inspecting the code, and I need to reproduce it to understand and fix it. But I will try to spend some time on this today.
@TiGR what would be useful is, if you could do a kill -SIGABRT <pid>
on a stuck process and post the stack trace (which I suspect would be long), in a comment here or something.
OK, I think I have reproduced this building a beefy Hugo site on a Ubuntu VM with only 1GB of memory (the memory limit set to 256 MB).
I will start by making it possible to set HUGO_MEMORYLIMIT
to something other than gigabytes, so I can hopefully reproduce this on my main box (which would make it much easier to debug).
Strike that, it actually completed, just took very long (because of image processing, I guess, faster second run with the cache).
Another reliable way to reproduce stalls is by building big website with HUGO_NUMWORKERMUPTIPLIER=32
@TiGR what would be useful is, if you could do a
kill -SIGABRT <pid>
on a stuck process and post the stack trace (which I suspect would be long), in a comment here or something.
The issue is still there in hugo v0.124.
@TiGR sorry, I have been a little busy with other things. I see the deadlock in your stack trace and even if I don't have a failing test for it, the fix looks obvious, so I'm going to apply a fix for that later today. I'm boldly closing this issue from that PR, but if it is still a problem, please shout.
This issue has been automatically locked since there has not been any recent activity after it was closed. Please open a new issue for related bugs.