hackage-server Slow HTTP responses when the cache is updating

I have an internal hackage-server running, which is deployed with a http based health checker that does a GET request to / to see whether it's up.

The hackage-server was periodically getting marked as down and looking at the logs it seems this is because it's taking too long to respond while it's doing the cache updates. We could start to use a health checker that doesn't use http, but really it shouldn't be taking this long to respond in the first place. Logs look like this:

127.0.0.1 - [10/Oct/2022:14:18:54 +0000] "GET /00-index.tar.gz 1.1" 301 -1 "" "cabal-install/3.2.0.0 (linux; x86_64)"
127.0.0.1 - [10/Oct/2022:14:18:54 +0000] "GET /packages/index.tar.gz 1.1" 304 -1 "" "cabal-install/3.2.0.0 (linux; x86_64)"
127.0.0.1 - [10/Oct/2022:14:18:54 +0000] "GET /packages/docs.json 1.1" 200 -1 "" "hackage-build/0.5.1"
127.0.0.1 - [10/Oct/2022:14:18:55 +0000] "GET /packages/candidates/docs.json 1.1" 200 -1 "" "hackage-build/0.5.1"
127.0.0.1 - [10/Oct/2022:14:19:01 +0000] "GET / 1.0" 200 -1 "" "bdns_healthchecker/2022.922.234"
Cache 'packages index page (by category)' updated (item update). time: 12.955302227s
Cache 'packages index page (by name)' updated (item update). time: 0.81750265s
Cache 'browse packages' updated (item update). time: 12.250529765s
127.0.0.1 - [10/Oct/2022:14:19:33 +0000] "GET / 1.0" 200 -1 "" "healthchecker"
HTTP request failed with: Network.Socket.sendBuf: resource vanished (Broken pipe)
127.0.0.1 - [10/Oct/2022:14:19:33 +0000] "GET / 1.0" 200 -1 "" "healthchecker"
127.0.0.1 - [10/Oct/2022:14:19:33 +0000] "GET / 1.0" 200 -1 "" "healthchecker"
HTTP request failed with: Network.Socket.sendBuf: resource vanished (Broken pipe)
127.0.0.1 - [10/Oct/2022:14:19:41 +0000] "GET / 1.0" 200 -1 "" "healthchecker"

This seems to be reproducable locally on master, once the server is mirroring enough packages, by just periodically running a curl request with a 0.2s timeout. Given it is a local server, 0.2s should be enough, and by the time it's mirroring the whole of hackage I imagine it starts to get similarly long delays.

Having a look at the code that's causing this, it seems that the AsyncCache is not updating as asynchronously as you'd hope, or something else is blocking waiting on this result. Given that the cache update should be running in separate forkIO created thread I would guess the latter. Happy to have a go at trying to fix this, but I am not entirely sure the exact cause of the issue so may need a pointer.

Oct 10 '22 16:10 RaoulHC

Shot in the dark, but are you running with enough OS capabilities allocated?

Oct 10 '22 22:10 gbaz

It's running on a more than powerful enough server if that's what you mean. Even if that was the reason I don't think the cache update should be blocking these get requests.

Oct 10 '22 23:10 RaoulHC

I mean -- are you passing it an argument that allocates it multiple cores, or is it linked against the concurrent runtime but only on one OS core?

Oct 11 '22 00:10 gbaz

Ah in which case yes, this is build with the default cabal options so should be built with -threaded -rtsopts -with-rtsopts=-I00 (plus the various options for the library compilation), and I can see in process manager that hackage-server is getting 8 OS threads.

Oct 11 '22 09:10 RaoulHC

hackage-server hackage-server copied to clipboard

Slow HTTP responses when the cache is updating

hackage-server
hackage-server copied to clipboard