prometheus-haskell
Excessive GC usage for `Summary` metric in wai-prometheus-middleware
We have a WAI application in production that serves only around 5 req/s, yet we had serious performance issues because the server was spending roughly 95% of its time in GC.
Before
You can see that almost all clock cycles are spent on garbage collection, and that collections happen very frequently.

Profiler output
After profiling, the culprit turned out to be the summaries that are kept for each request in wai-prometheus-middleware.
Tue Jul 4 11:17 2017 Time and Allocation Profiling Report (Final)
warpmachine +RTS -N -T -p -RTS --proxy-host localhost --proxy-port 8000
total time = 17.73 secs (17725 ticks @ 1000 us, 1 processor)
total alloc = 10,233,152,016 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
invariant.fj Prometheus.Metric.Summary src/Prometheus/Metric/Summary.hs:(197,9)-(198,58) 17.9 13.1
wrapCompile.\ Text.Regex.Posix.Wrap Text/Regex/Posix/Wrap.hsc:(480,49)-(484,42) 7.9 0.0
throwSocketErrorIfMinus1RetryMayBlock Network.Socket.Internal Network/Socket/Internal.hsc:(161,1)-(162,53) 6.2 0.6
getAddrInfo.\.\.\.\ Network.Socket Network/Socket.hsc:(1460,36)-(1470,43) 5.4 0.0
connect.\.\.connectLoop Network.Socket Network/Socket.hsc:(444,9)-(458,29) 4.7 0.0
throwSocketErrorIfMinus1_ Network.Socket.Internal Network/Socket/Internal.hsc:166:1-47 3.0 0.0
invariant Prometheus.Metric.Summary src/Prometheus/Metric/Summary.hs:(193,1)-(198,58) 2.4 6.0
wrapTest.\ Text.Regex.Posix.Wrap Text/Regex/Posix/Wrap.hsc:(489,45)-(495,40) 1.9 0.0
compress.compressPair.inv Prometheus.Metric.Summary src/Prometheus/Metric/Summary.hs:172:17-60 1.8 5.1
connect.\.\.connectBlocked Network.Socket Network/Socket.hsc:(460,9)-(465,67) 1.6 0.4
compress Prometheus.Metric.Summary src/Prometheus/Metric/Summary.hs:(156,1)-(172,60) 1.3 3.9
decodeWithTable/fill Data.ByteString.Base64.Internal Data/ByteString/Base64/Internal.hs:(171,62)-(194,68) 1.1 2.3
encode/fill Data.ByteString.Base64.Internal Data/ByteString/Base64/Internal.hs:(61,55)-(69,52) 1.1 2.6
urlDecode Network.HTTP.Types.URI Network/HTTP/Types/URI.hs:(209,1)-(228,34) 1.0 2.3
peek8_32 Data.ByteString.Base64.Internal Data/ByteString/Base64/Internal.hs:43:1-36 0.9 2.8
decodeWithTable.\.\.look Data.ByteString.Base64.Internal Data/ByteString/Base64/Internal.hs:(165,11)-(168,49) 0.8 2.8
asCString Text.Regex.Posix.ByteString Text/Regex/Posix/ByteString.hs:(150,1)-(152,40) 0.5 1.1
memConstEqual.loop Data.Memory.PtrMethods Data/Memory/PtrMethods.hs:(103,5)-(107,34) 0.5 1.7
toByteString Blaze.ByteString.Builder Blaze/ByteString/Builder.hs:151:1-46 0.3 1.4
urlDecode.go Network.HTTP.Types.URI Network/HTTP/Types/URI.hs:(211,5)-(221,40) 0.2 1.5
toByteStringIOWith.getBuffer Blaze.ByteString.Builder Blaze/ByteString/Builder.hs:(187,5)-(204,40) 0.2 10.8
toByteStringIOWith Blaze.ByteString.Builder Blaze/ByteString/Builder.hs:(184,1)-(204,40) 0.2 10.7
recv Network.Socket.ByteString Network/Socket/ByteString.hsc:(226,1)-(231,62) 0.1 3.3
After disabling wai-prometheus-middleware
We commented out the middleware in our application: instead of using 400% CPU we're back to 2% CPU, and the GC fraction dropped from 1.0 to 0.02. See the graph below: the left half is before removing the middleware, the right half is after removing it.
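For context, toggling the middleware is a one-line change. Here is a minimal sketch of the kind of setup we're testing, assuming the usual `prometheus def` entry point from Network.Wai.Middleware.Prometheus (the handler below is a placeholder, not our real application):

```haskell
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Default (def)
import Network.HTTP.Types (status200)
import Network.Wai (Application, responseLBS)
import Network.Wai.Handler.Warp (run)
import Network.Wai.Middleware.Prometheus (prometheus)

-- Placeholder handler standing in for the real application.
app :: Application
app _req respond = respond (responseLBS status200 [] "ok")

main :: IO ()
main =
  -- With the middleware: every request feeds the per-request latency metric.
  run 8080 (prometheus def app)
  -- Without the middleware (the "after" measurement): run 8080 app
```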

I actually see now that you moved away from Summary to Histogram in the newest release! https://github.com/fimad/prometheus-haskell/commit/8d8a8041f229d83afb6ad58d48c947c230ec50f2
I'll check if this fixes our performance issues.
However, it does suggest that something is seriously wrong with the Summary metric: it creates and releases a lot of thunks in a short time, which sends the GC into overdrive.
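For anyone unfamiliar with the failure mode, it is the same pattern you get from any lazy accumulator in a hot loop. A generic illustration (this is not the Summary code, just a sketch of thunk build-up):

```haskell
import Data.List (foldl')

-- A lazy left fold retains a chain of (+) thunks for the whole input;
-- the strict variant forces the accumulator at every step instead.
lazySum, strictSum :: [Double] -> Double
lazySum   = foldl  (+) 0
strictSum = foldl' (+) 0

main :: IO ()
main = do
  let xs = [1 .. 1000000] :: [Double]
  -- Compiled without optimisation, +RTS -s shows the extra allocation
  -- and GC time the lazy version causes.
  print (lazySum xs)
  print (strictSum xs)
```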
Did you get a chance to check if Histogram solved the GC issues?
Given that we've moved the automatically generated metrics to histograms, and given the other recently reported problems with Summary (#19, #14), I think it would make sense to deprecate or remove the summary metric type.
The impact of the per-request metrics is much smaller with histograms than with summaries. Below are some measurements we made last week. I think this one can be closed (CC @arianvp):
Without per-request metrics
Mon Jul 10 10:58 2017 Time and Allocation Profiling Report (Final)
warpmachine +RTS -N -T -I0 -qg -A128m -p -RTS --proxy-host localhost --proxy-port 8000 --log-disabled
total time = 2.03 secs (2033 ticks @ 1000 us, 1 processor)
total alloc = 441,122,656 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
throwSocketErrorIfMinus1_ Network.Socket.Internal Network/Socket/Internal.hsc:166:1-47 16.3 0.1
MAIN MAIN <built-in> 15.0 8.6
throwSocketErrorIfMinus1RetryMayBlock Network.Socket.Internal Network/Socket/Internal.hsc:(161,1)-(162,53) 14.8 3.4
receiveloop Network.Wai.Handler.Warp.Recv Network/Wai/Handler/Warp/Recv.hs:(115,1)-(129,20) 4.6 2.4
socketConnection Network.Wai.Handler.Warp.Run Network/Wai/Handler/Warp/Run.hs:(59,1)-(73,7) 3.6 1.4
close.\ Network.Socket Network/Socket.hsc:(1119,4)-(1124,80) 2.6 2.1
receive Network.Wai.Handler.Warp.Recv Network/Wai/Handler/Warp/Recv.hs:(96,1)-(99,48) 1.7 1.7
parseRequestLine.\ Network.Wai.Handler.Warp.RequestHeader Network/Wai/Handler/Warp/RequestHeader.hs:(60,80)-(90,33) 1.5 0.6
wrapTest.\ Text.Regex.Posix.Wrap Text/Regex/Posix/Wrap.hsc:(489,45)-(495,40) 1.4 0.0
serveConnection Network.Wai.Handler.Warp.Run Network/Wai/Handler/Warp/Run.hs:(324,1)-(482,47) 1.3 2.4
sendResponse Network.Wai.Handler.Warp.Response Network/Wai/Handler/Warp/Response.hs:(138,1)-(186,35) 1.3 1.5
serveConnection.processRequest Network.Wai.Handler.Warp.Run Network/Wai/Handler/Warp/Run.hs:(426,5)-(482,47) 1.1 0.6
asCString Text.Regex.Posix.ByteString Text/Regex/Posix/ByteString.hs:(150,1)-(152,40) 1.0 5.7
foldCaseBS.toLower8' Data.CaseInsensitive.Internal Data/CaseInsensitive/Internal.hs:(190,7)-(194,37) 0.9 6.3
push Network.Wai.Handler.Warp.Request Network/Wai/Handler/Warp/Request.hs:(230,1)-(291,48) 0.8 4.4
serveConnection.http1 Network.Wai.Handler.Warp.Run Network/Wai/Handler/Warp/Run.hs:(405,5)-(424,96) 0.8 2.3
defaultSettings Network.Wai.Handler.Warp.Settings Network/Wai/Handler/Warp/Settings.hs:(126,1)-(149,5) 0.8 4.3
recvRequest Network.Wai.Handler.Warp.Request Network/Wai/Handler/Warp/Request.hs:(68,1)-(106,54) 0.8 3.3
push.mnl Network.Wai.Handler.Warp.Request Network/Wai/Handler/Warp/Request.hs:(237,5)-(249,28) 0.6 4.4
traverseHeader Network.Wai.Handler.Warp.Header Network/Wai/Handler/Warp/Header.hs:(89,1)-(98,26) 0.5 1.6
parseHeader Network.Wai.Handler.Warp.RequestHeader Network/Wai/Handler/Warp/RequestHeader.hs:(131,1)-(134,24) 0.5 1.7
parseHeader.(...) Network.Wai.Handler.Warp.RequestHeader Network/Wai/Handler/Warp/RequestHeader.hs:132:9-37 0.4 1.9
acceptConnection.acceptLoop Network.Wai.Handler.Warp.Run Network/Wai/Handler/Warp/Run.hs:(226,5)-(242,26) 0.4 1.5
foldCaseBS Data.CaseInsensitive.Internal Data/CaseInsensitive/Internal.hs:(187,1)-(194,37) 0.2 2.5
composeHeader Network.Wai.Handler.Warp.ResponseHeader Network/Wai/Handler/Warp/ResponseHeader.hs:(21,1)-(28,45) 0.2 1.2
parseHeaderLines.hdr Network.Wai.Handler.Warp.RequestHeader Network/Wai/Handler/Warp/RequestHeader.hs:38:9-40 0.2 2.0
With per-request metrics (version 0.1.1, using summaries)
Mon Jul 10 11:01 2017 Time and Allocation Profiling Report (Final)
warpmachine +RTS -N -T -I0 -qg -A128m -p -RTS --proxy-host localhost --proxy-port 8000 --log-disabled
total time = 8.08 secs (8082 ticks @ 1000 us, 1 processor)
total alloc = 4,762,860,168 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
invariant.fj Prometheus.Metric.Summary src/Prometheus/Metric/Summary.hs:(197,9)-(198,58) 57.2 42.3
invariant Prometheus.Metric.Summary src/Prometheus/Metric/Summary.hs:(193,1)-(198,58) 7.5 18.0
compress.compressPair.inv Prometheus.Metric.Summary src/Prometheus/Metric/Summary.hs:172:17-60 5.4 15.4
throwSocketErrorIfMinus1RetryMayBlock Network.Socket.Internal Network/Socket/Internal.hsc:(161,1)-(162,53) 3.9 0.3
compress Prometheus.Metric.Summary src/Prometheus/Metric/Summary.hs:(156,1)-(172,60) 3.7 10.9
throwSocketErrorIfMinus1_ Network.Socket.Internal Network/Socket/Internal.hsc:166:1-47 3.6 0.0
MAIN MAIN <built-in> 3.3 0.7
receiveloop Network.Wai.Handler.Warp.Recv Network/Wai/Handler/Warp/Recv.hs:(115,1)-(129,20) 1.2 0.2
compress.compressPair Prometheus.Metric.Summary src/Prometheus/Metric/Summary.hs:(166,9)-(172,60) 0.9 1.2
invariant.n Prometheus.Metric.Summary src/Prometheus/Metric/Summary.hs:196:9-30 0.4 2.0
With per-request metrics (version 0.2.0, using histograms)
Mon Jul 10 10:35 2017 Time and Allocation Profiling Report (Final)
warpmachine +RTS -N -T -I0 -qg -A128m -p -RTS --proxy-host localhost --proxy-port 8000
total time = 1.81 secs (1815 ticks @ 1000 us, 1 processor)
total alloc = 92,676,640 bytes (excludes profiling overheads)
COST CENTRE MODULE SRC %time %alloc
MAIN MAIN <built-in> 13.9 5.2
throwSocketErrorIfMinus1RetryMayBlock Network.Socket.Internal Network/Socket/Internal.hsc:(161,1)-(162,53) 13.4 2.0
throwSocketErrorIfMinus1_ Network.Socket.Internal Network/Socket/Internal.hsc:166:1-47 8.3 0.1
mkRequestLogger.(...) Network.Wai.Middleware.RequestLogger Network/Wai/Middleware/RequestLogger.hs:(93,9)-(97,44) 5.1 1.4
receiveloop Network.Wai.Handler.Warp.Recv Network/Wai/Handler/Warp/Recv.hs:(115,1)-(129,20) 3.6 1.5
allocaAndPeek.\ System.Clock System/Clock.hsc:166:36-52 2.7 0.0
observeSeconds.latency Network.Wai.Middleware.Prometheus src/Network/Wai/Middleware/Prometheus.hs:94:9-96 1.9 1.4
close.\ Network.Socket Network/Socket.hsc:(1119,4)-(1124,80) 1.7 1.3
apacheLogStr Network.Wai.Logger.Apache Network/Wai/Logger/Apache.hs:(45,1)-(70,36) 1.6 1.5
receive Network.Wai.Handler.Warp.Recv Network/Wai/Handler/Warp/Recv.hs:(96,1)-(99,48) 1.5 1.0
toLogStr System.Log.FastLogger.LogStr System/Log/FastLogger/LogStr.hs:75:5-54 1.2 1.9
asCString Text.Regex.Posix.ByteString Text/Regex/Posix/ByteString.hs:(150,1)-(152,40) 1.1 3.4
wrapTest.\ Text.Regex.Posix.Wrap Text/Regex/Posix/Wrap.hsc:(489,45)-(495,40) 1.1 0.0
fromBuilder System.Log.FastLogger.LogStr System/Log/FastLogger/LogStr.hs:48:1-46 1.1 9.8
serveConnection.http1 Network.Wai.Handler.Warp.Run Network/Wai/Handler/Warp/Run.hs:(405,5)-(424,96) 1.0 1.4
mappend System.Log.FastLogger.LogStr System/Log/FastLogger/LogStr.hs:64:5-69 1.0 4.9
serveConnection Network.Wai.Handler.Warp.Run Network/Wai/Handler/Warp/Run.hs:(324,1)-(482,47) 0.8 1.4
defaultSettings Network.Wai.Handler.Warp.Settings Network/Wai/Handler/Warp/Settings.hs:(126,1)-(149,5) 0.8 2.5
recvRequest Network.Wai.Handler.Warp.Request Network/Wai/Handler/Warp/Request.hs:(68,1)-(106,54) 0.8 2.0
encodeUtf8Builder.\ Data.Text.Lazy.Encoding Data/Text/Lazy/Encoding.hs:164:26-59 0.7 1.2
push Network.Wai.Handler.Warp.Request Network/Wai/Handler/Warp/Request.hs:(230,1)-(291,48) 0.7 2.6
toLogStr System.Log.FastLogger.LogStr System/Log/FastLogger/LogStr.hs:79:5-33 0.7 2.4
encodeUtf8 Data.Text.Lazy.Encoding Data/Text/Lazy/Encoding.hs:(150,1)-(160,70) 0.6 2.4
foldCaseBS Data.CaseInsensitive.Internal Data/CaseInsensitive/Internal.hs:(187,1)-(194,37) 0.4 1.5
foldCaseBS.toLower8' Data.CaseInsensitive.Internal Data/CaseInsensitive/Internal.hs:(190,7)-(194,37) 0.4 3.7
showIPv4 Network.Wai.Logger.IP Network/Wai/Logger/IP.hs:(15,1)-(26,26) 0.4 2.0
parseHeaderLines.hdr Network.Wai.Handler.Warp.RequestHeader Network/Wai/Handler/Warp/RequestHeader.hs:38:9-40 0.3 1.2
parseHeader.(...) Network.Wai.Handler.Warp.RequestHeader Network/Wai/Handler/Warp/RequestHeader.hs:132:9-37 0.3 1.1
push.mnl Network.Wai.Handler.Warp.Request Network/Wai/Handler/Warp/Request.hs:(237,5)-(249,28) 0.3 2.6
parseHeader Network.Wai.Handler.Warp.RequestHeader Network/Wai/Handler/Warp/RequestHeader.hs:(131,1)-(134,24) 0.2 1.0
show Network.HTTP.Types.Version Network/HTTP/Types/Version.hs:24:5-79 0.1 1.5
encodeUtf8.strategy Data.Text.Lazy.Encoding Data/Text/Lazy/Encoding.hs:160:5-70 0.1 1.1
@fimad it seems like a good idea to deprecate Summary. As you can see, moving away from it fixed our issues :)
To add another datapoint to the discussion, I'm working on a new library (axman6/servant-prometheus) which is a rewrite of the servant-ekg package. Once I added summaries to the metrics for each endpoint, the benchmarks went from handling 40k+ req/sec to ~5k.
I have been looking at the summary code and believe the main culprit is the use of Rationals. I did a quick test of replacing all Rationals with Doubles and performance jumped back to about 38k req/sec, which seems like an acceptable overhead for most applications (the benchmark app simply returns a string from its endpoint, so the percentage overhead from monitoring is going to be a lot higher than in real-world apps).
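To make the Rational cost concrete, here is a toy example (not the library's code) of what happens when observations are accumulated as Rationals: every addition normalises a pair of arbitrary-precision Integers, and the intermediate denominators grow quickly:

```haskell
import Data.Ratio (denominator)

main :: IO ()
main = do
  -- Pretend these are 50 latency observations folded into a running Rational.
  let s = sum (map (\i -> 1 / fromIntegral i) [1 .. 50 :: Int]) :: Rational
  -- The denominator is already a large Integer, so every further (+)
  -- pays for big-Integer multiplication plus a GCD; with Double each
  -- addition is a single machine operation.
  print (denominator s)
  print (length (show (denominator s)))  -- number of digits in the denominator
```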
Is there a particularly good justification for using Rationals? I think summaries are a very useful metric, since they don't require an arbitrary a priori decision about which histogram bins make sense for a given application. I would like to see both summaries and histograms kept because they serve different purposes, but at the moment the overhead is too great. For now I've provided a way in servant-prometheus to disable summary calculation per application, but it would be nice not to have to.
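To spell out the trade-off I mean (a self-contained illustration, not the prometheus-client API): a histogram needs its bucket boundaries chosen up front, while a summary estimates quantiles directly from the observations, at the cost of more bookkeeping per observation:

```haskell
import Data.List (sort)

-- Histogram side: cumulative counts per predefined upper bound.
bucketCounts :: [Double] -> [Double] -> [(Double, Int)]
bucketCounts uppers xs = [ (u, length (filter (<= u) xs)) | u <- uppers ]

-- Summary side: a naive exact quantile over all observations. The library
-- uses a bounded-memory streaming estimator instead, but the point stands:
-- no bucket boundaries have to be chosen in advance.
quantile :: Double -> [Double] -> Double
quantile q xs = sort xs !! min (length xs - 1) (floor (q * fromIntegral (length xs)))

main :: IO ()
main = do
  let latencies = [0.003, 0.012, 0.007, 0.150, 0.009, 0.004] :: [Double]
  print (bucketCounts [0.005, 0.01, 0.05, 0.1] latencies)  -- needs a priori buckets
  print (quantile 0.9 latencies)                           -- no buckets needed
```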
I think the original reason I used Rationals was that, while testing, I was having trouble getting the implementation to maintain the invariants the algorithm is supposed to, and I wanted to eliminate rounding errors as a possible cause.
Given the stark difference in performance, I think switching to Doubles would be reasonable.
A possible middle ground might be using something like Ed Kmett's compensated library to get double-doubles: this should be fast while still giving the accuracy you probably really want for this sort of metric.
We've recently had some issues with higher than expected CPU load on a couple of our servers.
After reproducing the problem locally and profiling with GHC, our two largest cost centers are:
COST CENTRE MODULE SRC %time %alloc
compress Prometheus.Metric.Summary src/Prometheus/Metric/Summary.hs:(153,1)-(169,60) 51.1 79.6
invariant Prometheus.Metric.Summary src/Prometheus/Metric/Summary.hs:(190,1)-(195,58) 10.4 3.6
We're currently primarily collecting summary metrics, so it seems like it's related to this old ticket. We can transition more metrics to histograms instead, but it would be nice if there's a simple way to improve the performance of summary metrics too.
@fimad or @axman6 did either of you ever have success transitioning from Rational to Double? I played around with it a little bit but I didn't see a noticeable change to the existing benchmark times. It also seems like it causes the tests to fail in the way @fimad mentioned.
I'd be happy to dive in here more and open a PR if anyone has ideas that can point me in the right direction.
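For my own experiments, a minimal criterion harness like the following (our own sketch, not the package's benchmark suite) isolates the Rational vs Double arithmetic cost from the rest of the summary code:

```haskell
import Criterion.Main (bench, defaultMain, nf)
import Data.List (foldl')

-- A streaming-style fold representative of per-observation arithmetic.
accumulate :: Fractional a => Int -> a
accumulate n = foldl' (\acc i -> acc + 1 / fromIntegral i) 0 [1 .. n]

main :: IO ()
main = defaultMain
  [ bench "accumulate/Double"   $ nf (accumulate :: Int -> Double)   1000
  , bench "accumulate/Rational" $ nf (accumulate :: Int -> Rational) 1000
  ]
```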
This should be fixed as of #62, which is available in version 1.1.0, thanks to @iand675.