servant
compile time quadratic in number of routes, with -O2
I've been digging a bit to understand where time is spent while building my $WORK codebase, and found this interesting behavior. I'm hoping others interested in investigating the root cause will discuss here. I'm open to changes in application code, servant, or GHC, if we can narrow the cause enough.
The small examples I'm using for benchmarking are at https://github.com/bergey/studious-telegram/tree/trunk/servant
I'm timing with `bench`, invoked as `bench 'stack clean && stack build'`. I'm running on a GCP VM with nothing else running on it.
With `stack build --fast` (which I believe passes -O0 to GHC), the build time doesn't change dramatically as the number of HTTP routes increases. Without `--fast`:
number of routes | time (seconds) |
---|---|
1 | 4 |
8 | 4.8 |
16 | 7 |
32 | 16.3 |
64 | 74 |
The only mitigation I've found so far is testing as much as possible in GHCi or with `--fast`. This may be related to #870; I have not yet tested with anything except stackage-11.14, which has GHC-8.2.2 and servant-0.13.0.1.
I think we've seen a couple of similar issues. A rather efficient workaround is to split up the API and servers into several modules (which most people tend to do naturally anyway, when web apps become that large), as things suddenly become a lot less overwhelming for GHC.
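To make that workaround concrete, here is a minimal sketch of what splitting might look like. The module and route names (`SplitExample`, `UsersAPI`, `ItemsAPI`) are made up for illustration; in a real project each sub-API and its server would live in its own file, so GHC only ever optimises a small `:<|>` chain per module:

```haskell
{-# LANGUAGE DataKinds, TypeOperators #-}
-- Illustrative only: UsersAPI/ItemsAPI would each live in their own module.
module SplitExample where

import Data.Proxy
import Network.Wai (Application)
import Servant

type UsersAPI = "users" :> Get '[JSON] [String]
type ItemsAPI = "items" :> Get '[JSON] [String]

usersServer :: Server UsersAPI
usersServer = return ["alice", "bob"]

itemsServer :: Server ItemsAPI
itemsServer = return ["widget"]

-- Only the top-level module combines the pieces.
type API = UsersAPI :<|> ItemsAPI

app :: Application
app = serve (Proxy :: Proxy API) (usersServer :<|> itemsServer)
```

The idea being that each sub-server gets its own, much smaller `HasServer` instance for GHC to specialise.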
But yes, any information would be useful. If you can try other servant & GHC versions, that would be lovely.
This is, however, definitely annoying, and I would personally be delighted to see any improvement there, whether it comes from servant's code putting less pressure on GHC, or from GHC dealing with this type of code more efficiently. The first thing to do, though, for anyone interested in investigating this and figuring out what's going on and where all that time is spent, would be to run a profiling-enabled GHC over your module.
I tested with a few other versions.
LTS-9.21 - servant-0.11, GHC-8.0.1
LTS-6.35 - servant-0.7.1, GHC-7.10.3
I'd like to try other servant / ghc combinations, but these were the fastest to test.
number of routes | LTS-6.35 | LTS-9.21 | LTS-11.14 |
---|---|---|---|
1 | 4.1 | 5.1 | 4.04 |
8 | 4.4 | 5.1 | 4.86 |
16 | 4.6 | 5.6 | 7.02 |
32 | 6.7 | 6.7 | 16.34 |
64 | 13.5 | 11.3 | 74 |
128 | 60 | 44.4 | |
So it's been quadratic for some time, but it seems as though it got substantially worse either in GHC 8.2 or since servant-0.13. I'll take a look at which next time I work on this, and also run the profiling-enabled GHC.
My production, non-benchmark code base does indeed split the routes across modules. I tried to measure some difference from that earlier today, but perhaps I didn't have enough routes to see an effect. It's good to hear that splitting is helpful.
cabal new-build servant-bench8 6,25s user 0,80s system 100% cpu 6,999 total
cabal new-build servant-bench16 7,23s user 0,88s system 100% cpu 8,089 total
cabal new-build servant-bench32 12,14s user 0,77s system 100% cpu 12,870 total
cabal new-build servant-bench64 43,82s user 1,22s system 99% cpu 45,050 total
Building with `-ddump-to-file -ddump-simpl-stats`:

For 16 routes:
1770 UnfoldingDone
287 Servant.Server.Internal.$fHasServerTYPE:<|>context_$choistServerWithContext
263 $dHasServer_ahv6

For 32 routes:
5282 UnfoldingDone
1335 Servant.Server.Internal.$fHasServerTYPE:<|>context_$choistServerWithContext
783 $dHasServer_ahxq

For 64 routes:
17682 UnfoldingDone
5735 Servant.Server.Internal.$fHasServerTYPE:<|>context_$choistServerWithContext
2591 $dHasServer_ahC4
hoist? As we inline `HasServer`, `hoistServerWithContext`s are generated, though we don't use them. GHC is silly here.
This reminds me of https://ghc.haskell.org/trac/ghc/ticket/13253, particularly something like: https://ghc.haskell.org/trac/ghc/ticket/13253#comment:22
There's also a patch in https://ghc.haskell.org/trac/ghc/ticket/13253#comment:24; perhaps @alpmestan could try applying it on top of 8.4.3 (which also has the same behaviour) and see if it helps here?
This is still a problem in GHC 8.6.3.
This example takes me 60 sec to build with only 20 routes. Running `client` against the routes individually reduces that to 37 sec (as we still pass `serve` the entire API).
I looked a bit more into this. I generated a few modules that look like:
{-# LANGUAGE DataKinds, TypeOperators, TypeFamilies #-}
module API_2 where
import Data.Proxy
import Servant
type API = "1" :> Get '[JSON] String :<|> "2" :> Get '[JSON] String
api = Proxy :: Proxy API
server :: Server API
server = return "1" :<|> return "2"
app = serve api server
where I'm varying the number of endpoints from one module to another. I have APIs with 1, 2, 3, 4, 5, 10, 20, 30, 40, 50 endpoints. And I run this little script:
#! /usr/bin/env sh
rm -f API_*
for i in 1 2 3 4 5 10 20 30 40 50; do
runghc gen_api.hs $i
time ghc -c --make API_$i.hs -o API_$i.o \
-freduction-depth=0 -O -ddump-timings -v2 \
-ddump-simpl -ddump-simpl-stats -ddump-to-file \
-dsuppress-all \
+RTS -s -RTS |& tee API_$i.log.txt
done
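The `gen_api.hs` generator itself isn't shown here; a rough sketch of what such a generator might look like, assuming it writes `API_<n>.hs` with n endpoints in the shape of the module above:

```haskell
-- gen_api.hs (sketch): write API_<n>.hs with n numbered endpoints.
import Data.List (intercalate)
import System.Environment (getArgs)

main :: IO ()
main = do
  [nStr] <- getArgs
  let n     = read nStr :: Int
      names = map show [1 .. n]
      apiTy = intercalate " :<|> "
                [ "\"" ++ s ++ "\" :> Get '[JSON] String" | s <- names ]
      srv   = intercalate " :<|> "
                [ "return \"" ++ s ++ "\"" | s <- names ]
  writeFile ("API_" ++ nStr ++ ".hs") $ unlines
    [ "{-# LANGUAGE DataKinds, TypeOperators, TypeFamilies #-}"
    , "module API_" ++ nStr ++ " where"
    , "import Data.Proxy"
    , "import Servant"
    , "type API = " ++ apiTy
    , "api = Proxy :: Proxy API"
    , "server :: Server API"
    , "server = " ++ srv
    , "app = serve api server"
    ]
```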
# of endpoints | Compile time (s) |
---|---|
1 | 2.101 |
2 | 2.321 |
3 | 3.301 |
4 | 3.591 |
5 | 4.371 |
10 | 7.961 |
20 | 28.101 |
30 | 58.111 |
40 | 98.181 |
50 | 139.021 |
I can see some interesting things in the log. For example, if we look at the Core optimisation stats for the 50-endpoints API...
We see (see the bottom of this comment) that parsing/renaming/typechecking/desugaring don't take very long, and that after desugaring, the code is not too big yet:
Result size of Desugar (before optimization)
= {terms: 942, types: 39,926, coercions: 18,615, joins: 0/161}
Result size of Desugar (after optimization)
= {terms: 828, types: 39,801, coercions: 18,815, joins: 0/154}
The simplifier runs on that, and the result size doesn't change all that much. But then, boom, we specialise:
Result size of Simplifier
= {terms: 1,060, types: 29,382, coercions: 38,442, joins: 0/1}
!!! Simplifier [API_50]: finished in 498.86 milliseconds, allocated 46.616 megabytes
*** Specialise [API_50]:
Result size of Specialise
= {terms: 15,554, types: 810,695, coercions: 85,830, joins: 0/3}
!!! Specialise [API_50]: finished in 333.94 milliseconds, allocated 43.001 megabytes
We float out:
*** Float out(FOS {Lam = Just 0,
Consts = True,
OverSatApps = False}) [API_50]:
Result size of Float out(FOS {Lam = Just 0,
Consts = True,
OverSatApps = False})
= {terms: 29,526, types: 1,215,461, coercions: 85,830, joins: 0/0}
!!! Float out(FOS {Lam = Just 0,
Consts = True,
OverSatApps = False}) [API_50]: finished in 1163.50 milliseconds, allocated 133.012 megabytes
and try to simplify:
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
= {terms: 29,138, types: 1,324,321, coercions: 111,537, joins: 0/0}
Result size of Simplifier iteration=2
= {terms: 144,992,
types: 5,290,053,
coercions: 1,377,705,
joins: 0/0}
Result size of Simplifier iteration=3
= {terms: 144,792,
types: 5,286,303,
coercions: 1,370,505,
joins: 0/0}
Result size of Simplifier
= {terms: 144,792,
types: 5,286,303,
coercions: 1,370,505,
joins: 0/0}
!!! Simplifier [API_50]: finished in 31122.77 milliseconds, allocated 4832.778 megabytes
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
= {terms: 144,678,
types: 5,285,913,
coercions: 1,370,505,
joins: 0/0}
Result size of Simplifier
= {terms: 144,678,
types: 5,285,913,
coercions: 1,370,505,
joins: 0/0}
!!! Simplifier [API_50]: finished in 14560.75 milliseconds, allocated 2147.932 megabytes
*** Simplifier [API_50]:
Result size of Simplifier
= {terms: 144,678,
types: 5,285,913,
coercions: 1,370,505,
joins: 0/0}
!!! Simplifier [API_50]: finished in 6388.17 milliseconds, allocated 1072.456 megabytes
and from that point onwards, we're working with thousands and thousands of terms, millions of types and coercions. Many passes take whole seconds and gigabytes of allocations, until...
*** Demand analysis [API_50]:
Result size of Demand analysis
= {terms: 21,880,
types: 2,231,518,
coercions: 1,294,470,
joins: 0/0}
!!! Demand analysis [API_50]: finished in 5560.72 milliseconds, allocated 2786.071 megabytes
*** CoreTidy [API_50]:
Result size of Tidy Core
= {terms: 3,507, types: 134,928, coercions: 39,021, joins: 0/0}
!!! CoreTidy [API_50]: finished in 1105.27 milliseconds, allocated 349.239 megabytes
Created temporary directory: /run/user/1001/ghc11842_0
*** CorePrep [API_50]:
Result size of CorePrep
= {terms: 3,732, types: 134,408, coercions: 39,021, joins: 0/102}
!!! CorePrep [API_50]: finished in 39.19 milliseconds, allocated 6.130 megabytes
*** Stg2Stg:
*** CodeGen [API_50]:
!!! CodeGen [API_50]: finished in 760.67 milliseconds, allocated 146.051 megabytes
Core Tidy really deserves its name. Here are the timings & allocs for all the passes:
Pass | Time (ms) | Alloc (bytes) |
---|---|---|
Parser | 7.068 | 2554992 |
Renamer/typechecker | 731.063 | 109437544 |
Desugar | 91.337 | 12701224 |
Simplifier | 498.861 | 48880936 |
Specialise | 333.945 | 45090256 |
Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = False}) | 1163.496 | 139472808 |
Simplifier | 31122.767 | 5067534720 |
Simplifier | 14560.745 | 2252270040 |
Simplifier | 6388.172 | 1124551928 |
Float inwards | 660.780 | 145896080 |
Called arity analysis | 617.060 | 97280936 |
Simplifier | 7205.292 | 1124552056 |
Demand analysis | 7360.820 | 3396416952 |
Worker Wrapper binds | 623.287 | 46714320 |
Simplifier | 24290.701 | 3739472320 |
Exitification transformation | 612.117 | 25169256 |
Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = True}) | 10644.614 | 1489902464 |
Common sub-expression | 3329.560 | 682305136 |
Float inwards | 372.326 | 26302928 |
Simplifier | 10672.911 | 1871560408 |
Demand analysis | 5560.725 | 2921407264 |
CoreTidy | 1105.268 | 366203432 |
CorePrep | 39.191 | 6427320 |
CodeGen | 760.668 | 153145192 |
I'm not going to look at the Core for `API_50`...
$ wc -l API_*.dump-simpl
944 API_10.dump-simpl
219 API_1.dump-simpl
2134 API_20.dump-simpl
281 API_2.dump-simpl
3724 API_30.dump-simpl
349 API_3.dump-simpl
5714 API_40.dump-simpl
422 API_4.dump-simpl
8104 API_50.dump-simpl
499 API_5.dump-simpl
but let's see what we get for `API_2`.
==================== Tidy Core ====================
Result size of Tidy Core
= {terms: 243, types: 840, coercions: 573, joins: 0/0}
-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
$s$fHasServerTYPE:<|>context_$croute2
$s$fHasServerTYPE:<|>context_$croute2 = "2"#
-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
$s$fHasServerTYPE:<|>context_$croute1
$s$fHasServerTYPE:<|>context_$croute1
= unpackCString# $s$fHasServerTYPE:<|>context_$croute2
-- RHS size: {terms: 2, types: 3, coercions: 0, joins: 0/0}
lvl_r9QC
lvl_r9QC = Right $s$fHasServerTYPE:<|>context_$croute1
-- RHS size: {terms: 4, types: 12, coercions: 0, joins: 0/0}
app6
app6 = \ s_a6U0 -> (# s_a6U0, lvl_r9QC #)
-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
$s$fHasServerTYPE:<|>context_$croute5
$s$fHasServerTYPE:<|>context_$croute5 = "1"#
-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
$s$fHasServerTYPE:<|>context_$croute4
$s$fHasServerTYPE:<|>context_$croute4
= unpackCString# $s$fHasServerTYPE:<|>context_$croute5
-- RHS size: {terms: 2, types: 3, coercions: 0, joins: 0/0}
lvl1_r9QD
lvl1_r9QD = Right $s$fHasServerTYPE:<|>context_$croute4
-- RHS size: {terms: 4, types: 12, coercions: 0, joins: 0/0}
app7
app7 = \ s_a6U0 -> (# s_a6U0, lvl1_r9QD #)
-- RHS size: {terms: 3, types: 6, coercions: 36, joins: 0/0}
app5
app5 = :<|> (app7 `cast` <Co:18>) (app6 `cast` <Co:18>)
-- RHS size: {terms: 1, types: 0, coercions: 98, joins: 0/0}
server
server = app5 `cast` <Co:98>
-- RHS size: {terms: 2, types: 5, coercions: 98, joins: 0/0}
app_result
app_result = Route (app5 `cast` <Co:98>)
-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
$trModule2
$trModule2 = "API_2"#
-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
$trModule1
$trModule1 = TrNameS $trModule2
-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
$trModule4
$trModule4 = "main"#
-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
$trModule3
$trModule3 = TrNameS $trModule4
-- RHS size: {terms: 3, types: 0, coercions: 0, joins: 0/0}
$trModule
$trModule = Module $trModule3 $trModule1
-- RHS size: {terms: 1, types: 3, coercions: 0, joins: 0/0}
api
api = Proxy
-- RHS size: {terms: 3, types: 1, coercions: 0, joins: 0/0}
lvl2_r9QE
lvl2_r9QE = : $fAcceptTYPEJSON4 $fAcceptTYPEJSON2
-- RHS size: {terms: 2, types: 11, coercions: 0, joins: 0/0}
lvl3_r9QF
lvl3_r9QF = \ _ -> lvl2_r9QE
-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
$s$fHasServerTYPE:<|>context_$croute3
$s$fHasServerTYPE:<|>context_$croute3 = 200
-- RHS size: {terms: 12, types: 16, coercions: 0, joins: 0/0}
lvl4_r9QG
lvl4_r9QG
= \ s1_a8on ->
case newByteArray# 10# s1_a8on of { (# ipv_a8or, ipv1_a8os #) ->
$wouter
ipv1_a8os 4# $s$fHasServerTYPE:<|>context_$croute1 0# ipv_a8or
}
-- RHS size: {terms: 12, types: 16, coercions: 0, joins: 0/0}
lvl5_r9QH
lvl5_r9QH
= \ s1_a8on ->
case newByteArray# 10# s1_a8on of { (# ipv_a8or, ipv1_a8os #) ->
$wouter
ipv1_a8os 4# $s$fHasServerTYPE:<|>context_$croute4 0# ipv_a8or
}
-- RHS size: {terms: 4, types: 2, coercions: 0, joins: 0/0}
lvl7_r9QJ
lvl7_r9QJ = ++_$s++ [] $fAcceptTYPEJSON4 $fAcceptTYPEJSON2
-- RHS size: {terms: 2, types: 11, coercions: 0, joins: 0/0}
$s$fAllCTRender:a_$callMime
$s$fAllCTRender:a_$callMime = \ _ -> lvl7_r9QJ
-- RHS size: {terms: 4, types: 2, coercions: 15, joins: 0/0}
$sencode
$sencode
= \ eta_a9of ->
toLazyByteString ((string1 eta_a9of) `cast` <Co:15>)
-- RHS size: {terms: 2, types: 4, coercions: 0, joins: 0/0}
lvl8_r9QK
lvl8_r9QK = \ _ -> $sencode
-- RHS size: {terms: 3, types: 5, coercions: 0, joins: 0/0}
$s$fAllCTRender:a_$s$fMimeRenderTYPEJSONa
$s$fAllCTRender:a_$s$fMimeRenderTYPEJSONa
= C:MimeRender $fAcceptTYPEJSON lvl8_r9QK
-- RHS size: {terms: 5, types: 16, coercions: 0, joins: 0/0}
$s$fAllMimeRender:a0_$callMimeRender
$s$fAllMimeRender:a0_$callMimeRender
= \ _ w2_a9nu ->
$w$callMimeRender1
$s$fAllCTRender:a_$s$fMimeRenderTYPEJSONa w2_a9nu
-- RHS size: {terms: 3, types: 9, coercions: 9, joins: 0/0}
$s$fAllCTRender:a_$s$fAllMimeRender:a0
$s$fAllCTRender:a_$s$fAllMimeRender:a0
= C:AllMimeRender
(lvl3_r9QF `cast` <Co:9>) $s$fAllMimeRender:a0_$callMimeRender
-- RHS size: {terms: 21, types: 30, coercions: 0, joins: 0/0}
$wlvl_r9QL
$wlvl_r9QL
= \ ww_s9J5 ww1_s9J6 ww2_s9J7 ww3_s9J8 w_s9J2 ->
case $w$sparseQuality2 ww_s9J5 ww1_s9J6 ww2_s9J7 ww3_s9J8 of {
Nothing -> Nothing;
Just x1_a9nm ->
mapQuality_$smapQuality2
(map
$fAllCTRender:a1
($w$callMimeRender1
$s$fAllCTRender:a_$s$fMimeRenderTYPEJSONa w_s9J2))
x1_a9nm
}
-- RHS size: {terms: 12, types: 19, coercions: 1, joins: 0/0}
lvl9_r9QM
lvl9_r9QM
= \ _ w1_s9J1 w2_s9J2 ->
case w1_s9J1 `cast` <Co:1> of
{ PS ww1_s9J5 ww2_s9J6 ww3_s9J7 ww4_s9J8 ->
$wlvl_r9QL ww1_s9J5 ww2_s9J6 ww3_s9J7 ww4_s9J8 w2_s9J2
}
-- RHS size: {terms: 3, types: 9, coercions: 9, joins: 0/0}
$s$fAllCTRender:a
$s$fAllCTRender:a
= C:AllCTRender
($s$fAllCTRender:a_$callMime `cast` <Co:9>) lvl9_r9QM
-- RHS size: {terms: 6, types: 33, coercions: 11, joins: 0/0}
lvl11_r9QO
lvl11_r9QO
= \ @ env_a8nX ->
$w$croute13
$s$fAllCTRender:a
($fReflectMethodStdMethodGET_$creflectMethod `cast` <Co:4>)
($s$fHasServerTYPE:<|>context_$croute3 `cast` <Co:7>)
Proxy
-- RHS size: {terms: 23, types: 113, coercions: 18, joins: 0/0}
$s$fHasServerTYPE:>context4_$croute
$s$fHasServerTYPE:>context4_$croute
= \ @ env_a8nX w2_a8nY _ w4_a8o0 ->
case w2_a8nY of { Proxy ->
StaticRouter
(case runRW# lvl4_r9QG of { (# ipv_a8ow, ipv1_a8ox #) ->
case ipv1_a8ox of dt_a8oB { Text ipv2_a8Hq ipv3_a8Hr ipv4_a8Hs ->
Bin 1# dt_a8oB (lvl11_r9QO (w4_a8o0 `cast` <Co:18>)) Tip Tip
}
})
[]
}
-- RHS size: {terms: 23, types: 113, coercions: 18, joins: 0/0}
$s$fHasServerTYPE:>context4_$croute1
$s$fHasServerTYPE:>context4_$croute1
= \ @ env_a8nX w2_a8nY _ w4_a8o0 ->
case w2_a8nY of { Proxy ->
StaticRouter
(case runRW# lvl5_r9QH of { (# ipv_a8ow, ipv1_a8ox #) ->
case ipv1_a8ox of dt_a8oB { Text ipv2_a8Hq ipv3_a8Hr ipv4_a8Hs ->
Bin 1# dt_a8oB (lvl11_r9QO (w4_a8o0 `cast` <Co:18>)) Tip Tip
}
})
[]
}
-- RHS size: {terms: 7, types: 6, coercions: 0, joins: 0/0}
app4
app4 = \ _ _ _ _ _ _ -> app_result
-- RHS size: {terms: 10, types: 13, coercions: 260, joins: 0/0}
app3
app3
= Delayed
(emptyDelayed4 `cast` <Co:34>)
(emptyDelayed2 `cast` <Co:32>)
(emptyDelayed2 `cast` <Co:32>)
(emptyDelayed2 `cast` <Co:32>)
(emptyDelayed2 `cast` <Co:32>)
(emptyDelayed2 `cast` <Co:32>)
(emptyDelayed2 `cast` <Co:32>)
(emptyDelayed1 `cast` <Co:34>)
app4
-- RHS size: {terms: 6, types: 31, coercions: 0, joins: 0/0}
app2
app2
= $w$croute
$s$fHasServerTYPE:>context4_$croute1
$s$fHasServerTYPE:>context4_$croute
Proxy
$WEmptyContext
app3
-- RHS size: {terms: 3, types: 1, coercions: 0, joins: 0/0}
app1
app1 = runRouterEnv app2 ()
-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
app
app = toApplication app1
------ Local rules for imported ids --------
"SPEC/API_2 $fHasServerTYPE:>context4_$croute @ "2" @ (Verb
'GET 200 '[JSON] String) @ '[]" [2]
forall w1_s8Hz w_s8Hy.
$fHasServerTYPE:>context4_$croute w_s8Hy w1_s8Hz
= $s$fHasServerTYPE:>context4_$croute
"SPEC/API_2 $fHasServerTYPE:>context4_$croute @ "1" @ (Verb
'GET 200 '[JSON] String) @ '[]" [2]
forall w1_s8Hw w_s8Hv.
$fHasServerTYPE:>context4_$croute w_s8Hv w1_s8Hw
= $s$fHasServerTYPE:>context4_$croute1
"SPEC/API_2 $fAllCTRender:a_$callMime @ JSON @ '[]" [2]
forall w1_s9j3 w_s9j2.
$fAllCTRender:a_$callMime w_s9j2 w1_s9j3
= $s$fAllCTRender:a_$callMime
"SPEC/API_2 $fAllCTRender:a @ JSON @ '[] @ [Char]"
forall v2_s90i v1_s90h v_s90g.
$fAllCTRender:a v_s90g v1_s90h v2_s90i
= $s$fAllCTRender:a
"SPEC/API_2 $fAllMimeRender:a0_$callMimeRender @ JSON @ [Char]" [2]
forall w_s9nT.
$fAllMimeRender:a0_$callMimeRender w_s9nT
= $s$fAllMimeRender:a0_$callMimeRender
"SPEC/API_2 $fAllMimeRender:a0 @ JSON @ [Char]"
forall v_s9np.
$fAllMimeRender:a0 v_s9np
= $s$fAllCTRender:a_$s$fAllMimeRender:a0
"SPEC/API_2 encode @ [Char]"
forall $dToJSON_s9om. encode $dToJSON_s9om = $sencode
"SPEC/API_2 $fMimeRenderTYPEJSONa @ [Char]"
forall v_s9oc.
$fMimeRenderTYPEJSONa v_s9oc
= $s$fAllCTRender:a_$s$fMimeRenderTYPEJSONa
I'm not sure GHC does anything particularly wrong here, but if anyone sees something, let me know. Perhaps the servant libraries could give better hints so as to make sure users don't end up with gigantic Core. It'd be interesting to see on what kind of Core terms the blowup happens.
Whole core stats:
*** Parser [API_50]:
!!! Parser [API_50]: finished in 7.07 milliseconds, allocated 2.437 megabytes
*** Renamer/typechecker [API_50]:
!!! Renamer/typechecker [API_50]: finished in 731.06 milliseconds, allocated 104.368 megabytes
*** Desugar [API_50]:
Result size of Desugar (before optimization)
= {terms: 942, types: 39,926, coercions: 18,615, joins: 0/161}
Result size of Desugar (after optimization)
= {terms: 828, types: 39,801, coercions: 18,815, joins: 0/154}
!!! Desugar [API_50]: finished in 91.34 milliseconds, allocated 12.113 megabytes
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
= {terms: 1,352, types: 45,874, coercions: 38,442, joins: 0/1}
Result size of Simplifier iteration=2
= {terms: 1,060, types: 29,382, coercions: 38,442, joins: 0/1}
Result size of Simplifier
= {terms: 1,060, types: 29,382, coercions: 38,442, joins: 0/1}
!!! Simplifier [API_50]: finished in 498.86 milliseconds, allocated 46.616 megabytes
*** Specialise [API_50]:
Result size of Specialise
= {terms: 15,554, types: 810,695, coercions: 85,830, joins: 0/3}
!!! Specialise [API_50]: finished in 333.94 milliseconds, allocated 43.001 megabytes
*** Float out(FOS {Lam = Just 0,
Consts = True,
OverSatApps = False}) [API_50]:
Result size of Float out(FOS {Lam = Just 0,
Consts = True,
OverSatApps = False})
= {terms: 29,526, types: 1,215,461, coercions: 85,830, joins: 0/0}
!!! Float out(FOS {Lam = Just 0,
Consts = True,
OverSatApps = False}) [API_50]: finished in 1163.50 milliseconds, allocated 133.012 megabytes
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
= {terms: 29,138, types: 1,324,321, coercions: 111,537, joins: 0/0}
Result size of Simplifier iteration=2
= {terms: 144,992,
types: 5,290,053,
coercions: 1,377,705,
joins: 0/0}
Result size of Simplifier iteration=3
= {terms: 144,792,
types: 5,286,303,
coercions: 1,370,505,
joins: 0/0}
Result size of Simplifier
= {terms: 144,792,
types: 5,286,303,
coercions: 1,370,505,
joins: 0/0}
!!! Simplifier [API_50]: finished in 31122.77 milliseconds, allocated 4832.778 megabytes
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
= {terms: 144,678,
types: 5,285,913,
coercions: 1,370,505,
joins: 0/0}
Result size of Simplifier
= {terms: 144,678,
types: 5,285,913,
coercions: 1,370,505,
joins: 0/0}
!!! Simplifier [API_50]: finished in 14560.75 milliseconds, allocated 2147.932 megabytes
*** Simplifier [API_50]:
Result size of Simplifier
= {terms: 144,678,
types: 5,285,913,
coercions: 1,370,505,
joins: 0/0}
!!! Simplifier [API_50]: finished in 6388.17 milliseconds, allocated 1072.456 megabytes
*** Float inwards [API_50]:
Result size of Float inwards
= {terms: 144,678,
types: 5,285,913,
coercions: 1,370,505,
joins: 0/0}
!!! Float inwards [API_50]: finished in 660.78 milliseconds, allocated 139.137 megabytes
*** Called arity analysis [API_50]:
Result size of Called arity analysis
= {terms: 144,678,
types: 5,285,913,
coercions: 1,370,505,
joins: 0/0}
!!! Called arity analysis [API_50]: finished in 617.06 milliseconds, allocated 92.774 megabytes
*** Simplifier [API_50]:
Result size of Simplifier
= {terms: 144,678,
types: 5,285,913,
coercions: 1,370,505,
joins: 0/0}
!!! Simplifier [API_50]: finished in 7205.29 milliseconds, allocated 1072.456 megabytes
*** Demand analysis [API_50]:
Result size of Demand analysis
= {terms: 144,678,
types: 5,285,913,
coercions: 1,370,505,
joins: 0/0}
!!! Demand analysis [API_50]: finished in 7360.82 milliseconds, allocated 3239.076 megabytes
*** Worker Wrapper binds [API_50]:
Result size of Worker Wrapper binds
= {terms: 149,098,
types: 5,686,426,
coercions: 1,434,172,
joins: 0/671}
!!! Worker Wrapper binds [API_50]: finished in 623.29 milliseconds, allocated 44.550 megabytes
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
= {terms: 146,708,
types: 5,488,159,
coercions: 1,370,905,
joins: 0/144}
Result size of Simplifier iteration=2
= {terms: 146,397,
types: 5,471,833,
coercions: 1,370,505,
joins: 0/0}
Result size of Simplifier
= {terms: 146,397,
types: 5,471,833,
coercions: 1,370,505,
joins: 0/0}
!!! Simplifier [API_50]: finished in 24290.70 milliseconds, allocated 3566.239 megabytes
*** Exitification transformation [API_50]:
Result size of Exitification transformation
= {terms: 146,397,
types: 5,471,833,
coercions: 1,370,505,
joins: 0/0}
!!! Exitification transformation [API_50]: finished in 612.12 milliseconds, allocated 24.003 megabytes
*** Float out(FOS {Lam = Just 0,
Consts = True,
OverSatApps = True}) [API_50]:
Result size of Float out(FOS {Lam = Just 0,
Consts = True,
OverSatApps = True})
= {terms: 169,649,
types: 6,797,454,
coercions: 1,370,505,
joins: 0/0}
!!! Float out(FOS {Lam = Just 0,
Consts = True,
OverSatApps = True}) [API_50]: finished in 10644.61 milliseconds, allocated 1420.882 megabytes
*** Common sub-expression [API_50]:
Result size of Common sub-expression
= {terms: 43,422,
types: 3,637,259,
coercions: 1,276,012,
joins: 0/0}
!!! Common sub-expression [API_50]: finished in 3329.56 milliseconds, allocated 650.697 megabytes
*** Float inwards [API_50]:
Result size of Float inwards
= {terms: 43,422,
types: 3,637,259,
coercions: 1,276,012,
joins: 0/0}
!!! Float inwards [API_50]: finished in 372.33 milliseconds, allocated 25.084 megabytes
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
= {terms: 23,508,
types: 2,336,132,
coercions: 1,295,370,
joins: 0/50}
Result size of Simplifier iteration=2
= {terms: 21,880,
types: 2,231,518,
coercions: 1,294,470,
joins: 0/0}
Result size of Simplifier
= {terms: 21,880,
types: 2,231,518,
coercions: 1,294,470,
joins: 0/0}
!!! Simplifier [API_50]: finished in 10672.91 milliseconds, allocated 1784.859 megabytes
*** Demand analysis [API_50]:
Result size of Demand analysis
= {terms: 21,880,
types: 2,231,518,
coercions: 1,294,470,
joins: 0/0}
!!! Demand analysis [API_50]: finished in 5560.72 milliseconds, allocated 2786.071 megabytes
*** CoreTidy [API_50]:
Result size of Tidy Core
= {terms: 3,507, types: 134,928, coercions: 39,021, joins: 0/0}
!!! CoreTidy [API_50]: finished in 1105.27 milliseconds, allocated 349.239 megabytes
Created temporary directory: /run/user/1001/ghc11842_0
*** CorePrep [API_50]:
Result size of CorePrep
= {terms: 3,732, types: 134,408, coercions: 39,021, joins: 0/102}
!!! CorePrep [API_50]: finished in 39.19 milliseconds, allocated 6.130 megabytes
*** Stg2Stg:
*** CodeGen [API_50]:
!!! CodeGen [API_50]: finished in 760.67 milliseconds, allocated 146.051 megabytes
$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 8.6.5
$ ghc-pkg list servant*
/nix/store/j86yn21gz4zly8w31hjd9vd5gw9yd2h1-ghc-8.6.5-with-packages/lib/ghc-8.6.5/package.conf.d
servant-0.15
servant-server-0.15
It would be interesting to take a look at this one with the GHC profiling story improved.
These issues also happen with `-O1`. Using `-fno-specialize` made my 50-route example compile in <1s vs 10s.
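If turning off specialisation globally is too blunt, the same flag can be applied per module with an `OPTIONS_GHC` pragma. A minimal sketch: the module and routes below are made up, only the flag comes from the comment above.

```haskell
{-# OPTIONS_GHC -fno-specialize #-}
{-# LANGUAGE DataKinds, TypeOperators #-}
-- Hypothetical module: only the file holding the big :<|> chain opts out
-- of specialisation, so the rest of the package keeps full -O2.
module BigApi where

import Data.Proxy
import Servant

type API = "1" :> Get '[JSON] String :<|> "2" :> Get '[JSON] String

server :: Server API
server = return "1" :<|> return "2"

app = serve (Proxy :: Proxy API) server
```

The trade-off, of course, is losing whatever runtime benefit specialisation would have brought to that module.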
I believe the core issue could be related to the 7-year-old "TypeFamilies painfully slow" issue.
If I'm reading trac correctly, a fix related to exponential inlining should be in GHC 9.0.1. I haven't tested this since ~~it appears servant isn't building with 9.0.1 yet~~ I don't have time to build servant from source right now.
~~It might be interesting to see if `-fomit-interface-pragmas`, as suggested in this comment, helps anything.~~ (No meaningful change for me)
I don't think the issue is just type families; they're used quite a bit in servant, but not everywhere. Quite likely the problem is similar to what we see with https://gitlab.haskell.org/ghc/ghc/-/issues/5642 and many other tickets, where we have a bunch of values that have different types, and where each sub-expression that combines those values ends up having its own type too. Simply generating the Core to describe all those things ends up taking quite a bit of time and space, and it all has to be processed by every single step of the optimisation pipeline, etc. I suspect `-ddump-timings` or `-dshow-passes` would help illustrate this.
I took your example from above and expanded it to 37 routes (random copy pasting gave an odd number):
{-# LANGUAGE DataKinds, TypeOperators, TypeFamilies #-}
module API_2 where
import Data.Proxy
import Servant
type API = "1" :> Get '[JSON] String :<|> "2" :> Get '[JSON] String
api = Proxy :: Proxy API
server :: Server API
server = return "1" :<|> return "2"
app = serve api server
Compiling this with `-O1` took 10s, like:
ghc -O1 -fforce-recomp 37.hs +RTS -s
Turning off pre-inlining took it from 10s to 1s, with the command:
ghc -O1 -fforce-recomp -fno-pre-inlining 37.hs +RTS -s -RTS
Before I knew to try `-fno-pre-inlining`, I ran `-dshow-passes` like:
ghc -O1 -fforce-recomp -ddump-timings -dshow-passes 37.hs +RTS -s -RTS
Resulting in:
*** Simplifier [API_2]:
Result size of Simplifier iteration=1
= {terms: 13,936, types: 974,580, coercions: 58,364, joins: 0/41}
Result size of Simplifier iteration=2
= {terms: 50,614,
types: 1,907,589,
coercions: 571,691,
joins: 0/699}
Result size of Simplifier iteration=3
= {terms: 50,654,
types: 1,909,075,
coercions: 572,239,
joins: 0/76}
Result size of Simplifier iteration=4
= {terms: 50,578,
types: 1,908,885,
coercions: 572,239,
joins: 0/38}
Result size of Simplifier
= {terms: 50,578,
types: 1,908,885,
coercions: 572,239,
joins: 0/38}
!!! Simplifier [API_2]: finished in 2286.12 milliseconds, allocated 2415.010 megabytes
Simplifier [API_2]: alloc=2532321584 time=2286.117
I think that `-fno-pre-inlining` would be disabling this initial simplifier pass, though:
*** Simplifier [API_2]:
Result size of Simplifier iteration=1
= {terms: 712, types: 24,452, coercions: 22,028, joins: 0/2}
Result size of Simplifier iteration=2
= {terms: 640, types: 16,091, coercions: 22,028, joins: 0/1}
Result size of Simplifier
= {terms: 640, types: 16,091, coercions: 22,028, joins: 0/1}
!!! Simplifier [API_2]: finished in 77.03 milliseconds, allocated 39.149 megabytes
Simplifier [API_2]: alloc=41050960 time=77.035
Where things grow, but definitely don't blow up. Types blow up in the specialization pass right after:
*** Specialise [API_2]:
Result size of Specialise
= {terms: 3,928, types: 374,371, coercions: 39,797, joins: 0/3}
!!! Specialise [API_2]: finished in 73.37 milliseconds, allocated 108.150 megabytes
Specialise [API_2]: alloc=113403720 time=73.374
Maybe if I do `-dshow-passes` with `-fno-pre-inlining` it will demonstrate this...
Edit: With `-fno-pre-inlining` there is no blow-up in specialization for some reason.
*** Simplifier [API_2]:
Result size of Simplifier iteration=1
= {terms: 716, types: 24,946, coercions: 22,028, joins: 0/2}
Result size of Simplifier iteration=2
= {terms: 714, types: 24,934, coercions: 22,028, joins: 0/1}
Result size of Simplifier
= {terms: 714, types: 24,934, coercions: 22,028, joins: 0/1}
!!! Simplifier [API_2]: finished in 77.12 milliseconds, allocated 41.014 megabytes
Simplifier [API_2]: alloc=43006656 time=77.116
*** Specialise [API_2]:
Result size of Specialise
= {terms: 960, types: 32,776, coercions: 22,457, joins: 0/3}
!!! Specialise [API_2]: finished in 16.61 milliseconds, allocated 15.842 megabytes
Note that types were at 374,371 without `-fno-pre-inlining`, and here they are 32,776.
I'm guessing if someone were to see the core after pre-inlining and after specialization they might be able to say what optimization causes the types blowup?
GHC 9 takes things with that example from 10s to 16s, so going in the wrong direction...
Something interesting is the blow up seems to happen in the 3rd chunk of "Simplifier" instead of the inliner though using 9.0. Hopefully GHC 9.2 helps, but servant definitely doesn't build with it right now and I'm not sure how to make this problem more minimal yet.
I think you are right @alpmestan, I tried replicating the compile slowdown with mini servant and could not.
I noticed one thing that the minimal version is missing is content types. Is there a way to isolate testing just the content type pieces compile time?
I'm currently trying to figure out if that's possible and to understand how the content type machinery works with the `serve` function, but that will take some time for me.
I also wonder if GHC is just slow with type-level lists and the operations that seem to be behind a lot of the content-type machinery.
I noticed one thing that the minimal version is missing is content types. Is there a way to isolate testing just the content type pieces compile time?
Probably it's not about the content types alone, but about adding them to everything else. However, you can grab the `Accept`/`MimeRender`/`MimeUnrender` classes and specify a few dummy content types that you'd instantiate those classes for, if you want to add that to your testing. This is all there is to content types and the `serve` function too. The `HasServer` instances for verbs or request bodies end up requiring suitable `MimeRender`/`MimeUnrender` instances for all the content types in the list, as you can see e.g. here: https://github.com/haskell-servant/servant/blob/master/servant-server/src/Servant/Server/Internal.hs#L294 (the `AllCTRender ctypes a` constraint).
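For anyone who wants to experiment along those lines, here is a rough sketch of such a dummy content type using the real `Accept`/`MimeRender`/`MimeUnrender` classes; the `PlainCT` name is made up for the example.

```haskell
{-# LANGUAGE FlexibleInstances     #-}
{-# LANGUAGE MultiParamTypeClasses #-}
{-# LANGUAGE OverloadedStrings     #-}
-- A dummy content type for compile-time experiments; PlainCT is a made-up
-- name, the classes are the real ones from servant.
module DummyCT where

import qualified Data.ByteString.Lazy.Char8 as BL
import qualified Network.HTTP.Media as M
import Servant.API.ContentTypes (Accept (..), MimeRender (..), MimeUnrender (..))

data PlainCT

instance Accept PlainCT where
  contentType _ = "text" M.// "plain"

instance MimeRender PlainCT String where
  mimeRender _ = BL.pack

instance MimeUnrender PlainCT String where
  mimeUnrender _ = Right . BL.unpack
```

It can then stand in for `JSON` in the toy APIs above, e.g. `"1" :> Get '[PlainCT] String`, to see how much of the cost the content-type machinery contributes.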
Just ran the same benchmark as @bergey in the first post on my machine (a laptop from like 2014-2015), and then changed all benchmark implementations to use `NamedRoutes`.
I built everything with `lts-18.27`, because I was getting panics from GHC 8.2.2.
And I ran `bench --before 'stack clean' -L 120 'stack build --ghc-options="-O2"'`.
It seems to fix the quadratic compilation times:
number of routes | time (sec) Old | time (sec) NamedRoutes |
---|---|---|
1 | 7.5 | 8.1 |
8 | 9.0 | 9.2 |
16 | 13.2 | 10.6 |
32 | 30.7 | 13.4 |
64 | 141.2 | 19.3 |
Here are examples of how the modules look and the complete output of the bench:
Example of the anonymous routes
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where
import Data.Semigroup
import Data.Text (Text)
import Servant hiding (NotSecure)
import Servant.Server
import Network.Wai.Handler.Warp (defaultSettings, runSettings,
setLogger, setPort)
import qualified Data.Text as T
type MainAPI =
"static" :> "1" :> Get '[JSON] Text
:<|> "static" :> "2" :> Get '[JSON] Text
:<|> "static" :> "3" :> Get '[JSON] Text
:<|> "static" :> "4" :> Get '[JSON] Text
:<|> "static" :> "5" :> Get '[JSON] Text
:<|> "static" :> "6" :> Get '[JSON] Text
:<|> "static" :> "7" :> Get '[JSON] Text
:<|> "static" :> "8" :> Get '[JSON] Text
mainServer :: Server MainAPI
mainServer = foo :<|> foo :<|> foo :<|> foo
:<|> foo :<|> foo :<|> foo :<|> foo
foo = return "foo"
main :: IO ()
main = runSettings defaultSettings $
serveWithContext (Proxy :: Proxy MainAPI) EmptyContext mainServer
Benchmark output of anonymous routes
/flat-1
benchmarking stack build --ghc-options="-O2"
time 7.427 s (7.102 s .. 7.745 s)
0.999 R² (0.998 R² .. 1.000 R²)
mean 7.458 s (7.395 s .. 7.551 s)
std dev 98.88 ms (29.73 ms .. 142.7 ms)
variance introduced by outliers: 16% (moderately inflated)
/flat-8
benchmarking stack build --ghc-options="-O2"
time 9.024 s (8.667 s .. 9.838 s)
0.999 R² (0.998 R² .. 1.000 R²)
mean 9.027 s (8.806 s .. 9.188 s)
std dev 226.5 ms (119.0 ms .. 315.0 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-16
benchmarking stack build --ghc-options="-O2"
time 13.36 s (12.86 s .. 14.10 s)
1.000 R² (0.999 R² .. 1.000 R²)
mean 13.21 s (12.81 s .. 13.36 s)
std dev 284.8 ms (53.22 ms .. 372.0 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-32
benchmarking stack build --ghc-options="-O2"
time 30.73 s (30.19 s .. 30.98 s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 30.72 s (30.62 s .. 30.80 s)
std dev 108.2 ms (48.80 ms .. 142.1 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-64
benchmarking stack build --ghc-options="-O2"
time 144.4 s (134.1 s .. 159.8 s)
0.998 R² (0.998 R² .. 1.000 R²)
mean 141.2 s (138.7 s .. 143.6 s)
std dev 2.836 s (2.113 s .. 3.384 s)
variance introduced by outliers: 19% (moderately inflated)
Example of NamedRoutes
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeOperators #-}
module Main where
import Data.Semigroup
import Data.Text (Text)
import GHC.Generics (Generic)
import Servant hiding (NotSecure)
import Servant.API.Generic
import Servant.Server
import Network.Wai.Handler.Warp (defaultSettings, runSettings,
setLogger, setPort)
import qualified Data.Text as T
type MainAPI = NamedRoutes API
data API mode = API
{ static1 :: mode :- "static" :> "1" :> Get '[JSON] Text
, static2 :: mode :- "static" :> "2" :> Get '[JSON] Text
, static3 :: mode :- "static" :> "3" :> Get '[JSON] Text
, static4 :: mode :- "static" :> "4" :> Get '[JSON] Text
, static5 :: mode :- "static" :> "5" :> Get '[JSON] Text
, static6 :: mode :- "static" :> "6" :> Get '[JSON] Text
, static7 :: mode :- "static" :> "7" :> Get '[JSON] Text
, static8 :: mode :- "static" :> "8" :> Get '[JSON] Text
} deriving Generic
mainServer :: Server MainAPI
mainServer = API
{ static1 = foo
, static2 = foo
, static3 = foo
, static4 = foo
, static5 = foo
, static6 = foo
, static7 = foo
, static8 = foo
}
foo = return "foo"
main :: IO ()
main = runSettings defaultSettings $
serveWithContext (Proxy :: Proxy MainAPI) EmptyContext mainServer
Benchmark output of NamedRoutes
/flat-1
benchmarking stack build --ghc-options="-O2"
time 7.882 s (7.136 s .. 8.371 s)
0.999 R² (0.997 R² .. 1.000 R²)
mean 8.137 s (7.986 s .. 8.224 s)
std dev 144.6 ms (40.75 ms .. 192.8 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-8
benchmarking stack build --ghc-options="-O2"
time 9.142 s (8.178 s .. 9.885 s)
0.999 R² (0.996 R² .. 1.000 R²)
mean 9.275 s (9.155 s .. 9.473 s)
std dev 194.6 ms (66.72 ms .. 264.1 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-16
benchmarking stack build --ghc-options="-O2"
time 10.66 s (9.664 s .. 11.77 s)
0.999 R² (0.995 R² .. 1.000 R²)
mean 10.56 s (10.40 s .. 10.76 s)
std dev 195.4 ms (84.90 ms .. 271.9 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-32
benchmarking stack build --ghc-options="-O2"
time 13.59 s (12.54 s .. 14.52 s)
0.999 R² (0.997 R² .. 1.000 R²)
mean 13.35 s (13.13 s .. 13.53 s)
std dev 248.5 ms (82.63 ms .. 338.9 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-64
benchmarking stack build --ghc-options="-O2"
time 19.27 s (18.93 s .. 19.82 s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 19.33 s (19.22 s .. 19.40 s)
std dev 114.1 ms (66.46 ms .. 160.4 ms)
variance introduced by outliers: 19% (moderately inflated)
If anyone else wants to bench the NamedRoutes: Here's my fork
You can bench the anonymous routes by checking out this commit: de54845850449acd222cd98aa70fa7003a1a60b3
It is extremely surprising to me that `NamedRoutes` would bring performance gains on flat APIs (some gains on nested routes would have been less surprising IMHO).
I am guessing `NamedRoutes` gets in the way of inlining, which somehow is a good thing here?
Is it maybe because the flat API isn't actually flat with the anonymous routes, but nested with every extra `:<|>`?
Route1 :<|> (Route2 :<|> (Route3 :<|> ...))
The `HasServer` instance for `NamedRoutes` still relies on `:<|>` under the hood, see here:
https://hackage.haskell.org/package/servant-server-0.19/docs/src/Servant.Server.Internal.html#line-929
We piggy-back on the instance of `ToServantApi api`, which is a vanilla servant API.
I am guessing NamedRoutes gets in the way of inlining, which somehow is a good thing here?
2 ideas then:
- generate the core and diff each version or analyze like done here
- Run each of the 64-route benchmarks with `-dshow-passes` and find the largest differences
Maybe there's a better way to say "print all fired optimizations" in a way you could diff though.
For what it's worth, I was not able to reproduce those benchmark results. For me, named routes were slower than the old way of doing things by a fair bit. You can see the entire benchmark script and results here: https://gist.github.com/tfausak/8019733fb5c703994d1665143c60ad0f
Click to expand example "old" module.
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeOperators #-}
module M ( main ) where
import qualified Network.Wai as Wai
import qualified Network.Wai.Handler.Warp as Warp
import qualified Servant
main :: IO ()
main = Warp.run 3000 application
application :: Wai.Application
application = Servant.serve proxy server
proxy :: Servant.Proxy Api
proxy = Servant.Proxy
server :: Servant.Server Api
server
= pure 1
Servant.:<|> pure 2
type Api
= "1" Servant.:> Servant.Get '[Servant.JSON] Int
Servant.:<|> "2" Servant.:> Servant.Get '[Servant.JSON] Int
Click to expand example "new" module.
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE TypeOperators #-}
module M ( main ) where
import qualified Network.Wai as Wai
import qualified Network.Wai.Handler.Warp as Warp
import qualified Servant
import qualified Servant.API.Generic as Servant
import qualified Servant.Server.Generic as Servant
import qualified GHC.Generics as Generics
main :: IO ()
main = Warp.run 3000 application
application :: Wai.Application
application = Servant.genericServe routes
routes :: Routes Servant.AsServer
routes = Routes
{ _1 = pure 1
, _2 = pure 2
}
data Routes routes = Routes
{ _1 :: routes Servant.:- "1" Servant.:> Servant.Get '[Servant.JSON] Int
, _2 :: routes Servant.:- "2" Servant.:> Servant.Get '[Servant.JSON] Int
} deriving (Generics.Generic)
With 64 routes, the old module compiles in 4.4 seconds compared to 16.2 seconds for the new module.
I know that benchmarking is hard, so perhaps I've done something wrong here. Or maybe my server definitions are subtly different than @Vlix's in some way. But I wasn't able to get any module with named routes compiling faster than an equivalent module with plain old `(:<|>)`.
@tfausak The only obvious difference between your two benchmarks is that in @Vlix's, all handlers are the same (`foo`). But does it actually make a difference?
Also, I notice that compilation times for the `:<|>` routes do not seem quadratic at all in the number of routes in your benchmarks.
@tfausak I'm not completely sure, but it looks like you're building with `-O1`, whereas we're all talking about compiling with `-O2` (at least that's what the title of this issue states).
EDIT: Their try gave me an idea, though, brb
EDIT2: Ok, it's not done benching, but I think I'm already seeing what happened. I didn't build the "old timing" with `servant-0.19` (just `servant-0.18.3`, because that's in `lts-18.27`), since it didn't need it (you don't need NamedRoutes to bench the old stuff). And I was fine with the results, because they confirmed @bergey's initial issue. Then I tried with `NamedRoutes` and `servant-0.19`, and attributed the improvement to `NamedRoutes`.
I tried the "old timing" benchmark with /servant-(server-)?0.19/
in the extra-deps
and yeah, the quadratic building times disappear O_o
So SOMETHING happened between /servant-(server-)?0.18.3/
and /servant-(server-)?0.19/
!
tl;dr MY BAD! It's not NamedRoutes
! It's the latest release of servant-(server-)0.19
!
Ok, rebuilt again with just `lts-18.27` and then with `extra-deps: [servant-0.19, servant-server-0.19]` (added the NamedRoutes times again just for reference, did not rebench those).
So yeah... if anyone knows the diff between `0.18.3` and `0.19`, please explain :sweat_smile: (it has to be in either `servant-0.19` or `servant-server-0.19`, because those are the only ones different between these two benches)
number of routes | time (sec) anonymous routes (0.18.3) | time (sec) anonymous routes (0.19) | time (sec) NamedRoutes (0.19) -- for reference |
---|---|---|---|
1 | 7.6 | 7.6 | 8.1 |
8 | 9.2 | 8.3 | 9.2 |
16 | 13.3 | 8.8 | 10.6 |
32 | 30.9 | 10.6 | 13.4 |
64 | 137.8 | 15.1 | 19.3 |
Regular (flat) anonymous routes with servant(-server)-0.18.3
/flat-1
benchmarking stack build --ghc-options="-O2"
time 7.568 s (7.233 s .. 7.800 s)
0.999 R² (0.997 R² .. 1.000 R²)
mean 7.603 s (7.491 s .. 7.649 s)
std dev 88.08 ms (16.25 ms .. 120.2 ms)
variance introduced by outliers: 16% (moderately inflated)
/flat-8
benchmarking stack build --ghc-options="-O2"
time 8.852 s (8.185 s .. 9.748 s)
0.999 R² (0.996 R² .. NaN R²)
mean 9.192 s (9.027 s .. 9.422 s)
std dev 219.0 ms (86.83 ms .. 280.4 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-16
benchmarking stack build --ghc-options="-O2"
time 13.17 s (12.31 s .. 13.69 s)
1.000 R² (0.999 R² .. 1.000 R²)
mean 13.26 s (13.17 s .. 13.35 s)
std dev 117.4 ms (38.85 ms .. 160.3 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-32
benchmarking stack build --ghc-options="-O2"
time 30.62 s (30.06 s .. 31.17 s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 30.87 s (30.73 s .. 30.98 s)
std dev 162.3 ms (107.5 ms .. 193.3 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-64
benchmarking stack build --ghc-options="-O2"
time 140.1 s (131.9 s .. 155.9 s)
0.999 R² (0.997 R² .. 1.000 R²)
mean 137.8 s (135.7 s .. 139.5 s)
std dev 2.217 s (959.9 ms .. 2.979 s)
variance introduced by outliers: 19% (moderately inflated)
Regular (flat) anonymous routes with servant(-server)-0.19
/flat-1
benchmarking stack build --ghc-options="-O2"
time 7.424 s (7.307 s .. 7.547 s)
1.000 R² (0.998 R² .. 1.000 R²)
mean 7.630 s (7.522 s .. 7.754 s)
std dev 140.9 ms (104.1 ms .. 175.0 ms)
variance introduced by outliers: 16% (moderately inflated)
/flat-8
benchmarking stack build --ghc-options="-O2"
time 8.232 s (7.737 s .. 9.237 s)
0.998 R² (0.997 R² .. 1.000 R²)
mean 8.264 s (8.000 s .. 8.464 s)
std dev 272.4 ms (141.9 ms .. 375.2 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-16
benchmarking stack build --ghc-options="-O2"
time 8.760 s (7.966 s .. 9.571 s)
0.999 R² (0.996 R² .. 1.000 R²)
mean 8.797 s (8.600 s .. 8.876 s)
std dev 136.9 ms (40.51 ms .. 183.0 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-32
benchmarking stack build --ghc-options="-O2"
time 10.65 s (9.727 s .. 11.22 s)
0.999 R² (0.998 R² .. 1.000 R²)
mean 10.59 s (10.34 s .. 10.73 s)
std dev 240.8 ms (119.2 ms .. 306.2 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-64
benchmarking stack build --ghc-options="-O2"
time 14.81 s (14.36 s .. 15.46 s)
1.000 R² (0.999 R² .. 1.000 R²)
mean 15.11 s (14.93 s .. 15.40 s)
std dev 269.4 ms (1.918 ms .. 330.8 ms)
variance introduced by outliers: 19% (moderately inflated)
You could run `git bisect` with a script that builds the 32-route benchmark and fails if it takes more than 20 seconds, to narrow things down if there are a lot of commits.
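One way to express that check as a small Haskell script, runnable via `git bisect run runghc Bisect.hs` — purely a sketch: the `flat-32` target name and the 20-second threshold are just the numbers from the suggestion above, and note that a failed build would also register as "bad" here.

```haskell
-- Bisect.hs (sketch): succeed if the 32-route benchmark builds in under 20s.
import Data.Time.Clock (diffUTCTime, getCurrentTime)
import System.Exit (exitFailure, exitSuccess)
import System.Process (callCommand)

main :: IO ()
main = do
  start <- getCurrentTime
  callCommand "cabal build flat-32"  -- hypothetical benchmark target
  end <- getCurrentTime
  let seconds = realToFrac (diffUTCTime end start) :: Double
  if seconds < 20
    then exitSuccess  -- fast enough: git bisect treats this commit as good
    else exitFailure  -- too slow: git bisect treats this commit as bad
```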
tl;dr MY BAD! It's not NamedRoutes! It's the latest release of servant-(server-)0.19!
The plot thickens :-) Off the top of my head, I can't think of any change in 0.19 that could be responsible for such improvements. I am going to try bisecting on my end.
@Vlix I am not able to reproduce your results either. Compilation times for a module of 64 anonymous routes seem pretty much the same on `HEAD` and on `0.18.3`:
de923fc887f9e0197c8795fcb0e6967b196e0e17 (HEAD):
benchmarking cabal build flat-64
time 5.153 s (5.121 s .. NaN s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 5.137 s (5.131 s .. 5.147 s)
std dev 9.297 ms (417.9 μs .. 11.61 ms)
variance introduced by outliers: 19% (moderately inflated)
e2b897d3c0655a84551532944f9cffaa9daa8168 (Prepare servant-0.18.3 release):
benchmarking cabal build flat-64
time 5.047 s (5.027 s .. 5.076 s)
1.000 R² (1.000 R² .. 1.000 R²)
mean 5.036 s (5.032 s .. 5.041 s)
std dev 6.011 ms (1.940 ms .. 8.131 ms)
variance introduced by outliers: 19% (moderately inflated)
This is using GHC 8.10.4 though.
I am able to reproduce the performance difference between Servant 0.18.3 and Servant 0.19. I'm using the same setup as before, which is with Stackage LTS 18.28, which uses GHC 8.10.7.
Number of routes | Servant 0.18 build time (seconds) | Servant 0.19 build time (seconds) |
---|---|---|
8 | 1.09 | 0.92 |
16 | 2.73 | 1.20 |
32 | 10.83 | 1.90 |
64 | 63.87 | 4.44 |
Comparing the output of `stack ls dependencies`, this is all that shows up in the diff:
46a47
> constraints 0.13.3
114,115c115,116
< servant 0.18.3
< servant-server 0.18.3
---
> servant 0.19
> servant-server 0.19
142a144
> type-equality 1
Just tried with GHC 8.10.7 in a slightly modified servant repository — still can't reproduce the huge gains witnessed by some, so obviously I am not able to `git bisect`. If anything, those gains seem quite brittle, but it would be good to get some understanding of what is going on here.
@ParetoOptimalDev It looks like you intended to go down that route, did you by any chance figure it out?
Are you using a Stack LTS? Or are you using cabal? :thinking: (Could you maybe literally share the commands you run to do the benchmarking?)
Are you using a Stack LTS? Or are you using cabal?
This is using `cabal`, indeed, and not stack. We're gonna try and see what happens with `stack`.
I am benching with something along those lines:
cabal build --only-dependencies flat-64
bench -o results/(git rev-parse HEAD).html --before 'rm -rf ./dist-newstyle/build/x86_64-linux/ghc-8.10.7/flat64-0.1.0.0' 'cabal build flat-64'
where `flat-64` is as in the benchmarks linked above (but with `-O2` forcefully added to `ghc-options` in the Cabal file).
My changes are in this commit: https://github.com/haskell-servant/servant/commit/142681c9173ff72d2b1060025ceadeebba1aa8e8
If you use nix, you may run the tests in the same environment as I did, using `nix-shell nix/shell2.nix`.