
compile time quadratic in number of routes, with -O2

Open bergey opened this issue 6 years ago • 43 comments

I've been digging a bit to understand where time is spent while building my $WORK codebase, and found this interesting behavior. I'm hoping others interested in investigating the root cause will discuss here. I'm open to changes in application code, servant, or GHC, if we can narrow the cause enough.

The small examples I'm using for benchmarking are at https://github.com/bergey/studious-telegram/tree/trunk/servant. I'm timing with bench, invoked as bench 'stack clean && stack build'. I'm running on a GCP VM with nothing else running on it. With stack build --fast (which I believe passes -O0 to GHC), the build time doesn't change dramatically as the number of HTTP routes increases. Without --fast:

| Number of routes | Time (seconds) |
| --- | --- |
| 1 | 4 |
| 8 | 4.8 |
| 16 | 7 |
| 32 | 16.3 |
| 64 | 74 |

The only mitigation I've found so far is testing as much as possible in GHCi or with --fast. This may be related to #870; I have not yet tested with anything except Stackage LTS-11.14, which has GHC-8.2.2 and servant-0.13.0.1.

bergey avatar Jun 21 '18 19:06 bergey

I think we've seen a couple of similar issues. A rather efficient workaround is to split up the API and servers into several modules (which most people tend to do naturally anyway, when web apps become that large), as things suddenly become a lot less overwhelming for GHC.
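For illustration, the split looks roughly like this (a sketch with hypothetical module and route names; each fragment lives in its own file so GHC only specialises a handful of HasServer instances per module):

{-# LANGUAGE DataKinds, TypeOperators #-}
module Users.API where

import Servant

type UsersAPI = "users" :> Get '[JSON] [String]

usersServer :: Server UsersAPI
usersServer = return ["alice", "bob"]

-- Main (or a small top-level API module) then only contains the glue:
--
--   type API = UsersAPI :<|> ItemsAPI
--
--   server :: Server API
--   server = usersServer :<|> itemsServer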

But yes, any information would be useful. If you can try other servant & GHC versions, that would be lovely.

This is, however, definitely annoying, and I would personally be delighted to see any improvement there, whether it comes from servant code that would put less pressure on GHC, or from GHC dealing with this type of code more efficiently. For anyone interested in investigating this, though, the first thing to do, to figure out what's going on and where all that time is spent, would be to run a profiling-enabled GHC over your module.

alpmestan avatar Jun 21 '18 21:06 alpmestan

I tested with a few other versions:

- LTS-9.21: servant-0.11, GHC-8.0.1
- LTS-6.35: servant-0.7.1, GHC-7.10.3

I'd like to try other servant/GHC combinations, but these were the fastest to test.

| Number of routes | LTS-6.35 (s) | LTS-9.21 (s) | LTS-11.14 (s) |
| --- | --- | --- | --- |
| 1 | 4.1 | 5.1 | 4.04 |
| 8 | 4.4 | 5.1 | 4.86 |
| 16 | 4.6 | 5.6 | 7.02 |
| 32 | 6.7 | 6.7 | 16.34 |
| 64 | 13.5 | 11.3 | 74 |
| 128 | 60 | 44.4 | |

So it's been quadratic for some time, but it seems it got substantially worse either in GHC 8.2 or since servant-0.13. I'll take a look at which it is next time I work on this, and also at the profiling-enabled GHC.

My production, non-benchmark code base does indeed split the routes across modules. I tried to measure some difference from that earlier today, but perhaps I didn't have enough routes to see an effect. It's good to hear that splitting is helpful.

bergey avatar Jun 21 '18 22:06 bergey

cabal new-build servant-bench8  6,25s user 0,80s system 100% cpu 6,999 total
cabal new-build servant-bench16  7,23s user 0,88s system 100% cpu 8,089 total
cabal new-build servant-bench32  12,14s user 0,77s system 100% cpu 12,870 total
cabal new-build servant-bench64  43,82s user 1,22s system 99% cpu 45,050 total

Building with -ddump-to-file -ddump-simpl-stats

16 routes:

1770 UnfoldingDone
  287 Servant.Server.Internal.$fHasServerTYPE:<|>context_$choistServerWithContext
  263 $dHasServer_ahv6

32 routes:

5282 UnfoldingDone
  1335 Servant.Server.Internal.$fHasServerTYPE:<|>context_$choistServerWithContext
  783 $dHasServer_ahxq

64 routes:

17682 UnfoldingDone
  5735 Servant.Server.Internal.$fHasServerTYPE:<|>context_$choistServerWithContext
  2591 $dHasServer_ahC4

hoist?

As we inline the HasServer dictionaries, specialised hoistServerWithContext definitions are generated even though we don't use them. GHC is silly here.
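For context, every HasServer instance carries a hoistServerWithContext method next to route, so specialising an instance's dictionary drags both methods along. The class looks roughly like this (paraphrased from Servant.Server.Internal; details vary across versions):

class HasServer api context where
  type ServerT api (m :: * -> *) :: *

  -- used by serve: builds the router for this fragment of the API
  route :: Proxy api
        -> Context context
        -> Delayed env (Server api)
        -> Router env

  -- specialised for every instance, even if the application
  -- never calls hoistServer
  hoistServerWithContext :: Proxy api
                         -> Proxy context
                         -> (forall x. m x -> n x)
                         -> ServerT api m
                         -> ServerT api n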

This reminds me of https://ghc.haskell.org/trac/ghc/ticket/13253, particularly something like: https://ghc.haskell.org/trac/ghc/ticket/13253#comment:22

There's also a patch in https://ghc.haskell.org/trac/ghc/ticket/13253#comment:24, if @alpmestan could try to apply it on top of 8.4.3 (which also has the same behaviour) and see if it helps here?

phadej avatar Jun 21 '18 22:06 phadej

This is still a problem in GHC 8.6.3.

This example takes me 60 sec to build with only 20 routes. Running client against the routes individually reduces that to 37 sec (as we still pass the entire API to serve).

rdnetto avatar Jan 15 '19 07:01 rdnetto

I looked a bit more into this. I generated a few modules that look like:

{-# LANGUAGE DataKinds, TypeOperators, TypeFamilies #-}
module API_2 where
import Data.Proxy
import Servant
type API = "1" :> Get '[JSON] String :<|> "2" :> Get '[JSON] String
api = Proxy :: Proxy API
server :: Server API
server = return "1" :<|> return "2"
app = serve api server

where I'm varying the number of endpoints from one module to another. I have APIs with 1, 2, 3, 4, 5, 10, 20, 30, 40, 50 endpoints. And I run this little script:

#! /usr/bin/env sh

rm -f API_*

for i in 1 2 3 4 5 10 20 30 40 50; do
    runghc gen_api.hs $i
    time ghc -c --make API_$i.hs -o API_$i.o \
	-freduction-depth=0 -O -ddump-timings -v2 \
	-ddump-simpl -ddump-simpl-stats -ddump-to-file \
	-dsuppress-all \
	+RTS -s -RTS |& tee API_$i.log.txt
done
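(The gen_api.hs generator isn't shown above; a minimal version along these lines would produce such modules. This is a hypothetical reconstruction, not the author's actual script.)

import Data.List (intercalate)
import System.Environment (getArgs)

-- gen_api.hs: writes API_<n>.hs with n trivial GET endpoints,
-- mirroring the hand-written modules shown earlier.
main :: IO ()
main = do
  [nStr] <- getArgs
  let names = map show [1 .. read nStr :: Int]
      apiTy = intercalate " :<|> " [show s ++ " :> Get '[JSON] String" | s <- names]
      impl  = intercalate " :<|> " ["return " ++ show s | s <- names]
  writeFile ("API_" ++ nStr ++ ".hs") $ unlines
    [ "{-# LANGUAGE DataKinds, TypeOperators, TypeFamilies #-}"
    , "module API_" ++ nStr ++ " where"
    , "import Data.Proxy"
    , "import Servant"
    , "type API = " ++ apiTy
    , "api = Proxy :: Proxy API"
    , "server :: Server API"
    , "server = " ++ impl
    , "app = serve api server"
    ]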
| # of endpoints | Compile time (s) |
| --- | --- |
| 1 | 2.101 |
| 2 | 2.321 |
| 3 | 3.301 |
| 4 | 3.591 |
| 5 | 4.371 |
| 10 | 7.961 |
| 20 | 28.101 |
| 30 | 58.111 |
| 40 | 98.181 |
| 50 | 139.021 |

I can see some interesting things in the log. For example, if we look at the Core optimisation stats for the 50-endpoint API...

We see (see the bottom of this comment) that parsing/renaming/typechecking/desugaring don't take very long, and that after desugaring, the code is not too big yet:

Result size of Desugar (before optimization)
  = {terms: 942, types: 39,926, coercions: 18,615, joins: 0/161}
Result size of Desugar (after optimization)
  = {terms: 828, types: 39,801, coercions: 18,815, joins: 0/154}

The simplifier runs on that, and the result size doesn't change all that much. But then, boom, we specialise:

Result size of Simplifier
  = {terms: 1,060, types: 29,382, coercions: 38,442, joins: 0/1}
!!! Simplifier [API_50]: finished in 498.86 milliseconds, allocated 46.616 megabytes
*** Specialise [API_50]:
Result size of Specialise
  = {terms: 15,554, types: 810,695, coercions: 85,830, joins: 0/3}
!!! Specialise [API_50]: finished in 333.94 milliseconds, allocated 43.001 megabytes

We float out:

*** Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = False}) [API_50]:
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = False})
  = {terms: 29,526, types: 1,215,461, coercions: 85,830, joins: 0/0}
!!! Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = False}) [API_50]: finished in 1163.50 milliseconds, allocated 133.012 megabytes

and try to simplify:

*** Simplifier [API_50]:
Result size of Simplifier iteration=1
  = {terms: 29,138, types: 1,324,321, coercions: 111,537, joins: 0/0}
Result size of Simplifier iteration=2
  = {terms: 144,992,
     types: 5,290,053,
     coercions: 1,377,705,
     joins: 0/0}
Result size of Simplifier iteration=3
  = {terms: 144,792,
     types: 5,286,303,
     coercions: 1,370,505,
     joins: 0/0}
Result size of Simplifier
  = {terms: 144,792,
     types: 5,286,303,
     coercions: 1,370,505,
     joins: 0/0}
!!! Simplifier [API_50]: finished in 31122.77 milliseconds, allocated 4832.778 megabytes
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
Result size of Simplifier
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
!!! Simplifier [API_50]: finished in 14560.75 milliseconds, allocated 2147.932 megabytes
*** Simplifier [API_50]:
Result size of Simplifier
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
!!! Simplifier [API_50]: finished in 6388.17 milliseconds, allocated 1072.456 megabytes

and from that point onwards, we're working with thousands and thousands of terms, millions of types and coercions. Many passes take whole seconds and gigabytes of allocations, until...

*** Demand analysis [API_50]:
Result size of Demand analysis
  = {terms: 21,880,
     types: 2,231,518,
     coercions: 1,294,470,
     joins: 0/0}
!!! Demand analysis [API_50]: finished in 5560.72 milliseconds, allocated 2786.071 megabytes
*** CoreTidy [API_50]:
Result size of Tidy Core
  = {terms: 3,507, types: 134,928, coercions: 39,021, joins: 0/0}
!!! CoreTidy [API_50]: finished in 1105.27 milliseconds, allocated 349.239 megabytes
Created temporary directory: /run/user/1001/ghc11842_0
*** CorePrep [API_50]:
Result size of CorePrep
  = {terms: 3,732, types: 134,408, coercions: 39,021, joins: 0/102}
!!! CorePrep [API_50]: finished in 39.19 milliseconds, allocated 6.130 megabytes
*** Stg2Stg:
*** CodeGen [API_50]:
!!! CodeGen [API_50]: finished in 760.67 milliseconds, allocated 146.051 megabytes

Core Tidy really deserves its name. Here are the timings & allocs for all the passes:

| Pass | Time (ms) | Alloc (bytes) |
| --- | --- | --- |
| Parser | 7.068 | 2554992 |
| Renamer/typechecker | 731.063 | 109437544 |
| Desugar | 91.337 | 12701224 |
| Simplifier | 498.861 | 48880936 |
| Specialise | 333.945 | 45090256 |
| Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = False}) | 1163.496 | 139472808 |
| Simplifier | 31122.767 | 5067534720 |
| Simplifier | 14560.745 | 2252270040 |
| Simplifier | 6388.172 | 1124551928 |
| Float inwards | 660.780 | 145896080 |
| Called arity analysis | 617.060 | 97280936 |
| Simplifier | 7205.292 | 1124552056 |
| Demand analysis | 7360.820 | 3396416952 |
| Worker Wrapper binds | 623.287 | 46714320 |
| Simplifier | 24290.701 | 3739472320 |
| Exitification transformation | 612.117 | 25169256 |
| Float out(FOS {Lam = Just 0, Consts = True, OverSatApps = True}) | 10644.614 | 1489902464 |
| Common sub-expression | 3329.560 | 682305136 |
| Float inwards | 372.326 | 26302928 |
| Simplifier | 10672.911 | 1871560408 |
| Demand analysis | 5560.725 | 2921407264 |
| CoreTidy | 1105.268 | 366203432 |
| CorePrep | 39.191 | 6427320 |
| CodeGen | 760.668 | 153145192 |

I'm not going to look at the Core for API_50...

$ wc -l API_*.dump-simpl
    944 API_10.dump-simpl
    219 API_1.dump-simpl
   2134 API_20.dump-simpl
    281 API_2.dump-simpl
   3724 API_30.dump-simpl
    349 API_3.dump-simpl
   5714 API_40.dump-simpl
    422 API_4.dump-simpl
   8104 API_50.dump-simpl
    499 API_5.dump-simpl

but let's see what we get for API_2.

==================== Tidy Core ====================

Result size of Tidy Core
  = {terms: 243, types: 840, coercions: 573, joins: 0/0}

-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
$s$fHasServerTYPE:<|>context_$croute2
$s$fHasServerTYPE:<|>context_$croute2 = "2"#

-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
$s$fHasServerTYPE:<|>context_$croute1
$s$fHasServerTYPE:<|>context_$croute1
  = unpackCString# $s$fHasServerTYPE:<|>context_$croute2

-- RHS size: {terms: 2, types: 3, coercions: 0, joins: 0/0}
lvl_r9QC
lvl_r9QC = Right $s$fHasServerTYPE:<|>context_$croute1

-- RHS size: {terms: 4, types: 12, coercions: 0, joins: 0/0}
app6
app6 = \ s_a6U0 -> (# s_a6U0, lvl_r9QC #)

-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
$s$fHasServerTYPE:<|>context_$croute5
$s$fHasServerTYPE:<|>context_$croute5 = "1"#

-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
$s$fHasServerTYPE:<|>context_$croute4
$s$fHasServerTYPE:<|>context_$croute4
  = unpackCString# $s$fHasServerTYPE:<|>context_$croute5

-- RHS size: {terms: 2, types: 3, coercions: 0, joins: 0/0}
lvl1_r9QD
lvl1_r9QD = Right $s$fHasServerTYPE:<|>context_$croute4

-- RHS size: {terms: 4, types: 12, coercions: 0, joins: 0/0}
app7
app7 = \ s_a6U0 -> (# s_a6U0, lvl1_r9QD #)

-- RHS size: {terms: 3, types: 6, coercions: 36, joins: 0/0}
app5
app5 = :<|> (app7 `cast` <Co:18>) (app6 `cast` <Co:18>)

-- RHS size: {terms: 1, types: 0, coercions: 98, joins: 0/0}
server
server = app5 `cast` <Co:98>

-- RHS size: {terms: 2, types: 5, coercions: 98, joins: 0/0}
app_result
app_result = Route (app5 `cast` <Co:98>)

-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
$trModule2
$trModule2 = "API_2"#

-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
$trModule1
$trModule1 = TrNameS $trModule2

-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
$trModule4
$trModule4 = "main"#

-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
$trModule3
$trModule3 = TrNameS $trModule4

-- RHS size: {terms: 3, types: 0, coercions: 0, joins: 0/0}
$trModule
$trModule = Module $trModule3 $trModule1

-- RHS size: {terms: 1, types: 3, coercions: 0, joins: 0/0}
api
api = Proxy

-- RHS size: {terms: 3, types: 1, coercions: 0, joins: 0/0}
lvl2_r9QE
lvl2_r9QE = : $fAcceptTYPEJSON4 $fAcceptTYPEJSON2

-- RHS size: {terms: 2, types: 11, coercions: 0, joins: 0/0}
lvl3_r9QF
lvl3_r9QF = \ _ -> lvl2_r9QE

-- RHS size: {terms: 1, types: 0, coercions: 0, joins: 0/0}
$s$fHasServerTYPE:<|>context_$croute3
$s$fHasServerTYPE:<|>context_$croute3 = 200

-- RHS size: {terms: 12, types: 16, coercions: 0, joins: 0/0}
lvl4_r9QG
lvl4_r9QG
  = \ s1_a8on ->
      case newByteArray# 10# s1_a8on of { (# ipv_a8or, ipv1_a8os #) ->
      $wouter
        ipv1_a8os 4# $s$fHasServerTYPE:<|>context_$croute1 0# ipv_a8or
      }

-- RHS size: {terms: 12, types: 16, coercions: 0, joins: 0/0}
lvl5_r9QH
lvl5_r9QH
  = \ s1_a8on ->
      case newByteArray# 10# s1_a8on of { (# ipv_a8or, ipv1_a8os #) ->
      $wouter
        ipv1_a8os 4# $s$fHasServerTYPE:<|>context_$croute4 0# ipv_a8or
      }

-- RHS size: {terms: 4, types: 2, coercions: 0, joins: 0/0}
lvl7_r9QJ
lvl7_r9QJ = ++_$s++ [] $fAcceptTYPEJSON4 $fAcceptTYPEJSON2

-- RHS size: {terms: 2, types: 11, coercions: 0, joins: 0/0}
$s$fAllCTRender:a_$callMime
$s$fAllCTRender:a_$callMime = \ _ -> lvl7_r9QJ

-- RHS size: {terms: 4, types: 2, coercions: 15, joins: 0/0}
$sencode
$sencode
  = \ eta_a9of ->
      toLazyByteString ((string1 eta_a9of) `cast` <Co:15>)

-- RHS size: {terms: 2, types: 4, coercions: 0, joins: 0/0}
lvl8_r9QK
lvl8_r9QK = \ _ -> $sencode

-- RHS size: {terms: 3, types: 5, coercions: 0, joins: 0/0}
$s$fAllCTRender:a_$s$fMimeRenderTYPEJSONa
$s$fAllCTRender:a_$s$fMimeRenderTYPEJSONa
  = C:MimeRender $fAcceptTYPEJSON lvl8_r9QK

-- RHS size: {terms: 5, types: 16, coercions: 0, joins: 0/0}
$s$fAllMimeRender:a0_$callMimeRender
$s$fAllMimeRender:a0_$callMimeRender
  = \ _ w2_a9nu ->
      $w$callMimeRender1
        $s$fAllCTRender:a_$s$fMimeRenderTYPEJSONa w2_a9nu

-- RHS size: {terms: 3, types: 9, coercions: 9, joins: 0/0}
$s$fAllCTRender:a_$s$fAllMimeRender:a0
$s$fAllCTRender:a_$s$fAllMimeRender:a0
  = C:AllMimeRender
      (lvl3_r9QF `cast` <Co:9>) $s$fAllMimeRender:a0_$callMimeRender

-- RHS size: {terms: 21, types: 30, coercions: 0, joins: 0/0}
$wlvl_r9QL
$wlvl_r9QL
  = \ ww_s9J5 ww1_s9J6 ww2_s9J7 ww3_s9J8 w_s9J2 ->
      case $w$sparseQuality2 ww_s9J5 ww1_s9J6 ww2_s9J7 ww3_s9J8 of {
        Nothing -> Nothing;
        Just x1_a9nm ->
          mapQuality_$smapQuality2
            (map
               $fAllCTRender:a1
               ($w$callMimeRender1
                  $s$fAllCTRender:a_$s$fMimeRenderTYPEJSONa w_s9J2))
            x1_a9nm
      }

-- RHS size: {terms: 12, types: 19, coercions: 1, joins: 0/0}
lvl9_r9QM
lvl9_r9QM
  = \ _ w1_s9J1 w2_s9J2 ->
      case w1_s9J1 `cast` <Co:1> of
      { PS ww1_s9J5 ww2_s9J6 ww3_s9J7 ww4_s9J8 ->
      $wlvl_r9QL ww1_s9J5 ww2_s9J6 ww3_s9J7 ww4_s9J8 w2_s9J2
      }

-- RHS size: {terms: 3, types: 9, coercions: 9, joins: 0/0}
$s$fAllCTRender:a
$s$fAllCTRender:a
  = C:AllCTRender
      ($s$fAllCTRender:a_$callMime `cast` <Co:9>) lvl9_r9QM

-- RHS size: {terms: 6, types: 33, coercions: 11, joins: 0/0}
lvl11_r9QO
lvl11_r9QO
  = \ @ env_a8nX ->
      $w$croute13
        $s$fAllCTRender:a
        ($fReflectMethodStdMethodGET_$creflectMethod `cast` <Co:4>)
        ($s$fHasServerTYPE:<|>context_$croute3 `cast` <Co:7>)
        Proxy

-- RHS size: {terms: 23, types: 113, coercions: 18, joins: 0/0}
$s$fHasServerTYPE:>context4_$croute
$s$fHasServerTYPE:>context4_$croute
  = \ @ env_a8nX w2_a8nY _ w4_a8o0 ->
      case w2_a8nY of { Proxy ->
      StaticRouter
        (case runRW# lvl4_r9QG of { (# ipv_a8ow, ipv1_a8ox #) ->
         case ipv1_a8ox of dt_a8oB { Text ipv2_a8Hq ipv3_a8Hr ipv4_a8Hs ->
         Bin 1# dt_a8oB (lvl11_r9QO (w4_a8o0 `cast` <Co:18>)) Tip Tip
         }
         })
        []
      }

-- RHS size: {terms: 23, types: 113, coercions: 18, joins: 0/0}
$s$fHasServerTYPE:>context4_$croute1
$s$fHasServerTYPE:>context4_$croute1
  = \ @ env_a8nX w2_a8nY _ w4_a8o0 ->
      case w2_a8nY of { Proxy ->
      StaticRouter
        (case runRW# lvl5_r9QH of { (# ipv_a8ow, ipv1_a8ox #) ->
         case ipv1_a8ox of dt_a8oB { Text ipv2_a8Hq ipv3_a8Hr ipv4_a8Hs ->
         Bin 1# dt_a8oB (lvl11_r9QO (w4_a8o0 `cast` <Co:18>)) Tip Tip
         }
         })
        []
      }

-- RHS size: {terms: 7, types: 6, coercions: 0, joins: 0/0}
app4
app4 = \ _ _ _ _ _ _ -> app_result

-- RHS size: {terms: 10, types: 13, coercions: 260, joins: 0/0}
app3
app3
  = Delayed
      (emptyDelayed4 `cast` <Co:34>)
      (emptyDelayed2 `cast` <Co:32>)
      (emptyDelayed2 `cast` <Co:32>)
      (emptyDelayed2 `cast` <Co:32>)
      (emptyDelayed2 `cast` <Co:32>)
      (emptyDelayed2 `cast` <Co:32>)
      (emptyDelayed2 `cast` <Co:32>)
      (emptyDelayed1 `cast` <Co:34>)
      app4

-- RHS size: {terms: 6, types: 31, coercions: 0, joins: 0/0}
app2
app2
  = $w$croute
      $s$fHasServerTYPE:>context4_$croute1
      $s$fHasServerTYPE:>context4_$croute
      Proxy
      $WEmptyContext
      app3

-- RHS size: {terms: 3, types: 1, coercions: 0, joins: 0/0}
app1
app1 = runRouterEnv app2 ()

-- RHS size: {terms: 2, types: 0, coercions: 0, joins: 0/0}
app
app = toApplication app1


------ Local rules for imported ids --------
"SPEC/API_2 $fHasServerTYPE:>context4_$croute @ "2" @ (Verb
                                                        'GET 200 '[JSON] String) @ '[]" [2]
    forall w1_s8Hz w_s8Hy.
      $fHasServerTYPE:>context4_$croute w_s8Hy w1_s8Hz
      = $s$fHasServerTYPE:>context4_$croute
"SPEC/API_2 $fHasServerTYPE:>context4_$croute @ "1" @ (Verb
                                                        'GET 200 '[JSON] String) @ '[]" [2]
    forall w1_s8Hw w_s8Hv.
      $fHasServerTYPE:>context4_$croute w_s8Hv w1_s8Hw
      = $s$fHasServerTYPE:>context4_$croute1
"SPEC/API_2 $fAllCTRender:a_$callMime @ JSON @ '[]" [2]
    forall w1_s9j3 w_s9j2.
      $fAllCTRender:a_$callMime w_s9j2 w1_s9j3
      = $s$fAllCTRender:a_$callMime
"SPEC/API_2 $fAllCTRender:a @ JSON @ '[] @ [Char]"
    forall v2_s90i v1_s90h v_s90g.
      $fAllCTRender:a v_s90g v1_s90h v2_s90i
      = $s$fAllCTRender:a
"SPEC/API_2 $fAllMimeRender:a0_$callMimeRender @ JSON @ [Char]" [2]
    forall w_s9nT.
      $fAllMimeRender:a0_$callMimeRender w_s9nT
      = $s$fAllMimeRender:a0_$callMimeRender
"SPEC/API_2 $fAllMimeRender:a0 @ JSON @ [Char]"
    forall v_s9np.
      $fAllMimeRender:a0 v_s9np
      = $s$fAllCTRender:a_$s$fAllMimeRender:a0
"SPEC/API_2 encode @ [Char]"
    forall $dToJSON_s9om. encode $dToJSON_s9om = $sencode
"SPEC/API_2 $fMimeRenderTYPEJSONa @ [Char]"
    forall v_s9oc.
      $fMimeRenderTYPEJSONa v_s9oc
      = $s$fAllCTRender:a_$s$fMimeRenderTYPEJSONa

I'm not sure GHC does anything particularly wrong here, but if anyone sees something, let me know. Perhaps the servant libraries could give better hints so as to make sure users don't end up with gigantic Core. It'd be interesting to see what kind of Core terms the blowup happens on.


Whole core stats:

*** Parser [API_50]:
!!! Parser [API_50]: finished in 7.07 milliseconds, allocated 2.437 megabytes
*** Renamer/typechecker [API_50]:
!!! Renamer/typechecker [API_50]: finished in 731.06 milliseconds, allocated 104.368 megabytes
*** Desugar [API_50]:
Result size of Desugar (before optimization)
  = {terms: 942, types: 39,926, coercions: 18,615, joins: 0/161}
Result size of Desugar (after optimization)
  = {terms: 828, types: 39,801, coercions: 18,815, joins: 0/154}
!!! Desugar [API_50]: finished in 91.34 milliseconds, allocated 12.113 megabytes
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
  = {terms: 1,352, types: 45,874, coercions: 38,442, joins: 0/1}
Result size of Simplifier iteration=2
  = {terms: 1,060, types: 29,382, coercions: 38,442, joins: 0/1}
Result size of Simplifier
  = {terms: 1,060, types: 29,382, coercions: 38,442, joins: 0/1}
!!! Simplifier [API_50]: finished in 498.86 milliseconds, allocated 46.616 megabytes
*** Specialise [API_50]:
Result size of Specialise
  = {terms: 15,554, types: 810,695, coercions: 85,830, joins: 0/3}
!!! Specialise [API_50]: finished in 333.94 milliseconds, allocated 43.001 megabytes
*** Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = False}) [API_50]:
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = False})
  = {terms: 29,526, types: 1,215,461, coercions: 85,830, joins: 0/0}
!!! Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = False}) [API_50]: finished in 1163.50 milliseconds, allocated 133.012 megabytes
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
  = {terms: 29,138, types: 1,324,321, coercions: 111,537, joins: 0/0}
Result size of Simplifier iteration=2
  = {terms: 144,992,
     types: 5,290,053,
     coercions: 1,377,705,
     joins: 0/0}
Result size of Simplifier iteration=3
  = {terms: 144,792,
     types: 5,286,303,
     coercions: 1,370,505,
     joins: 0/0}
Result size of Simplifier
  = {terms: 144,792,
     types: 5,286,303,
     coercions: 1,370,505,
     joins: 0/0}
!!! Simplifier [API_50]: finished in 31122.77 milliseconds, allocated 4832.778 megabytes
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
Result size of Simplifier
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
!!! Simplifier [API_50]: finished in 14560.75 milliseconds, allocated 2147.932 megabytes
*** Simplifier [API_50]:
Result size of Simplifier
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
!!! Simplifier [API_50]: finished in 6388.17 milliseconds, allocated 1072.456 megabytes
*** Float inwards [API_50]:
Result size of Float inwards
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
!!! Float inwards [API_50]: finished in 660.78 milliseconds, allocated 139.137 megabytes
*** Called arity analysis [API_50]:
Result size of Called arity analysis
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
!!! Called arity analysis [API_50]: finished in 617.06 milliseconds, allocated 92.774 megabytes
*** Simplifier [API_50]:
Result size of Simplifier
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
!!! Simplifier [API_50]: finished in 7205.29 milliseconds, allocated 1072.456 megabytes
*** Demand analysis [API_50]:
Result size of Demand analysis
  = {terms: 144,678,
     types: 5,285,913,
     coercions: 1,370,505,
     joins: 0/0}
!!! Demand analysis [API_50]: finished in 7360.82 milliseconds, allocated 3239.076 megabytes
*** Worker Wrapper binds [API_50]:
Result size of Worker Wrapper binds
  = {terms: 149,098,
     types: 5,686,426,
     coercions: 1,434,172,
     joins: 0/671}
!!! Worker Wrapper binds [API_50]: finished in 623.29 milliseconds, allocated 44.550 megabytes
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
  = {terms: 146,708,
     types: 5,488,159,
     coercions: 1,370,905,
     joins: 0/144}
Result size of Simplifier iteration=2
  = {terms: 146,397,
     types: 5,471,833,
     coercions: 1,370,505,
     joins: 0/0}
Result size of Simplifier
  = {terms: 146,397,
     types: 5,471,833,
     coercions: 1,370,505,
     joins: 0/0}
!!! Simplifier [API_50]: finished in 24290.70 milliseconds, allocated 3566.239 megabytes
*** Exitification transformation [API_50]:
Result size of Exitification transformation
  = {terms: 146,397,
     types: 5,471,833,
     coercions: 1,370,505,
     joins: 0/0}
!!! Exitification transformation [API_50]: finished in 612.12 milliseconds, allocated 24.003 megabytes
*** Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = True}) [API_50]:
Result size of Float out(FOS {Lam = Just 0,
                              Consts = True,
                              OverSatApps = True})
  = {terms: 169,649,
     types: 6,797,454,
     coercions: 1,370,505,
     joins: 0/0}
!!! Float out(FOS {Lam = Just 0,
                   Consts = True,
                   OverSatApps = True}) [API_50]: finished in 10644.61 milliseconds, allocated 1420.882 megabytes
*** Common sub-expression [API_50]:
Result size of Common sub-expression
  = {terms: 43,422,
     types: 3,637,259,
     coercions: 1,276,012,
     joins: 0/0}
!!! Common sub-expression [API_50]: finished in 3329.56 milliseconds, allocated 650.697 megabytes
*** Float inwards [API_50]:
Result size of Float inwards
  = {terms: 43,422,
     types: 3,637,259,
     coercions: 1,276,012,
     joins: 0/0}
!!! Float inwards [API_50]: finished in 372.33 milliseconds, allocated 25.084 megabytes
*** Simplifier [API_50]:
Result size of Simplifier iteration=1
  = {terms: 23,508,
     types: 2,336,132,
     coercions: 1,295,370,
     joins: 0/50}
Result size of Simplifier iteration=2
  = {terms: 21,880,
     types: 2,231,518,
     coercions: 1,294,470,
     joins: 0/0}
Result size of Simplifier
  = {terms: 21,880,
     types: 2,231,518,
     coercions: 1,294,470,
     joins: 0/0}
!!! Simplifier [API_50]: finished in 10672.91 milliseconds, allocated 1784.859 megabytes
*** Demand analysis [API_50]:
Result size of Demand analysis
  = {terms: 21,880,
     types: 2,231,518,
     coercions: 1,294,470,
     joins: 0/0}
!!! Demand analysis [API_50]: finished in 5560.72 milliseconds, allocated 2786.071 megabytes
*** CoreTidy [API_50]:
Result size of Tidy Core
  = {terms: 3,507, types: 134,928, coercions: 39,021, joins: 0/0}
!!! CoreTidy [API_50]: finished in 1105.27 milliseconds, allocated 349.239 megabytes
Created temporary directory: /run/user/1001/ghc11842_0
*** CorePrep [API_50]:
Result size of CorePrep
  = {terms: 3,732, types: 134,408, coercions: 39,021, joins: 0/102}
!!! CorePrep [API_50]: finished in 39.19 milliseconds, allocated 6.130 megabytes
*** Stg2Stg:
*** CodeGen [API_50]:
!!! CodeGen [API_50]: finished in 760.67 milliseconds, allocated 146.051 megabytes

$ ghc --version
The Glorious Glasgow Haskell Compilation System, version 8.6.5
$ ghc-pkg list servant*
/nix/store/j86yn21gz4zly8w31hjd9vd5gw9yd2h1-ghc-8.6.5-with-packages/lib/ghc-8.6.5/package.conf.d
    servant-0.15
    servant-server-0.15

alpmestan avatar Aug 08 '19 15:08 alpmestan

It would be interesting to take a look at this one with the improved GHC profiling story.

domenkozar avatar Apr 16 '20 14:04 domenkozar

This issue also happens with -O1. Using -fno-specialize made my 50-route example compile in <1s vs 10s (a sketch of setting this per-package follows at the end of this comment).

I believe the core issue could be related to the seven-year-old "TypeFamilies painfully slow" issue.

If I'm reading Trac correctly, a fix related to exponential inlining should be in GHC 9.0.1. I haven't tested this since ~it appears servant isn't building with 9.0.1 yet~ I don't have time to build servant from source right now.

~It might be interesting to see if -fomit-interface-pragmas, as suggested in this comment, helps anything.~ (No meaningful change for me.)
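For anyone wanting to try the same mitigation package-wide, the flag can go in the cabal file (a sketch; disabling specialisation trades runtime performance for compile time):

library
  -- skip the Specialise pass that blows up HasServer-heavy modules
  ghc-options: -fno-specialise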

codygman avatar Jul 23 '21 03:07 codygman

I don't think the issue is just type families, they're used quite a bit in servant, but not everywhere. Quite likely the problem is similar to what we see with https://gitlab.haskell.org/ghc/ghc/-/issues/5642 and many other tickets, where we have a bunch of values that have different types, and where each sub-expression that combines those values ends up having its own type too. Simply generating the Core to describe all those things ends up taking quite a bit of time and space, and this all has to be processed by every single step of the optimisation pipeline etc. I suspect -ddump-timings or -dshow-passes would help illustrate this.

alpmestan avatar Jul 23 '21 07:07 alpmestan

> I don't think the issue is just type families, they're used quite a bit in servant, but not everywhere. Quite likely the problem is similar to what we see with https://gitlab.haskell.org/ghc/ghc/-/issues/5642 and many other tickets, where we have a bunch of values that have different types, and where each sub-expression that combines those values ends up having its own type too. Simply generating the Core to describe all those things ends up taking quite a bit of time and space, and this all has to be processed by every single step of the optimisation pipeline etc. I suspect -ddump-timings or -dshow-passes would help illustrate this.

I took your example from above and expanded it to 37 routes (random copy pasting gave an odd number):

{-# LANGUAGE DataKinds, TypeOperators, TypeFamilies #-}
module API_2 where
import Data.Proxy
import Servant
type API = "1" :> Get '[JSON] String :<|> "2" :> Get '[JSON] String
api = Proxy :: Proxy API
server :: Server API
server = return "1" :<|> return "2"
app = serve api server

Compiling this with -O1 took 10s:

ghc -O1 -fforce-recomp 37.hs +RTS -s

Turning off pre-inlining took it from 10s to 1s with command:

ghc -O1 -fforce-recomp -fno-pre-inlining 37.hs +RTS -s -RTS

Before I knew to try -fno-pre-inlining, I ran with -dshow-passes:

ghc -O1 -fforce-recomp -ddump-timings -dshow-passes 37.hs +RTS -s -RTS

Resulting in:

*** Simplifier [API_2]:
  Result size of Simplifier iteration=1
    = {terms: 13,936, types: 974,580, coercions: 58,364, joins: 0/41}
  Result size of Simplifier iteration=2
    = {terms: 50,614,
       types: 1,907,589,
       coercions: 571,691,
       joins: 0/699}
  Result size of Simplifier iteration=3
    = {terms: 50,654,
       types: 1,909,075,
       coercions: 572,239,
       joins: 0/76}
  Result size of Simplifier iteration=4
    = {terms: 50,578,
       types: 1,908,885,
       coercions: 572,239,
       joins: 0/38}
  Result size of Simplifier
    = {terms: 50,578,
       types: 1,908,885,
       coercions: 572,239,
       joins: 0/38}
  !!! Simplifier [API_2]: finished in 2286.12 milliseconds, allocated 2415.010 megabytes
  Simplifier [API_2]: alloc=2532321584 time=2286.117

I think that -fno-pre-inlining would be disabling this initial simplifier pass though:

 *** Simplifier [API_2]:
  Result size of Simplifier iteration=1
    = {terms: 712, types: 24,452, coercions: 22,028, joins: 0/2}
  Result size of Simplifier iteration=2
    = {terms: 640, types: 16,091, coercions: 22,028, joins: 0/1}
  Result size of Simplifier
    = {terms: 640, types: 16,091, coercions: 22,028, joins: 0/1}
  !!! Simplifier [API_2]: finished in 77.03 milliseconds, allocated 39.149 megabytes
  Simplifier [API_2]: alloc=41050960 time=77.035

Here things grow, but definitely don't blow up. Types blow up in the specialization pass right after:

 *** Specialise [API_2]:
  Result size of Specialise
    = {terms: 3,928, types: 374,371, coercions: 39,797, joins: 0/3}
  !!! Specialise [API_2]: finished in 73.37 milliseconds, allocated 108.150 megabytes
  Specialise [API_2]: alloc=113403720 time=73.374

Maybe if I do -dshow-passes with -fno-pre-inlining it will demonstrate this...

Edit: With -fno-pre-inlining there is no blow up in specialization for some reason.

*** Simplifier [API_2]:
Result size of Simplifier iteration=1
  = {terms: 716, types: 24,946, coercions: 22,028, joins: 0/2}
Result size of Simplifier iteration=2
  = {terms: 714, types: 24,934, coercions: 22,028, joins: 0/1}
Result size of Simplifier
  = {terms: 714, types: 24,934, coercions: 22,028, joins: 0/1}
!!! Simplifier [API_2]: finished in 77.12 milliseconds, allocated 41.014 megabytes
Simplifier [API_2]: alloc=43006656 time=77.116
*** Specialise [API_2]:
Result size of Specialise
  = {terms: 960, types: 32,776, coercions: 22,457, joins: 0/3}
!!! Specialise [API_2]: finished in 16.61 milliseconds, allocated 15.842 megabytes

Note that types were at 374,371 without -fno-pre-inlining and here they are 32,776.

I'm guessing that if someone were to look at the Core after pre-inlining and after specialization, they might be able to say which optimization causes the types blowup?
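GHC can dump the Core around exactly those passes, which would make that comparison possible (standard GHC dump flags; 37.hs is the module from above):

# writes 37.dump-simpl and 37.dump-spec next to the source, plus a
# log of individual inlining decisions; -dsuppress-uniques keeps the
# output comparable across runs
ghc -O1 -fforce-recomp -ddump-to-file -ddump-simpl -ddump-spec \
    -ddump-inlinings -dsuppress-uniques 37.hs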

codygman avatar Jul 23 '21 12:07 codygman

GHC 9 takes that example from 10s to 16s, so things are going in the wrong direction...

Interestingly, with 9.0 the blowup seems to happen in the 3rd chunk of "Simplifier" rather than in the inliner. Hopefully GHC 9.2 helps, but servant definitely doesn't build with it right now, and I'm not sure how to make this problem more minimal yet.

codygman avatar Jul 23 '21 15:07 codygman

I think you are right, @alpmestan; I tried replicating the compile slowdown with mini servant and could not.

One thing I noticed the minimal version is missing is content types. Is there a way to isolate and test the compile time of just the content-type pieces?

I'm currently trying to figure out whether that's possible and to understand how the content-type machinery works with the serve function, but that will take some time for me.

I also wonder if GHC is just slow with type-level lists and the operations that seem to be behind a lot of the content-type machinery.

codygman avatar Jul 24 '21 17:07 codygman

> One thing I noticed the minimal version is missing is content types. Is there a way to isolate and test the compile time of just the content-type pieces?

Probably it's not about the content types alone, but about adding them to everything else. However, you can grab the Accept/MimeRender/MimeUnrender classes and define a few dummy content types that you instantiate those classes for, if you want to add that to your testing. That is all there is to content types and the serve function: the HasServer instances for verbs or request bodies end up requiring suitable MimeRender/MimeUnrender instances for all the content types in the list, as you can see e.g. here: https://github.com/haskell-servant/servant/blob/master/servant-server/src/Servant/Server/Internal.hs#L294 (the AllCTRender ctypes a constraint).
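A dummy content type for such an experiment could look like this (a sketch against servant's public content-type classes; PlainDummy is a made-up name):

{-# LANGUAGE FlexibleInstances, MultiParamTypeClasses, OverloadedStrings #-}
module PlainDummy where

import qualified Data.ByteString.Lazy.Char8 as BL8
import Network.HTTP.Media ((//))
import Servant.API (Accept (..), MimeRender (..), MimeUnrender (..))

-- a throwaway content type to grow the ctypes list in benchmarks
data PlainDummy

instance Accept PlainDummy where
  contentType _ = "text" // "dummy"

instance MimeRender PlainDummy String where
  mimeRender _ = BL8.pack

instance MimeUnrender PlainDummy String where
  mimeUnrender _ = Right . BL8.unpack

It can then be used as e.g. Get '[JSON, PlainDummy] String in the benchmark APIs.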

alpmestan avatar Jul 25 '21 18:07 alpmestan

Just ran the same benchmark as @bergey in the first post on my machine (a laptop from around 2014-2015), and then changed all benchmark implementations to use NamedRoutes.

I built everything with lts-18.27, because I was getting panics from GHC 8.2.2, and I ran: bench --before 'stack clean' -L 120 'stack build --ghc-options="-O2"'

It seems to fix the quadratic compile times:

| Number of routes | Time (s), old | Time (s), NamedRoutes |
| --- | --- | --- |
| 1 | 7.5 | 8.1 |
| 8 | 9.0 | 9.2 |
| 16 | 13.2 | 10.6 |
| 32 | 30.7 | 13.4 |
| 64 | 141.2 | 19.3 |

Vlix avatar Mar 13 '22 01:03 Vlix

Here are examples of how the modules look, along with the complete output of the bench:

Example of the anonymous routes
{-# LANGUAGE TypeOperators #-}
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE OverloadedStrings #-}
module Main where

import Data.Semigroup
import Data.Text (Text)
import           Servant hiding (NotSecure)
import           Servant.Server
import           Network.Wai.Handler.Warp   (defaultSettings, runSettings,
                                             setLogger, setPort)

import qualified Data.Text as T

type MainAPI =
    "static" :> "1" :> Get '[JSON] Text
    :<|> "static" :> "2" :> Get '[JSON] Text
    :<|> "static" :> "3" :> Get '[JSON] Text
    :<|> "static" :> "4" :> Get '[JSON] Text
    :<|> "static" :> "5" :> Get '[JSON] Text
    :<|> "static" :> "6" :> Get '[JSON] Text
    :<|> "static" :> "7" :> Get '[JSON] Text
    :<|> "static" :> "8" :> Get '[JSON] Text

mainServer :: Server MainAPI
mainServer = foo :<|> foo :<|> foo :<|> foo
    :<|> foo :<|> foo :<|> foo :<|> foo

foo :: Handler Text  -- signature added; the monomorphism restriction otherwise rejects this binding
foo = return "foo"

main :: IO ()
main = runSettings defaultSettings $
    serveWithContext (Proxy :: Proxy MainAPI) EmptyContext mainServer
Benchmark output of anonymous routes
/flat-1

benchmarking stack build --ghc-options="-O2"
time                 7.427 s    (7.102 s .. 7.745 s)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 7.458 s    (7.395 s .. 7.551 s)
std dev              98.88 ms   (29.73 ms .. 142.7 ms)
variance introduced by outliers: 16% (moderately inflated)
/flat-8

benchmarking stack build --ghc-options="-O2"
time                 9.024 s    (8.667 s .. 9.838 s)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 9.027 s    (8.806 s .. 9.188 s)
std dev              226.5 ms   (119.0 ms .. 315.0 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-16

benchmarking stack build --ghc-options="-O2"
time                 13.36 s    (12.86 s .. 14.10 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 13.21 s    (12.81 s .. 13.36 s)
std dev              284.8 ms   (53.22 ms .. 372.0 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-32

benchmarking stack build --ghc-options="-O2"
time                 30.73 s    (30.19 s .. 30.98 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 30.72 s    (30.62 s .. 30.80 s)
std dev              108.2 ms   (48.80 ms .. 142.1 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-64

benchmarking stack build --ghc-options="-O2"
time                 144.4 s    (134.1 s .. 159.8 s)
                     0.998 R²   (0.998 R² .. 1.000 R²)
mean                 141.2 s    (138.7 s .. 143.6 s)
std dev              2.836 s    (2.113 s .. 3.384 s)
variance introduced by outliers: 19% (moderately inflated)
Example of NamedRoutes
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE OverloadedStrings #-}
{-# LANGUAGE TypeOperators #-}
module Main where

import Data.Semigroup
import Data.Text (Text)
import GHC.Generics (Generic)
import           Servant hiding (NotSecure)
import           Servant.API.Generic
import           Servant.Server
import           Network.Wai.Handler.Warp   (defaultSettings, runSettings,
                                             setLogger, setPort)

import qualified Data.Text as T

type MainAPI = NamedRoutes API

data API mode = API
    { static1 :: mode :- "static" :> "1" :> Get '[JSON] Text
    , static2 :: mode :- "static" :> "2" :> Get '[JSON] Text
    , static3 :: mode :- "static" :> "3" :> Get '[JSON] Text
    , static4 :: mode :- "static" :> "4" :> Get '[JSON] Text
    , static5 :: mode :- "static" :> "5" :> Get '[JSON] Text
    , static6 :: mode :- "static" :> "6" :> Get '[JSON] Text
    , static7 :: mode :- "static" :> "7" :> Get '[JSON] Text
    , static8 :: mode :- "static" :> "8" :> Get '[JSON] Text
    } deriving Generic

mainServer :: Server MainAPI
mainServer = API
    { static1 = foo
    , static2 = foo
    , static3 = foo
    , static4 = foo
    , static5 = foo
    , static6 = foo
    , static7 = foo
    , static8 = foo
    }

foo :: Handler Text  -- signature added; the monomorphism restriction otherwise rejects this binding
foo = return "foo"

main :: IO ()
main = runSettings defaultSettings $
    serveWithContext (Proxy :: Proxy MainAPI) EmptyContext mainServer

Benchmark output of NamedRoutes
/flat-1

benchmarking stack build --ghc-options="-O2"
time                 7.882 s    (7.136 s .. 8.371 s)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 8.137 s    (7.986 s .. 8.224 s)
std dev              144.6 ms   (40.75 ms .. 192.8 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-8

benchmarking stack build --ghc-options="-O2"
time                 9.142 s    (8.178 s .. 9.885 s)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 9.275 s    (9.155 s .. 9.473 s)
std dev              194.6 ms   (66.72 ms .. 264.1 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-16

benchmarking stack build --ghc-options="-O2"
time                 10.66 s    (9.664 s .. 11.77 s)
                     0.999 R²   (0.995 R² .. 1.000 R²)
mean                 10.56 s    (10.40 s .. 10.76 s)
std dev              195.4 ms   (84.90 ms .. 271.9 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-32

benchmarking stack build --ghc-options="-O2"
time                 13.59 s    (12.54 s .. 14.52 s)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 13.35 s    (13.13 s .. 13.53 s)
std dev              248.5 ms   (82.63 ms .. 338.9 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-64

benchmarking stack build --ghc-options="-O2"
time                 19.27 s    (18.93 s .. 19.82 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 19.33 s    (19.22 s .. 19.40 s)
std dev              114.1 ms   (66.46 ms .. 160.4 ms)
variance introduced by outliers: 19% (moderately inflated)

If anyone else wants to bench the NamedRoutes: here's my fork. You can bench the anonymous routes by checking out this commit: de54845850449acd222cd98aa70fa7003a1a60b3

Vlix avatar Mar 13 '22 16:03 Vlix

It is extremely surprising to me that NamedRoutes would bring performance gains on flat APIs (some gains on nested routes would have been less surprising IMHO).

I am guessing NamedRoutes gets in the way of inlining, which somehow is a good thing here?

gdeest avatar Mar 13 '22 16:03 gdeest

Is it maybe because the flat API isn't actually flat with the anonymous routes, but nested with every extra :<|>?

Route1 :<|> (Route2 :<|> (Route3 :<|> ...))

Vlix avatar Mar 13 '22 16:03 Vlix

The HasServer instance for NamedRoutes still relies on :<|> under the hood, see here:

https://hackage.haskell.org/package/servant-server-0.19/docs/src/Servant.Server.Internal.html#line-929

We piggy-back on the instance for ToServantApi api, which is a vanilla servant API.
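Concretely, both spellings describe the same routes at the type level (a sketch; API is the record from the NamedRoutes example earlier in the thread):

import Servant.API.Generic (ToServantApi)

-- NamedRoutes API routes through the record's generic representation,
-- so this reduces to an :<|> combination of the same eight endpoints
-- (the exact nesting follows the GHC.Generics product shape).
type FlatAPI = ToServantApi API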

gdeest avatar Mar 13 '22 16:03 gdeest

> I am guessing NamedRoutes gets in the way of inlining, which somehow is a good thing here?

2 ideas then:

  • generate the Core for each version and diff them, or analyze them like done here
  • run each of the 64-route benchmarks with -dshow-passes and find the largest differences

Maybe there's a better way to say "print all fired optimizations" in a way you could diff, though. A rough sketch of the diffing approach is below.
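A rough way to do the first idea (a sketch; assumes the 64-route module from earlier comments, built once against each servant version):

# dump simplified Core with uniques suppressed so the two dumps are
# comparable, then diff them
ghc -O2 -fforce-recomp -ddump-simpl -dsuppress-all -dsuppress-uniques \
    -ddump-to-file API_64.hs
cp API_64.dump-simpl old.core    # then switch servant versions and rebuild
diff old.core API_64.dump-simpl | less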

ParetoOptimalDev avatar Mar 14 '22 17:03 ParetoOptimalDev

For what it's worth, I was not able to reproduce those benchmark results. For me, named routes were slower than the old way of doing things by a fair bit. You can see the entire benchmark script and results here: https://gist.github.com/tfausak/8019733fb5c703994d1665143c60ad0f

Click to expand example "old" module.
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE TypeOperators #-}

module M ( main ) where

import qualified Network.Wai as Wai
import qualified Network.Wai.Handler.Warp as Warp
import qualified Servant

main :: IO ()
main = Warp.run 3000 application

application :: Wai.Application
application = Servant.serve proxy server

proxy :: Servant.Proxy Api
proxy = Servant.Proxy

server :: Servant.Server Api
server
    = pure 1
    Servant.:<|> pure 2

type Api
    = "1" Servant.:> Servant.Get '[Servant.JSON] Int
    Servant.:<|> "2" Servant.:> Servant.Get '[Servant.JSON] Int
Click to expand example "new" module.
{-# LANGUAGE DataKinds #-}
{-# LANGUAGE DeriveGeneric #-}
{-# LANGUAGE TypeOperators #-}

module M ( main ) where

import qualified Network.Wai as Wai
import qualified Network.Wai.Handler.Warp as Warp
import qualified Servant
import qualified Servant.API.Generic as Servant
import qualified Servant.Server.Generic as Servant
import qualified GHC.Generics as Generics

main :: IO ()
main = Warp.run 3000 application

application :: Wai.Application
application = Servant.genericServe routes

routes :: Routes Servant.AsServer
routes = Routes
    { _1 = pure 1
    , _2 = pure 2
    }

data Routes routes = Routes
    { _1 :: routes Servant.:- "1" Servant.:> Servant.Get '[Servant.JSON] Int
    , _2 :: routes Servant.:- "2" Servant.:> Servant.Get '[Servant.JSON] Int
    } deriving (Generics.Generic)

With 64 routes, the old module compiles in 4.4 seconds compared to 16.2 seconds for the new module.

I know that benchmarking is hard, so perhaps I've done something wrong here. Or maybe my server definitions are subtly different than @Vlix's in some way. But I wasn't able to get any module with named routes compiling faster than an equivalent module with plain old (:<|>).

tfausak avatar Mar 14 '22 20:03 tfausak

@tfausak The only obvious difference between your two benchmarks is that in @Vlix's, all handlers are the same (foo). But does it actually make a difference?

gdeest avatar Mar 14 '22 21:03 gdeest

Also, I notice that compilation times for the :<|> routes do not seem quadratic at all in the number of routes in your benchmarks.

gdeest avatar Mar 14 '22 21:03 gdeest

@tfausak I'm not completely sure, but it looks like you're building with -O1, whereas we're all talking about compiling with -O2 (at least that's what the title of this issue states)

EDIT: Their try gave me an idea, though, brb

EDIT2: Ok, it's not done benching, but I think I'm already seeing what happened. I didn't build the "old timing" with servant-0.19 (just servant-0.18.3, because that's what's in lts-18.27), since it didn't need it (you don't need NamedRoutes to bench the old stuff). And I was fine with the results, because they confirmed bergey's initial issue. Then I tried with NamedRoutes and servant-0.19 and attributed the improvement to NamedRoutes. I then tried the "old timing" benchmark with servant-(server-)0.19 in the extra-deps, and yeah, the quadratic building times disappear O_o So SOMETHING happened between servant-(server-)0.18.3 and servant-(server-)0.19!

tl;dr MY BAD! It's not NamedRoutes! It's the latest release of servant-(server-)0.19!

Vlix avatar Mar 14 '22 23:03 Vlix

Ok, rebuilt again with just lts-18.27 and then with extra-deps: [servant-0.19, servant-server-0.19] (added the NamedRoutes times again just for reference, did not rebench those)

So yeah... if anyone knows the diff between 0.18.3 and 0.19, please explain :sweat_smile: (it has to be in either servant-0.19 or servant-server-0.19, because those are the only packages that differ between these two benches)

| Number of routes | Anonymous routes, 0.18.3 (s) | Anonymous routes, 0.19 (s) | NamedRoutes, 0.19 (s, for reference) |
| --- | --- | --- | --- |
| 1 | 7.6 | 7.6 | 8.1 |
| 8 | 9.2 | 8.3 | 9.2 |
| 16 | 13.3 | 8.8 | 10.6 |
| 32 | 30.9 | 10.6 | 13.4 |
| 64 | 137.8 | 15.1 | 19.3 |
Regular (flat) anonymous routes with servant(-server)-0.18.3
/flat-1

benchmarking stack build --ghc-options="-O2"
time                 7.568 s    (7.233 s .. 7.800 s)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 7.603 s    (7.491 s .. 7.649 s)
std dev              88.08 ms   (16.25 ms .. 120.2 ms)
variance introduced by outliers: 16% (moderately inflated)
/flat-8

benchmarking stack build --ghc-options="-O2"
time                 8.852 s    (8.185 s .. 9.748 s)
                     0.999 R²   (0.996 R² .. NaN R²)
mean                 9.192 s    (9.027 s .. 9.422 s)
std dev              219.0 ms   (86.83 ms .. 280.4 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-16

benchmarking stack build --ghc-options="-O2"
time                 13.17 s    (12.31 s .. 13.69 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 13.26 s    (13.17 s .. 13.35 s)
std dev              117.4 ms   (38.85 ms .. 160.3 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-32

benchmarking stack build --ghc-options="-O2"
time                 30.62 s    (30.06 s .. 31.17 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 30.87 s    (30.73 s .. 30.98 s)
std dev              162.3 ms   (107.5 ms .. 193.3 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-64

benchmarking stack build --ghc-options="-O2"
time                 140.1 s    (131.9 s .. 155.9 s)
                     0.999 R²   (0.997 R² .. 1.000 R²)
mean                 137.8 s    (135.7 s .. 139.5 s)
std dev              2.217 s    (959.9 ms .. 2.979 s)
variance introduced by outliers: 19% (moderately inflated)
Regular (flat) anonymous routes with servant(-server)-0.19
/flat-1

benchmarking stack build --ghc-options="-O2"
time                 7.424 s    (7.307 s .. 7.547 s)
                     1.000 R²   (0.998 R² .. 1.000 R²)
mean                 7.630 s    (7.522 s .. 7.754 s)
std dev              140.9 ms   (104.1 ms .. 175.0 ms)
variance introduced by outliers: 16% (moderately inflated)
/flat-8

benchmarking stack build --ghc-options="-O2"
time                 8.232 s    (7.737 s .. 9.237 s)
                     0.998 R²   (0.997 R² .. 1.000 R²)
mean                 8.264 s    (8.000 s .. 8.464 s)
std dev              272.4 ms   (141.9 ms .. 375.2 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-16

benchmarking stack build --ghc-options="-O2"
time                 8.760 s    (7.966 s .. 9.571 s)
                     0.999 R²   (0.996 R² .. 1.000 R²)
mean                 8.797 s    (8.600 s .. 8.876 s)
std dev              136.9 ms   (40.51 ms .. 183.0 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-32

benchmarking stack build --ghc-options="-O2"
time                 10.65 s    (9.727 s .. 11.22 s)
                     0.999 R²   (0.998 R² .. 1.000 R²)
mean                 10.59 s    (10.34 s .. 10.73 s)
std dev              240.8 ms   (119.2 ms .. 306.2 ms)
variance introduced by outliers: 19% (moderately inflated)
/flat-64

benchmarking stack build --ghc-options="-O2"
time                 14.81 s    (14.36 s .. 15.46 s)
                     1.000 R²   (0.999 R² .. 1.000 R²)
mean                 15.11 s    (14.93 s .. 15.40 s)
std dev              269.4 ms   (1.918 ms .. 330.8 ms)
variance introduced by outliers: 19% (moderately inflated)

Vlix avatar Mar 15 '22 00:03 Vlix

To narrow things down if there are a lot of commits, you could run git bisect with a script that builds the 32-route example and fails if it takes more than 20 seconds.
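A sketch of such a bisect script (hypothetical file and target names, assuming a flat-32 benchmark package like the ones above):

#!/usr/bin/env sh
# slow-build.sh: timeout exits 124 when the limit is hit, which
# `git bisect run` treats as a bad commit
rm -rf dist-newstyle/build   # force a full rebuild of the benchmark
timeout 20 cabal build flat-32

Running git bisect start <bad-commit> <good-commit> followed by git bisect run ./slow-build.sh then walks the history automatically.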

codygman avatar Mar 15 '22 00:03 codygman

> tl;dr MY BAD! It's not NamedRoutes! It's the latest release of servant-(server-)0.19!

The plot thickens :-) Off the top of my head, I can't think of any change in 0.19 that could be responsible for such improvements. I am going to try bisecting on my end.

gdeest avatar Mar 15 '22 10:03 gdeest

@Vlix I am not able to reproduce your results either. Compilation times for a module of 64 anonymous routes seem pretty much the same on HEAD and on 0.18.3:

de923fc887f9e0197c8795fcb0e6967b196e0e17 (HEAD):

benchmarking cabal build flat-64
time                 5.153 s    (5.121 s .. NaN s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.137 s    (5.131 s .. 5.147 s)
std dev              9.297 ms   (417.9 μs .. 11.61 ms)
variance introduced by outliers: 19% (moderately inflated)

e2b897d3c0655a84551532944f9cffaa9daa8168 (Prepare servant-0.18.3 release):

benchmarking cabal build flat-64
time                 5.047 s    (5.027 s .. 5.076 s)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.036 s    (5.032 s .. 5.041 s)
std dev              6.011 ms   (1.940 ms .. 8.131 ms)
variance introduced by outliers: 19% (moderately inflated)

This is using GHC 8.10.4 though.

gdeest avatar Mar 15 '22 11:03 gdeest

I am able to reproduce the performance difference between Servant 0.18.3 and Servant 0.19. I'm using the same setup as before, which is with Stackage LTS 18.28, which uses GHC 8.10.7.

| Number of routes | Servant 0.18 build time (s) | Servant 0.19 build time (s) |
| --- | --- | --- |
| 8 | 1.09 | 0.92 |
| 16 | 2.73 | 1.20 |
| 32 | 10.83 | 1.90 |
| 64 | 63.87 | 4.44 |

Comparing the output of stack ls dependencies, this is all that shows up in the diff:

46a47
> constraints 0.13.3
114,115c115,116
< servant 0.18.3
< servant-server 0.18.3
---
> servant 0.19
> servant-server 0.19
142a144
> type-equality 1

tfausak avatar Mar 15 '22 12:03 tfausak

Just tried with GHC 8.10.7 in a slightly modified servant repository, and I still can't reproduce the huge gains witnessed by some, so obviously I am not able to git bisect. If anything, those gains seem quite brittle, but it would be good to get some understanding of what is going on here.

@ParetoOptimalDev It looks like you intended to go down that route; did you by any chance figure it out?

gdeest avatar Mar 21 '22 14:03 gdeest

> Just tried with GHC 8.10.7 in a slightly modified servant repository, and I still can't reproduce the huge gains witnessed by some, so obviously I am not able to git bisect. If anything, those gains seem quite brittle, but it would be good to get some understanding of what is going on here.
>
> @ParetoOptimalDev It looks like you intended to go down that route; did you by any chance figure it out?

Are you using a Stack LTS? Or are you using cabal? :thinking: (Could you maybe literally share the commands you run to do the benchmarking?)

Vlix avatar Mar 21 '22 15:03 Vlix

> Are you using a Stack LTS? Or are you using cabal?

This is using cabal, indeed, and not stack. We're gonna try and see what happens with stack.

I am benching with something along these lines:

cabal build --only-dependencies flat-64
bench -o results/(git rev-parse HEAD).html  --before 'rm -rf ./dist-newstyle/build/x86_64-linux/ghc-8.10.7/flat64-0.1.0.0' 'cabal build flat-64'

where flat-64 is as in the benchmarks linked above (but with -O2 forcefully added to ghc-options in the Cabal file).

My changes are in this commit: https://github.com/haskell-servant/servant/commit/142681c9173ff72d2b1060025ceadeebba1aa8e8

If you use nix, you can run the tests in the same environment as I do, using nix-shell nix/shell2.nix.


gdeest avatar Mar 21 '22 15:03 gdeest