Idris2
Idris2 copied to clipboard
JS backend: Performance degradation when using monad stack
Ohad asked me at ICFP to try to come up with a tiny example and post this here. The larger context is described at https://unsafeperform.io/blog/2022-07-02-a_small_benchmark_for_functional_languages_targeting_web_browsers/
So anyway, the following Idris 2 program defines perfTest1 and perfTest2 the exact same way, except the first one uses a record of methods whereas the second uses a monad transformer stack:
module Main
import Data.IORef
import Control.Monad.Reader
import System.Clock
untilIO : acc -> (acc -> IO (Either acc a)) -> IO a
untilIO acc0 step = fromPrim $ go acc0
where
go : acc -> PrimIO a
go acc w =
let MkIORes (Left acc') w' = toPrim (step acc) w
| MkIORes (Right res) w' => MkIORes res w'
in go acc' w'
record Methods where
constructor MkMethods
doRead : IO Int
doWrite : Int -> IO ()
step1 : Methods => IO Int
step1 = do
x <- doRead %search
doWrite %search $ x + 1
pure x
perfTest1 : Nat -> IO Int
perfTest1 n = do
m <- do
r <- newIORef 0
pure $ MkMethods (readIORef r) (writeIORef r)
untilIO n $ \n => do
x <- step1
case n of
S n' => pure $ Left n'
Z => pure $ Right x
interface Monad m => Interface m where
doRead : m Int
doWrite : Int -> m ()
Interface (ReaderT (IORef Int) IO) where
doRead = do
ref <- ask
readIORef ref
doWrite x = do
ref <- ask
writeIORef ref x
step2 : Interface m => m Int
step2 = do
x <- doRead
doWrite $ x + 1
pure x
perfTest2 : Nat -> IO Int
perfTest2 n = do
r <- newIORef (the Int 0)
untilIO n $ \n => runReaderT r $ the (ReaderT (IORef Int) IO (Either Nat Int)) $ do
x <- step2
case n of
S n' => pure $ Left n'
Z => pure $ Right x
main : IO ()
main = do
print =<< perfTest1 10000000
-- print =<< perfTest2 10000000
Compiling with the JS backend and running it with Node.js, we can try uncommenting either one of the two print lines in main. The difference is quite large:
perfTest1:
real 0m1.137s
user 0m1.268s
sys 0m0.050s
perfTest2:
real 0m6.811s
user 0m7.007s
sys 0m0.067s
This is completely unscientific, we don't account for node startup time, or JIT warmup, or anything, but still we can at least qualitatively see the difference.
First: This is unrelated to the JS backends. The same performance drop can be observed on the default backend. Second: Do we really expect Idris to optimize away arbitrary transformer stacks? If yes, is there any literature about how to do that? Can Haskell do this?
Anyway: Here are three smaller examples I posted recently on discord:
First: Using ST, which is a wrapper around IO, with its own Monad implementation:
module ST
import Data.IORef
import Control.Monad.ST
iters : Nat
iters = 100000000
sumST : Nat -> STRef s Nat -> ST s Nat
sumST 0 ref = readSTRef ref
sumST (S k) ref = do
sum <- readSTRef ref
writeSTRef ref (sum + S k)
sumST k ref
stApp : Nat -> ST s Nat
stApp n = do
ref <- newSTRef 0
sumST n ref
main : IO ()
main = printLn (runST $ stApp iters)
Second: Just IO:
module ST
import Data.IORef
import Control.Monad.ST
sumIO : Nat -> IORef Nat -> IO Nat
sumIO 0 ref = readIORef ref
sumIO (S k) ref = do
sum <- readIORef ref
_ <- writeIORef ref (sum + S k)
sumIO k ref
iters : Nat
iters = 100000000
main : IO ()
main = do
ref <- newIORef Z
s <- sumIO iters ref
printLn s
And finally, a tail-recursive version with PrimIO:
module ST
import Data.IORef
import Control.Monad.ST
iters : Nat
iters = 100000000
-- Implemented externally
-- e.g., in Scheme, passed around as a box
data Mut : Type -> Type where [external]
%extern prim__newIORef : forall a . a -> (1 x : %World) -> IORes (Mut a)
%extern prim__readIORef : forall a . Mut a -> (1 x : %World) -> IORes a
%extern prim__writeIORef : forall a . Mut a -> (1 val : a) -> (1 x : %World) -> IORes ()
sumPrim : Nat -> Mut Nat -> PrimIO Nat
sumPrim 0 mut w = prim__readIORef mut w
sumPrim (S k) mut w =
let MkIORes sum w2 := prim__readIORef mut w
MkIORes () w3 := prim__writeIORef mut (sum + S k) w2
in sumPrim k mut w3
primApp : Nat -> IO Nat
primApp n = do
mut <- fromPrim (prim__newIORef Z)
fromPrim (sumPrim n mut)
main : IO ()
main = primApp iters >>= printLn
The first two take between 1.7 and 2.0 seconds to run on my machine. The last runs in 0.3 seconds. This shows that even the IO variant suffers from a considerable overhead.
Can Haskell do this?
If I understand the context correctly, i.e. when we are taking about concrete stacks, Haskell does optimize them away through aggressive inlining, as it was shown in this great talk.
First: This is unrelated to the JS backends. The same performance drop can be observed on the default backend. Second: Do we really expect Idris to optimize away arbitrary transformer stacks? If yes, is there any literature about how to do that? Can Haskell do this?
I don't think Haskell specifies any performance-related behaviour for typeclasses. The GHC implementation, however, absolutely can -- the keyword to look up is specialization. Basically, although step2 is polymorphic in the choice of m, its call site in perfTest2 is monomorphic at ReaderT (IORef Int) IO, so with optimizations turned on, GHC wil generate a specialized verison of step2 :: ReaderT (IORef Int) IO, and then further optimizations inline enough of the Interface (ReaderT (IORef Int) IO) methods that everything proceeds as if we didn't use any typeclass-polymorphic functions at all.