Idris2 icon indicating copy to clipboard operation
Idris2 copied to clipboard

JS backend: Performance degradation when using monad stack

Open gergoerdi opened this issue 3 years ago • 3 comments

Ohad asked me at ICFP to try to come up with a tiny example and post this here. The larger context is described at https://unsafeperform.io/blog/2022-07-02-a_small_benchmark_for_functional_languages_targeting_web_browsers/

So anyway, the following Idris 2 program defines perfTest1 and perfTest2 the exact same way, except the first one uses a record of methods whereas the second uses a monad transformer stack:

module Main

import Data.IORef
import Control.Monad.Reader
import System.Clock

untilIO : acc -> (acc -> IO (Either acc a)) -> IO a
untilIO acc0 step = fromPrim $ go acc0
  where
    go : acc -> PrimIO a
    go acc w =
      let MkIORes (Left acc') w' = toPrim (step acc) w
            | MkIORes (Right res) w' => MkIORes res w'
      in go acc' w'

record Methods where
  constructor MkMethods
  doRead : IO Int
  doWrite : Int -> IO ()

step1 : Methods => IO Int
step1 = do
  x <- doRead %search
  doWrite %search $ x + 1
  pure x

perfTest1 : Nat -> IO Int
perfTest1 n = do
  m <- do
    r <- newIORef 0
    pure $ MkMethods (readIORef r) (writeIORef r)
  untilIO n $ \n => do
    x <- step1
    case n of
      S n' => pure $ Left n'
      Z => pure $ Right x

interface Monad m => Interface m where
  doRead : m Int
  doWrite : Int -> m ()

Interface (ReaderT (IORef Int) IO) where
  doRead = do
    ref <- ask
    readIORef ref
  doWrite x = do
    ref <- ask
    writeIORef ref x

step2 : Interface m => m Int
step2 = do
  x <- doRead
  doWrite $ x + 1
  pure x

perfTest2 : Nat -> IO Int
perfTest2 n = do
  r <- newIORef (the Int 0)
  untilIO n $ \n => runReaderT r $ the (ReaderT (IORef Int) IO (Either Nat Int)) $ do
    x <- step2
    case n of
      S n' => pure $ Left n'
      Z => pure $ Right x

main : IO ()
main = do
  print =<< perfTest1 10000000
  -- print =<< perfTest2 10000000

Compiling with the JS backend and running it with Node.js, we can try uncommenting either one of the two print lines in main. The difference is quite large:

perfTest1:

real	0m1.137s
user	0m1.268s
sys	0m0.050s


perfTest2:

real	0m6.811s
user	0m7.007s
sys	0m0.067s

This is completely unscientific, we don't account for node startup time, or JIT warmup, or anything, but still we can at least qualitatively see the difference.

gergoerdi avatar Sep 20 '22 09:09 gergoerdi

First: This is unrelated to the JS backends. The same performance drop can be observed on the default backend. Second: Do we really expect Idris to optimize away arbitrary transformer stacks? If yes, is there any literature about how to do that? Can Haskell do this?

Anyway: Here are three smaller examples I posted recently on discord:

First: Using ST, which is a wrapper around IO, with its own Monad implementation:

module ST

import Data.IORef
import Control.Monad.ST

iters : Nat
iters = 100000000

sumST : Nat -> STRef s Nat -> ST s Nat
sumST 0     ref = readSTRef ref
sumST (S k) ref = do
  sum <- readSTRef ref
  writeSTRef ref (sum + S k)
  sumST k ref

stApp : Nat -> ST s Nat
stApp n = do
  ref <- newSTRef 0
  sumST n ref

main : IO ()
main = printLn (runST $ stApp iters)

Second: Just IO:

module ST

import Data.IORef
import Control.Monad.ST

sumIO : Nat -> IORef Nat -> IO Nat
sumIO 0     ref = readIORef ref
sumIO (S k) ref = do
  sum <- readIORef ref
  _   <- writeIORef ref (sum + S k)
  sumIO k ref

iters : Nat
iters = 100000000

main : IO ()
main = do
  ref <- newIORef Z
  s   <- sumIO iters ref
  printLn s

And finally, a tail-recursive version with PrimIO:

module ST

import Data.IORef
import Control.Monad.ST

iters : Nat
iters = 100000000

-- Implemented externally
-- e.g., in Scheme, passed around as a box
data Mut : Type -> Type where [external]

%extern prim__newIORef : forall a . a -> (1 x : %World) -> IORes (Mut a)
%extern prim__readIORef : forall a . Mut a -> (1 x : %World) -> IORes a
%extern prim__writeIORef : forall a . Mut a -> (1 val : a) -> (1 x : %World) -> IORes ()

sumPrim : Nat -> Mut Nat -> PrimIO Nat
sumPrim 0     mut w = prim__readIORef mut w
sumPrim (S k) mut w =
  let MkIORes sum w2 := prim__readIORef mut w
      MkIORes ()  w3 := prim__writeIORef mut (sum + S k) w2
   in sumPrim k mut w3

primApp : Nat -> IO Nat
primApp n = do
  mut <- fromPrim (prim__newIORef Z)
  fromPrim (sumPrim n mut)

main : IO ()
main = primApp iters >>= printLn

The first two take between 1.7 and 2.0 seconds to run on my machine. The last runs in 0.3 seconds. This shows that even the IO variant suffers from a considerable overhead.

stefan-hoeck avatar Sep 20 '22 12:09 stefan-hoeck

Can Haskell do this?

If I understand the context correctly, i.e. when we are taking about concrete stacks, Haskell does optimize them away through aggressive inlining, as it was shown in this great talk.

buzden avatar Sep 20 '22 12:09 buzden

First: This is unrelated to the JS backends. The same performance drop can be observed on the default backend. Second: Do we really expect Idris to optimize away arbitrary transformer stacks? If yes, is there any literature about how to do that? Can Haskell do this?

I don't think Haskell specifies any performance-related behaviour for typeclasses. The GHC implementation, however, absolutely can -- the keyword to look up is specialization. Basically, although step2 is polymorphic in the choice of m, its call site in perfTest2 is monomorphic at ReaderT (IORef Int) IO, so with optimizations turned on, GHC wil generate a specialized verison of step2 :: ReaderT (IORef Int) IO, and then further optimizations inline enough of the Interface (ReaderT (IORef Int) IO) methods that everything proceeds as if we didn't use any typeclass-polymorphic functions at all.

gergoerdi avatar Sep 20 '22 13:09 gergoerdi