QuartoNotebookRunner.jl Seeded RNGs only work reproducibly within a cell

Description

Calling Random.seed! at the top of a notebook doesn't ensure reproducibility of random number generation in other cells; it only works within the same cell that seed! is called.

I'm not sure if this is intentional, but it's slightly counterintuitive behaviour so if it's intentional it might be good to document it somewhere.

A related issue (and perhaps an example of how this behaviour is confusing!) is that the Pumas tutorial on reproducibility says that the second batch of samples in Random.seed!(123); rand(3); rand(3) should always result in "the same behavior [...] every time we run the code". This is true if done in a REPL, but the Quarto source has the second rand(3) in a different cell and consequently the second rand(3) will not always return the same numbers.

MWE

The following notebook works to demonstrate the behaviour. The first call to randn() should always give the same result no matter how many times the notebook is rendered. The second call however will not.

---
engine: julia
---

```{julia}
versioninfo()
```

```{julia}
using Random
Random.seed!(42)
randn()
```

```{julia}
randn()
```

My `versioninfo` from the notebook above

Julia Version 1.10.5
Commit 6f3fdf7b362 (2024-08-27 14:19 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 10 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Environment:
  JULIA_LOAD_PATH = @:@stdlib

Oct 10 '24 23:10 penelopeysm

I think this might be an upstream issue but I'm not too familiar with Malt so maybe it's intended and/or could be avoided:

julia> using Malt

julia> w = Malt.Worker();

julia> Malt.remote_eval_fetch(w, quote
           using Random
           Random.seed!(42)
           randn()
       end)
0.7883556016042917

julia> Malt.remote_eval_fetch(w, quote
           randn()
       end)
0.8653602653617717

julia> Malt.remote_eval_fetch(w, quote
           using Random
           Random.seed!(42)
           randn()
       end)
0.7883556016042917

julia> Malt.remote_eval_fetch(w, quote
           randn()
       end)
-0.6715764280647957

Oct 11 '24 07:10 devmotion

Will investigate, thanks for the thorough issue description @penelopeysm.

Oct 11 '24 07:10 MichaelHatherly

Maybe this has something to do with task-local RNG state? Could be that Malt evals in different tasks per fetch and that this has an effect on the RNG. If they were doing some random number generation themselves that changed the RNG state, then it should still lead to the same numbers if it was the same n operations between two remote_eval_fetches so that can't be it.

Oct 11 '24 07:10 jkrumbiegel

Hmm, the RNG object id seems to be stable, so it seems to be the same RNG?

julia> Malt.remote_eval_fetch(w, quote
           objectid(Random.default_rng())
       end)
0xffffffff0968f0b4

julia> Malt.remote_eval_fetch(w, quote
           objectid(Random.default_rng())
       end)
0xffffffff0968f0b4

julia> Malt.remote_eval_fetch(w, quote
           objectid(Random.default_rng())
       end)
0xffffffff0968f0b4

julia> Malt.remote_eval_fetch(w, quote
           objectid(Random.default_rng())
       end)
0xffffffff0968f0b4

Oct 11 '24 07:10 devmotion

I think that's because it's TaskLocalRNG and that's a singleton type:

julia> TaskLocalRNG() === TaskLocalRNG()
true

The RNG state for that is stored in the task itself.

Oct 11 '24 08:10 jkrumbiegel

cc @fonsp in case you happen to have any insights.

Oct 11 '24 08:10 MichaelHatherly

This shows that we get different task-local RNG state in a new remote_eval_fetch:

julia> Malt.remote_eval_fetch(w, quote
           using Random
           Random.seed!(123)

           rngfields = filter(fieldnames(Task)) do field
               startswith(string(field), "rng")
           end
           state = NamedTuple(map(rngfields) do field
               field => getfield(Base.current_task(), field)
           end)
       end)
(rngState0 = 0xfefa8d41b8f5dca5, rngState1 = 0xf80cc98e147960c1, rngState2 = 0x20e2ccc17662fc1d, rngState3 = 0xea7a7dcb2e787c01, rngState4 = 0xf4e85a418b9c4f80)

julia> Malt.remote_eval_fetch(w, quote
           rngfields = filter(fieldnames(Task)) do field
               startswith(string(field), "rng")
           end
           state = NamedTuple(map(rngfields) do field
               field => getfield(Base.current_task(), field)
           end)
       end)
(rngState0 = 0x4e291591396795b9, rngState1 = 0xc9a680b08b4c9006, rngState2 = 0x3bb5088d1c600f58, rngState3 = 0xe2bb063e7e800a67, rngState4 = 0xe264ff418d928577)

Probably because each invocation is a different task?

julia> Malt.remote_eval_fetch(w, quote
           objectid(Base.current_task())
       end)
0xb68e01fb526e2337

julia> Malt.remote_eval_fetch(w, quote
           objectid(Base.current_task())
       end)
0x49246d797797f2ad

Oct 11 '24 08:10 jkrumbiegel

If this is hard to fix in Malt, I guess we could store the rngState fields after a task runs a cell, and then first thing in the next cell, mutate that one's rngState fields so they take the same values. Not sure if Malt can be configured to use only a single task.

Oct 11 '24 08:10 jkrumbiegel

Maybe one could take some inspiration from how Test resets the task-local RNG: https://github.com/JuliaLang/julia/blob/1438b1578941d0f1cc8f8c958cf3bd2927fd482c/stdlib/Test/src/Test.jl#L1694-L1720

Oct 11 '24 09:10 devmotion

Interesting, it seems this looked quite different on 1.10 where Random.get_tls_seed() did not exist

https://github.com/JuliaLang/julia/blob/6f3fdf7b36250fb95f512a2b927ad2518c07d2b5/stdlib/Test/src/Test.jl#L1567-L1578

Oct 11 '24 09:10 jkrumbiegel

@Pangoraw knows best about RNGs and threading in Pluto :)

Oct 11 '24 13:10 fonsp

remote_eval_fetch indeed spawns a new task at each call. One solution would be for QuartoNotebookWorker to manage its own persistent task for evaluation and then communicate with this task using channels.

Oct 11 '24 14:10 Pangoraw

The more I thought about this issue, the more I got the feeling that this is an upstream issue in Malt. The behaviour seems completely surprising, in particular given that everything else is consistent between subsequent Malt.remote_eval_fetch calls and there were not tasks involved in the to-be-evaluated expressions sent to the worker (additionally, in my example above the worker used a single thread). Additionally, it seemed the behaviour was not documented, so I assumed possibly it was not even intentional.

But then I tried the same example with Distributed:

julia> using Distributed

julia> addprocs(1)
1-element Vector{Int64}:
 2

julia> remotecall_fetch(eval, 2, quote
           using Random
           Random.seed!(42)
           randn()
       end)
0.7883556016042917

julia> remotecall_fetch(eval, 2, quote
           randn()
       end)
0.08641660475950964

julia> remotecall_fetch(eval, 2, quote
           using Random
           Random.seed!(42)
           randn()
       end)
0.7883556016042917

julia> remotecall_fetch(eval, 2, quote
           randn()
       end)
1.559594774368473

julia> remotecall_fetch(eval, 2, quote
           objectid(Base.current_task())
       end)
0xebc117a52ab2a7be

julia> remotecall_fetch(eval, 2, quote
           objectid(Base.current_task())
       end)
0xe14f2bb195cecba7

So it seems Malt is consistent with Distributed. I couldn't find any issue related to it in the Distributed repo but in the Julia repo someone had opened an issue regarding this behaviour: https://github.com/JuliaLang/julia/issues/51225 It also links to a Pluto issue: https://github.com/fonsp/Pluto.jl/issues/2290

So my take away is that the behaviour of Malt is consistent with Distributed but that this behaviour - both in Distributed and in Malt - is confusing and surprising for users. Based on the discussion in the Julia issue, I assume Distributed won't be changed - but since there are issues from both Pluto and Quarto users, maybe it could be addressed in Malt? Maybe Malt could expose an API for evaluations within a persistent task?

Oct 11 '24 17:10 devmotion

This is an approach for Distributed (maybe it can be done more elegantly or efficiently?):

julia> using Distributed

julia> addprocs(1)
1-element Vector{Int64}:
 2

julia> const in = RemoteChannel(() -> Channel{Expr}(0));

julia> const out = RemoteChannel(() -> Channel{Any}(0));

julia> @everywhere function evalloop(out, in)
           while true
               expr = take!(in)
               put!(out, eval(expr))
           end
       end

julia> remote_do(evalloop, 2, out, in)

julia> put!(in, quote
           using Random
           Random.seed!(42)
           randn()
       end);

julia> take!(out)
0.7883556016042917

julia> put!(in, quote
           randn()
       end);

julia> take!(out)
-0.8798585959543993

julia> put!(in, quote
           using Random
           Random.seed!(42)
           randn()
       end);

julia> take!(out)
0.7883556016042917

julia> put!(in, quote
           randn()
       end);

julia> take!(out)
-0.8798585959543993

What's the best way to do the same with Malt?

I tried the following without success:

julia> using Malt

julia> w = Malt.Worker();

julia> cin = Malt.worker_channel(w, :(cin = Channel{Expr}(0)));

julia> cout = Malt.worker_channel(w, :(cout = Channel{Any}(0)));

julia> Malt.remote_eval(w, quote
           while true
               expr = take!(cin)
               put!(cout, eval(expr))
           end
       end);

julia> put!(cin, quote
           using Random
           Random.seed!(42)
           randn()
       end);
ERROR: Remote exception from Malt.Worker on port 9591 with PID 96591:

MethodError: Cannot `convert` an object of type Float64 to an object of type Expr
The function `convert` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  Expr(::Any...)
   @ Core boot.jl:271
  convert(::Type{T}, !Matched::T) where T
   @ Base Base.jl:126

Stacktrace:
 [1] put!(c::Channel{Expr}, v::Float64)
   @ Base ./channels.jl:357
 [2] top-level scope
   @ REPL[7]:4
 [3] eval
   @ ./boot.jl:430 [inlined]
 [4] (::var"#1#2"{Sockets.TCPSocket, UInt64, Bool, @Kwargs{}, Tuple{Module, Expr}, typeof(Core.eval)})()
   @ Main ~/.julia/packages/Malt/Z3YQq/src/worker.jl:120
Stacktrace:
 [1] unwrap_worker_result(worker::Malt.Worker, result::Malt.WorkerResult)
   @ Malt ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:50
 [2] _wait_for_response(worker::Malt.Worker, msg_id::UInt64)
   @ Malt ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:325
 [3] _send_receive
   @ ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:336 [inlined]
 [4] #remote_call_wait#43
   @ ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:422 [inlined]
 [5] remote_call_wait
   @ ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:421 [inlined]
 [6] remote_eval_wait
   @ ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:491 [inlined]
 [7] put!(rc::Malt.RemoteChannel{Any}, v::Expr)
   @ Malt ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:532
 [8] top-level scope
   @ REPL[7]:1

OK, the Malt example doesn't error anymore when I remove the type parameter of the Channel (seems like a bug?). But then the random numbers in subsequent evaluations are still not reproducible:

julia> using Malt

julia> w = Malt.Worker();

julia> cin = Malt.worker_channel(w, :(cin = Channel(0)));

julia> cout = Malt.worker_channel(w, :(cout = Channel(0)));

julia> put!(cin, quote
           using Random
           Random.seed!(42)
           randn()
       end);

julia> take!(cout)
0.7883556016042917

julia> put!(cin, quote
           randn()
       end)

julia> take!(cout)
0.10690497033454253

julia> put!(cin, quote
           using Random
           Random.seed!(42)
           randn()
       end);

julia> take!(cout)
0.7883556016042917

julia> put!(cin, quote
           randn()
       end)

julia> take!(cout)
-0.047221952638015874

julia> put!(cin, quote
           using Random
           Random.seed!(42)
           randn()
       end);

julia> take!(cout)
0.7883556016042917

julia> put!(cin, quote
           randn()
       end)

julia> take!(cout)
0.37395225126097864

Should this discrepancy with Distributed be considered a bug? Or do I use Malt incorrectly?

Oct 11 '24 18:10 devmotion

I thought it would look something like this:

using Malt

function initialize_special_channels(worker)
    @assert Malt.remote_eval_fetch(worker, quote
        var"__special_channel_out" = Channel()
        var"__special_channel_in" = Channel{Expr}() do chan
            for expr in chan
                result = Core.eval(Main, expr)
                put!(var"__special_channel_out", result)
            end
        end
        true
    end)
end

function remote_eval_fetch_channeled(worker, expr::Expr)
    code = quote
        Threads.@spawn put!(var"__special_channel_in", $(QuoteNode(expr)))
        take!(var"__special_channel_out")
    end
    return Malt.remote_eval_fetch(worker, code)
end

w = Malt.Worker()
initialize_special_channels(w)

I do get consistent random numbers with this technique:

Oct 14 '24 09:10 jkrumbiegel

Should we go with that? The reproducibility issue came up in our triage call today, and we thought we should inform users in the documentation about this problem if it takes more time to fix it. But if we can fix it right away, this might be the better option?

Oct 14 '24 11:10 devmotion

Should we go with that?

Sounds fine to me :+1:

Oct 14 '24 11:10 MichaelHatherly

I'll attempt a PR

Oct 14 '24 11:10 jkrumbiegel

Nice! Could you also make a PR to Malt to add this consistency as a test?

Oct 14 '24 12:10 fonsp

Nice! Could you also make a PR to Malt to add this consistency as a test?

@fonsp Wait I thought I'd implement this logic in QuartoNotebookRunner, did you think it was a good fit for Malt itself?

Oct 14 '24 12:10 jkrumbiegel

No my suggestion is to add a test to Malt for this behaviour, to guarantee that this will continue working in future Malt versions, since you depend on it.

Oct 14 '24 13:10 fonsp

To my understanding I'm not depending on anything new in Malt with this workaround, I'm depending on Julia's channel and task behavior. If you change your implementation some day to use a stable task for all evals then the workaround might not be necessary anymore but it shouldn't break.

Oct 14 '24 14:10 jkrumbiegel

QuartoNotebookRunner.jl QuartoNotebookRunner.jl copied to clipboard

Seeded RNGs only work reproducibly within a cell

Description

MWE

My versioninfo from the notebook above

QuartoNotebookRunner.jl
QuartoNotebookRunner.jl copied to clipboard

My `versioninfo` from the notebook above