QuartoNotebookRunner.jl
Seeded RNGs only work reproducibly within a cell
Description
Calling Random.seed! at the top of a notebook doesn't ensure reproducibility of random number generation in other cells; it only takes effect within the cell in which seed! is called.
I'm not sure if this is intentional, but it's slightly counterintuitive behaviour, so if it is intentional it might be good to document it somewhere.
A related issue (and perhaps an example of how this behaviour is confusing!) is that the Pumas tutorial on reproducibility says that the second batch of samples in Random.seed!(123); rand(3); rand(3) should always result in "the same behavior [...] every time we run the code". This is true if done in a REPL, but the Quarto source has the second rand(3) in a different cell and consequently the second rand(3) will not always return the same numbers.
MWE
The following notebook demonstrates the behaviour. The first call to randn() should always give the same result no matter how many times the notebook is rendered; the second call, however, will not.
---
engine: julia
---
```{julia}
versioninfo()
```
```{julia}
using Random
Random.seed!(42)
randn()
```
```{julia}
randn()
```
My versioninfo from the notebook above:
```
Julia Version 1.10.5
Commit 6f3fdf7b362 (2024-08-27 14:19 UTC)
Build Info:
  Official https://julialang.org/ release
Platform Info:
  OS: macOS (arm64-apple-darwin22.4.0)
  CPU: 10 × Apple M1 Pro
  WORD_SIZE: 64
  LIBM: libopenlibm
  LLVM: libLLVM-15.0.7 (ORCJIT, apple-m1)
Threads: 1 default, 0 interactive, 1 GC (on 8 virtual cores)
Environment:
  JULIA_LOAD_PATH = @:@stdlib
```
I think this might be an upstream issue, but I'm not too familiar with Malt, so maybe it's intended and/or could be avoided:
```julia
julia> using Malt

julia> w = Malt.Worker();

julia> Malt.remote_eval_fetch(w, quote
           using Random
           Random.seed!(42)
           randn()
       end)
0.7883556016042917

julia> Malt.remote_eval_fetch(w, quote
           randn()
       end)
0.8653602653617717

julia> Malt.remote_eval_fetch(w, quote
           using Random
           Random.seed!(42)
           randn()
       end)
0.7883556016042917

julia> Malt.remote_eval_fetch(w, quote
           randn()
       end)
-0.6715764280647957
```
Will investigate, thanks for the thorough issue description @penelopeysm.
Maybe this has something to do with task-local RNG state? It could be that Malt evaluates each fetch in a different task and that this affects the RNG. If Malt itself did some random number generation that changed the RNG state, the same n operations between two remote_eval_fetches should still produce the same numbers, so that can't be it.
Hmm, the objectid of the RNG seems to be stable, so it appears to be the same RNG?
```julia
julia> Malt.remote_eval_fetch(w, quote
           objectid(Random.default_rng())
       end)
0xffffffff0968f0b4

julia> Malt.remote_eval_fetch(w, quote
           objectid(Random.default_rng())
       end)
0xffffffff0968f0b4

julia> Malt.remote_eval_fetch(w, quote
           objectid(Random.default_rng())
       end)
0xffffffff0968f0b4

julia> Malt.remote_eval_fetch(w, quote
           objectid(Random.default_rng())
       end)
0xffffffff0968f0b4
```
I think that's because it's TaskLocalRNG and that's a singleton type:
```julia
julia> TaskLocalRNG() === TaskLocalRNG()
true
```
The RNG state for that is stored in the task itself.
cc @fonsp in case you happen to have any insights.
This shows that we get different task-local RNG state in a new remote_eval_fetch:
```julia
julia> Malt.remote_eval_fetch(w, quote
           using Random
           Random.seed!(123)
           rngfields = filter(fieldnames(Task)) do field
               startswith(string(field), "rng")
           end
           state = NamedTuple(map(rngfields) do field
               field => getfield(Base.current_task(), field)
           end)
       end)
(rngState0 = 0xfefa8d41b8f5dca5, rngState1 = 0xf80cc98e147960c1, rngState2 = 0x20e2ccc17662fc1d, rngState3 = 0xea7a7dcb2e787c01, rngState4 = 0xf4e85a418b9c4f80)

julia> Malt.remote_eval_fetch(w, quote
           rngfields = filter(fieldnames(Task)) do field
               startswith(string(field), "rng")
           end
           state = NamedTuple(map(rngfields) do field
               field => getfield(Base.current_task(), field)
           end)
       end)
(rngState0 = 0x4e291591396795b9, rngState1 = 0xc9a680b08b4c9006, rngState2 = 0x3bb5088d1c600f58, rngState3 = 0xe2bb063e7e800a67, rngState4 = 0xe264ff418d928577)
```
Probably because each invocation is a different task?
```julia
julia> Malt.remote_eval_fetch(w, quote
           objectid(Base.current_task())
       end)
0xb68e01fb526e2337

julia> Malt.remote_eval_fetch(w, quote
           objectid(Base.current_task())
       end)
0x49246d797797f2ad
```
If this is hard to fix in Malt, I guess we could store the rngState fields after a task runs a cell, and then first thing in the next cell, mutate that one's rngState fields so they take the same values. Not sure if Malt can be configured to use only a single task.
Maybe one could take some inspiration from how Test resets the task-local RNG: https://github.com/JuliaLang/julia/blob/1438b1578941d0f1cc8f8c958cf3bd2927fd482c/stdlib/Test/src/Test.jl#L1694-L1720
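For illustration, a rough sketch of that save/restore idea (hypothetical helper names, not code from QuartoNotebookRunner): on recent Julia versions the task-local RNG state can be snapshotted and written back with copy / copy!, which is equivalent to copying the Task's rngState fields by hand.

```julia
using Random

# Hypothetical sketch of the save/restore idea. copy() of the task-local RNG
# returns a Xoshiro holding the current task's rngState* values, and copy!
# writes such a snapshot back into whichever task is currently running.
save_rng_state() = copy(Random.default_rng())
restore_rng_state!(saved) = (copy!(Random.default_rng(), saved); nothing)

# Per-cell wrapper: restore the previous cell's RNG state (if any), run the
# cell, then return the result together with the state for the next cell,
# which may be evaluated on a different task.
function run_cell_with_rng(cell_code, saved = nothing)
    saved === nothing || restore_rng_state!(saved)
    result = cell_code()
    return result, save_rng_state()
end
```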
Interesting, it seems this looked quite different on 1.10, where Random.get_tls_seed() did not exist:
https://github.com/JuliaLang/julia/blob/6f3fdf7b36250fb95f512a2b927ad2518c07d2b5/stdlib/Test/src/Test.jl#L1567-L1578
@Pangoraw knows best about RNGs and threading in Pluto :)
remote_eval_fetch indeed spawns a new task at each call. One solution would be for QuartoNotebookWorker to manage its own persistent task for evaluation and then communicate with this task using channels.
The more I thought about this issue, the more I got the feeling that it is an upstream issue in Malt. The behaviour seems completely surprising, in particular given that everything else is consistent between subsequent Malt.remote_eval_fetch calls, that no tasks were involved in the to-be-evaluated expressions sent to the worker, and that (in my example above) the worker used a single thread. It also seemed that the behaviour was not documented, so I assumed it possibly wasn't even intentional.
But then I tried the same example with Distributed:
```julia
julia> using Distributed

julia> addprocs(1)
1-element Vector{Int64}:
 2

julia> remotecall_fetch(eval, 2, quote
           using Random
           Random.seed!(42)
           randn()
       end)
0.7883556016042917

julia> remotecall_fetch(eval, 2, quote
           randn()
       end)
0.08641660475950964

julia> remotecall_fetch(eval, 2, quote
           using Random
           Random.seed!(42)
           randn()
       end)
0.7883556016042917

julia> remotecall_fetch(eval, 2, quote
           randn()
       end)
1.559594774368473

julia> remotecall_fetch(eval, 2, quote
           objectid(Base.current_task())
       end)
0xebc117a52ab2a7be

julia> remotecall_fetch(eval, 2, quote
           objectid(Base.current_task())
       end)
0xe14f2bb195cecba7
```
So it seems Malt is consistent with Distributed. I couldn't find any issue related to it in the Distributed repo but in the Julia repo someone had opened an issue regarding this behaviour: https://github.com/JuliaLang/julia/issues/51225 It also links to a Pluto issue: https://github.com/fonsp/Pluto.jl/issues/2290
So my takeaway is that the behaviour of Malt is consistent with Distributed, but that this behaviour - both in Distributed and in Malt - is confusing and surprising for users. Based on the discussion in the Julia issue, I assume Distributed won't be changed - but since there are issues from both Pluto and Quarto users, maybe it could be addressed in Malt? Maybe Malt could expose an API for evaluations within a persistent task?
This is an approach for Distributed (maybe it can be done more elegantly or efficiently?):
```julia
julia> using Distributed

julia> addprocs(1)
1-element Vector{Int64}:
 2

julia> const in = RemoteChannel(() -> Channel{Expr}(0));

julia> const out = RemoteChannel(() -> Channel{Any}(0));

julia> @everywhere function evalloop(out, in)
           while true
               expr = take!(in)
               put!(out, eval(expr))
           end
       end

julia> remote_do(evalloop, 2, out, in)

julia> put!(in, quote
           using Random
           Random.seed!(42)
           randn()
       end);

julia> take!(out)
0.7883556016042917

julia> put!(in, quote
           randn()
       end);

julia> take!(out)
-0.8798585959543993

julia> put!(in, quote
           using Random
           Random.seed!(42)
           randn()
       end);

julia> take!(out)
0.7883556016042917

julia> put!(in, quote
           randn()
       end);

julia> take!(out)
-0.8798585959543993
```
What's the best way to do the same with Malt?
I tried the following without success:
```julia
julia> using Malt

julia> w = Malt.Worker();

julia> cin = Malt.worker_channel(w, :(cin = Channel{Expr}(0)));

julia> cout = Malt.worker_channel(w, :(cout = Channel{Any}(0)));

julia> Malt.remote_eval(w, quote
           while true
               expr = take!(cin)
               put!(cout, eval(expr))
           end
       end);

julia> put!(cin, quote
           using Random
           Random.seed!(42)
           randn()
       end);
ERROR: Remote exception from Malt.Worker on port 9591 with PID 96591:

MethodError: Cannot `convert` an object of type Float64 to an object of type Expr
The function `convert` exists, but no method is defined for this combination of argument types.

Closest candidates are:
  Expr(::Any...)
   @ Core boot.jl:271
  convert(::Type{T}, !Matched::T) where T
   @ Base Base.jl:126

Stacktrace:
 [1] put!(c::Channel{Expr}, v::Float64)
   @ Base ./channels.jl:357
 [2] top-level scope
   @ REPL[7]:4
 [3] eval
   @ ./boot.jl:430 [inlined]
 [4] (::var"#1#2"{Sockets.TCPSocket, UInt64, Bool, @Kwargs{}, Tuple{Module, Expr}, typeof(Core.eval)})()
   @ Main ~/.julia/packages/Malt/Z3YQq/src/worker.jl:120

Stacktrace:
 [1] unwrap_worker_result(worker::Malt.Worker, result::Malt.WorkerResult)
   @ Malt ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:50
 [2] _wait_for_response(worker::Malt.Worker, msg_id::UInt64)
   @ Malt ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:325
 [3] _send_receive
   @ ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:336 [inlined]
 [4] #remote_call_wait#43
   @ ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:422 [inlined]
 [5] remote_call_wait
   @ ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:421 [inlined]
 [6] remote_eval_wait
   @ ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:491 [inlined]
 [7] put!(rc::Malt.RemoteChannel{Any}, v::Expr)
   @ Malt ~/.julia/packages/Malt/Z3YQq/src/Malt.jl:532
 [8] top-level scope
   @ REPL[7]:1
```
OK, the Malt example doesn't error anymore when I remove the type parameter of the Channel (seems like a bug?). But then the random numbers in subsequent evaluations are still not reproducible:
```julia
julia> using Malt

julia> w = Malt.Worker();

julia> cin = Malt.worker_channel(w, :(cin = Channel(0)));

julia> cout = Malt.worker_channel(w, :(cout = Channel(0)));

julia> Malt.remote_eval(w, quote
           while true
               expr = take!(cin)
               put!(cout, eval(expr))
           end
       end);

julia> put!(cin, quote
           using Random
           Random.seed!(42)
           randn()
       end);

julia> take!(cout)
0.7883556016042917

julia> put!(cin, quote
           randn()
       end)

julia> take!(cout)
0.10690497033454253

julia> put!(cin, quote
           using Random
           Random.seed!(42)
           randn()
       end);

julia> take!(cout)
0.7883556016042917

julia> put!(cin, quote
           randn()
       end)

julia> take!(cout)
-0.047221952638015874

julia> put!(cin, quote
           using Random
           Random.seed!(42)
           randn()
       end);

julia> take!(cout)
0.7883556016042917

julia> put!(cin, quote
           randn()
       end)

julia> take!(cout)
0.37395225126097864
```
Should this discrepancy with Distributed be considered a bug? Or do I use Malt incorrectly?
I thought it would look something like this:
```julia
using Malt

function initialize_special_channels(worker)
    @assert Malt.remote_eval_fetch(worker, quote
        var"__special_channel_out" = Channel()
        var"__special_channel_in" = Channel{Expr}() do chan
            for expr in chan
                result = Core.eval(Main, expr)
                put!(var"__special_channel_out", result)
            end
        end
        true
    end)
end

function remote_eval_fetch_channeled(worker, expr::Expr)
    code = quote
        Threads.@spawn put!(var"__special_channel_in", $(QuoteNode(expr)))
        take!(var"__special_channel_out")
    end
    return Malt.remote_eval_fetch(worker, code)
end

w = Malt.Worker()
initialize_special_channels(w)
```
I do get consistent random numbers with this technique.
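For illustration, usage would look something like this (reusing w and the helpers defined in the snippet above; return values depend only on the seed):

```julia
# Hypothetical usage of the helpers above. Both expressions are evaluated by
# the worker's single channel task, so the second randn() returns the same
# value on every run.
remote_eval_fetch_channeled(w, quote
    using Random
    Random.seed!(42)
    randn()
end)   # 0.7883556016042917, as in the earlier examples

remote_eval_fetch_channeled(w, quote
    randn()
end)   # the same value every time the pair of evaluations is rerun
```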
Should we go with that? The reproducibility issue came up in our triage call today, and we thought we should inform users in the documentation about this problem if it takes more time to fix it. But if we can fix it right away, this might be the better option?
> Should we go with that?
Sounds fine to me :+1:
I'll attempt a PR
Nice! Could you also make a PR to Malt to add this consistency as a test?
> Nice! Could you also make a PR to Malt to add this consistency as a test?
@fonsp Wait, I thought I'd implement this logic in QuartoNotebookRunner; did you think it was a good fit for Malt itself?
No my suggestion is to add a test to Malt for this behaviour, to guarantee that this will continue working in future Malt versions, since you depend on it.
To my understanding I'm not depending on anything new in Malt with this workaround; I'm depending on Julia's channel and task behavior. If you change your implementation some day to use a stable task for all evals, then the workaround might not be necessary anymore, but it shouldn't break.
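For reference, a consistency test along the lines suggested above could look roughly like this (a sketch only, reusing the hypothetical helpers from the earlier comment; not an existing Malt test):

```julia
using Malt, Test

w = Malt.Worker()
initialize_special_channels(w)

# A seeded draw plus a follow-up draw, evaluated through the persistent channel
# task; repeating the pair must reproduce both values exactly.
seeded = quote
    using Random
    Random.seed!(42)
    randn()
end
a1 = remote_eval_fetch_channeled(w, seeded)
b1 = remote_eval_fetch_channeled(w, :(randn()))
a2 = remote_eval_fetch_channeled(w, seeded)
b2 = remote_eval_fetch_channeled(w, :(randn()))

@test a1 == a2
@test b1 == b2
```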