Pluto.jl icon indicating copy to clipboard operation
Pluto.jl copied to clipboard

Replace Distributed with Malt

Open savq opened this issue 1 year ago β€’ 1 comments

This PR replaces Distributed with Malt.jl to add support for Distributed itself. See #300

  • Remove workspace_use_distributed flag. Pluto will now always use separate processes.
  • Replace calls to remotecall_eval with the appropiate Malt calls.
  • Replace Distributed with Malt in tests to remove it from dependencies.
    • Add unmake_workspace calls in MacroAnalysis to prevent creating too many processes at the same time.
  • Replace Distributed exceptions with Malt exceptions.
  • Replace pid::Integer with worker::Worker in WorkspaceManager.
  • Rename workspaces to active_workspaces.
  • Rename distributed_exception_result to workspace_exception_result.
  • Rename make_distributed_serializable to make_serializable.

Works OK

  • Evaluating code with Malt works.
  • Throwing exceptions in the notebook works.

Works with caveats

  • Distributed works, but it requires the user to run Pluto with the capture_stdout flag set to false.
    • Why: Distributed workers print their port to stdout. If the stdout is captured, addprocs will hang forever.
    • long-term solution: Use tee-like function to duplicate stdout.
  • Notebook termination works… better? You don't have to spam ctrl-C, but there's still a no exception handler available. error.
    • Why: The problem seems to occur on the pluto server. Malt workers belong to the same process group as the main process, so they always receive and handle the signals.
    • Long-term solution: The problem seems to be that the InterruptException isn't caught by the task that we want, and more importantly there's no way to know which task was running (we'd need better dynamic analysis tools). On Malt's side, this is solved by having an outer exception handler (for the server loop), and inner exception handlers for each request/evaluation, ensuring every task has a handler. A similar mechanism might be useful on Pluto's side, but requires a lot of care since we basically need to know all the tasks that might be running at any given time.
  • Logging works when using the local logger. Making the logger global will raise a world age error.
    • why: No idea. But world age issues usually revolve around a misuse of eval.
    • Solution: Unknown.

Additional notes

  • The timeout for interrupting the workspace should probably be different depending on whether we're actually interrupting a cell or terminating the workspace, because it's too long for the latter.

TODO

  • Count Julia processes in a non-intrusive manner (for tests).

This PR is part of GSoC 2022. Some info on that here.

savq avatar Aug 05 '22 16:08 savq

Try this Pull Request!

Open Julia and type:

julia> import Pkg
julia> Pkg.activate(temp=true)
julia> Pkg.add(url="https://github.com/savq/Pluto.jl", rev="malt")
julia> using Pluto

github-actions[bot] avatar Aug 05 '22 16:08 github-actions[bot]

Would this change allow to interrupt cells on windows as well, or is that broken anyhow?

disberd avatar Oct 11 '22 08:10 disberd

Tried to use this with Dagger.jl and Distributed and for some reason, the addprocs() gets stuck. In Malt.jl though, it does work. πŸ‘€

pankgeorg avatar Aug 09 '23 10:08 pankgeorg

@pankgeorg Tried to use this with Dagger.jl and Distributed and for some reason, the addprocs() gets stuck. In Malt.jl though, it does work. πŸ‘€

This might have been because this PR won't enable the new Malt backend by default (first some testing period!) (although i temporarily enabled it by default in my previous commit to get CI to test Malt). Use workspace_use_distributed_stdlib = false to force using Malt.

fonsp avatar Sep 12 '23 18:09 fonsp

😒 This PR is pretty much done but I'm getting strange out-of-memory errors on Windows, but I don't think it is actually running out of RAM, it seems more like a socket/libuv bug.

@pankgeorg or @Pangoraw could you check if this also reproduces locally? (check out the branch and run test Pluto) Any insights? It looks like this happens somewhere here:

https://github.com/savq/Pluto.jl/blob/c4bb0e2cfc7b701d466ead2e7cb20275a846d7a7/test/Configuration.jl#L60-L148

(uncomment all other tests in runtests.jl to make it faster)

It might be caused by this line, where the shutdown interrupt is triggered async without waiting for it to finish:

https://github.com/savq/Pluto.jl/blob/c4bb0e2cfc7b701d466ead2e7cb20275a846d7a7/test/Configuration.jl#L126

This might be fixed by replacing the @async with try ... catch end but it's strange that this was not happening before this PR. (Because we launch the worker process in a way quite similar to Distributed.)

The CI logs also suggest that the worker did not write its port number to stdout, and some of the out-of-memory stack traces go through the flush call:

https://github.com/JuliaPluto/Malt.jl/blob/006c4f679fb4f0e822dd734e41aaf527b8469a4b/src/worker.jl#L26-L27

help!

fonsp avatar Sep 13 '23 10:09 fonsp

Do we have OPENBLAS_NUM_THREADS=1 set on the worker process env by defautl (see https://github.com/JuliaLang/julia/pull/47803#issuecomment-1338062937) ?

The issue on VERSION >= 1.9 might be different (ref https://github.com/JuliaLang/julia/issues/50325).

Pangoraw avatar Sep 13 '23 10:09 Pangoraw

I added a commit with OPENBLAS_NUM_THREADS=1 to test, feel free to revert.

Note that Pluto workers currently have this through the Distributed pr above: image

Pangoraw avatar Sep 13 '23 11:09 Pangoraw

Wowww well found @Pangoraw ! This is currently set on all distributed procs for every OS, right? Let's do the same by default on Malt directly?

fonsp avatar Sep 13 '23 13:09 fonsp

Not sure about the nightly test, and on 1.6 the test never seems to start...

fonsp avatar Sep 13 '23 13:09 fonsp

This is currently set on all distributed procs for every OS, right? Let's do the same by default on Malt directly?

yes, this is probably the right move.

For nightly, we could test 1.10.0-beta2 instead? and for 1.6, I don't know...

Pangoraw avatar Sep 13 '23 13:09 Pangoraw

Which 1.10 are we testing right now? Testing the beta sounds good!

I also made #2643 to avoid the interrupt call. This means avoiding the problem instead of fixing it, but I don't think that supporting schedule(pluto_server_task, InterruptException()) on Windows is something we need to test. It's important that Ctrl+C in the terminal shuts down the server, but I think that we are simulating this wrong with schedule, so I don't really know how to test for that.

fonsp avatar Sep 13 '23 14:09 fonsp

nightly is actually julia master (future 1.11), it would be nice if a docker tag for 1.10 existed so that we can get the alphas/betas & stable release for 1.10 automatically.

Pangoraw avatar Sep 13 '23 14:09 Pangoraw

Still failing with out-of-memory on Windows Julia 1.6 :(

fonsp avatar Sep 13 '23 17:09 fonsp

Maybe we just merge for testing (disabled by default), and eventually only enable Malt as default on Julia 1.8 and up?

@Pangoraw wdyt, any other ideas?

fonsp avatar Sep 16 '23 20:09 fonsp

Or @savq any ideas why we have this last failure with Malt julia 1.6 windows, but not with Distributed backend or another julia distro?

fonsp avatar Sep 16 '23 20:09 fonsp

@Pangoraw wdyt, any other ideas?

I don't have more ideas, we might have to test in real windows and see if we can spot the problem here.

Pangoraw avatar Sep 17 '23 18:09 Pangoraw

(btw some improvement to Crrl+C on windows in https://github.com/JuliaLang/julia/pull/51307)

fonsp avatar Sep 18 '23 08:09 fonsp

(btw some improvement to Crrl+C on windows in JuliaLang/julia#51307)

I do not see an improvement, see https://julialang.slack.com/archives/C6SMTHQ3T/p1694888153763359?thread_ts=1694639864.397389&cid=C6SMTHQ3T (which contains that PR)

pankgeorg avatar Sep 18 '23 10:09 pankgeorg

Heyy so I think I know what happened!

This initial PR removed the workspace_use_distributed option altogether, then later we added it to Malt (https://github.com/JuliaPluto/Malt.jl/pull/39) but some of the tests in the PR still had the workspace_use_distributed=false setting removed.

I fixed this in two recent commits: 7b09e1d (#2240) and a37fd31 (#2240) and then suddenly the tests passed with Malt enabled! fd8c68a (#2240)

I noticed that we were also getting OOM errors with the Distributed backend as default (27307e3 (#2240)) which means that it's not caused by Malt :) It's just some freaky combination of starting/closing an HTTP server while starting a Julia process that does not work on Windows.

Yayy so I will merge this with Malt still disabled by default, for the testing period, and then we can enable it by default soon (for all Julia versions)!

fonsp avatar Sep 18 '23 13:09 fonsp

Awesome!! :tada: :tada:

Pangoraw avatar Sep 18 '23 13:09 Pangoraw