Distributed.jl icon indicating copy to clipboard operation
Distributed.jl copied to clipboard

Moar threadsafe moar better

Open JamesWrigley opened this issue 1 year ago • 13 comments

This is a rebased version of #4, it should be ready to merge. Fixes #73.

(made after discussing with @jpsamaroo)

CC @vchuravy, @vtjnash

JamesWrigley avatar May 23 '24 15:05 JamesWrigley

I tracked the test failures down to: https://github.com/JuliaLang/julia/pull/53326 The workers are loading the builtin version of Distributed from the sysimg while the master is using the development version, and because we changed the definition of the Worker struct in this PR we get a serialization error when doing a remotecall to another worker because the master and worker are using different types.

I think this should be fixed in Base, made a PR here: https://github.com/JuliaLang/julia/pull/54571 That should be merged before this one.

JamesWrigley avatar May 24 '24 12:05 JamesWrigley

Codecov Report

Attention: Patch coverage is 84.74576% with 9 lines in your changes missing coverage. Please review.

Project coverage is 79.50%. Comparing base (3b9e4fd) to head (90041ca). Report is 9 commits behind head on master.

Files with missing lines Patch % Lines
src/cluster.jl 86.95% 6 Missing :warning:
src/managers.jl 50.00% 1 Missing :warning:
src/messages.jl 0.00% 1 Missing :warning:
src/process_messages.jl 85.71% 1 Missing :warning:
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #101      +/-   ##
==========================================
+ Coverage   79.24%   79.50%   +0.25%     
==========================================
  Files          10       10              
  Lines        1913     1922       +9     
==========================================
+ Hits         1516     1528      +12     
+ Misses        397      394       -3     

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

codecov[bot] avatar May 25 '24 11:05 codecov[bot]

Alrighty, after a force push to trigger CI with latest nightly we are back in the green :partying_face: I think this is ready to be merged now.

JamesWrigley avatar May 25 '24 11:05 JamesWrigley

Will this be compatible to or backported, if necessary, to the next Julia LTS?

I'll leave that for someone more qualified to properly answer, but FWIW if 1.11 is chosen as the next LTS then it'll be possible to upgrade Distributed now that it's an excised stdlib :octopus:

JamesWrigley avatar Jun 02 '24 17:06 JamesWrigley

One other thing I noticed is that this should probably be using Threads.@spawn everywhere instead of @async, but I'll leave that for another PR.

JamesWrigley avatar Jun 03 '24 09:06 JamesWrigley

I believe @jblaschke was going to have a look at it. In the meantime I see the branch is out of date, so I'll rebase it.

JamesWrigley avatar Jun 24 '24 14:06 JamesWrigley

@JamesWrigley @jonas-schulze I'll be working on this this week. Just getting back up to speed after long travel...

JBlaschke avatar Jun 24 '24 15:06 JBlaschke

Are there any updates on this? @JamesWrigley @JBlaschke

zsz00 avatar Jul 21 '24 03:07 zsz00

Not from me, still need someone to review it. I believe the hesitation to merge comes from Distributed being used to run the Julia tests, so it's quite critical that this works properly.

But in the meantime you can ] dev https://github.com/JamesWrigley/Distributed.jl.git#jps/threadsafe_workerstate on 1.11 to use this branch.

JamesWrigley avatar Jul 21 '24 09:07 JamesWrigley

Since we are making things threadsafe I would look at all @async uses because we want to deprecate it.

gbaraldi avatar Jul 23 '24 18:07 gbaraldi

Ok, I replaced all uses of @async with Threads.@spawn in 7cac33f. But there are still a couple places where Base.@async_unwrap is used and I can't find an easy analogue for Threads.@spawn, I'll see if I can write a replacement.

JamesWrigley avatar Jul 23 '24 19:07 JamesWrigley

After thinking about it for a bit, I can't come up with a decent replacement short of basically reimplementing Threads.@spawn. We could replace @async_unwrap with Threads.@spawn, but it would change the type of any exceptions thrown from whatever they currently are to a TaskFailedException, which is technically breaking.

I'd suggest keeping it in for now, we can add support for unwrapping exceptions to Threads.@spawn and switch in a future release.

JamesWrigley avatar Jul 23 '24 20:07 JamesWrigley

Wee progress update for those following this, after some discussion with @vchuravy on Slack I decided to:

  1. Try to run the Julia tests in the Distributed.jl CI since that's what was failing last time.
  2. Attempt to audit the code for correctness.

Which I'll get to at... some point.

JamesWrigley avatar Aug 28 '24 08:08 JamesWrigley

Wee update on this, the Julia tests 'failing' last time means they were hanging. In DistributedNext I've fixed a couple of hangs related to lingering tasks, perhaps that will solve the hangs.

About the code audit, if someone from the core team commits to reviewing this then I can try to do the audit before the 1.12 release if that's desired.

JamesWrigley avatar Jan 20 '25 13:01 JamesWrigley

I'm happy to review (in addition to the core devs)

IanButterworth avatar Jan 20 '25 13:01 IanButterworth

Seems like there's a small conflict

IanButterworth avatar Jan 20 '25 13:01 IanButterworth

(should be fixed now)

JamesWrigley avatar Jan 20 '25 13:01 JamesWrigley

Apart from my suggestions above, this seems like it's a big improvement, and given it's lingered a bit waiting for reviews, it may be reasonable to just go ahead and fix any issues in the 1.12 pre-release process

IanButterworth avatar Jan 20 '25 20:01 IanButterworth

I agree :) And as I said almost a year ago when I first started pestering people about it, I'm willing to debug any issues in Julia CI that arise.

JamesWrigley avatar Jan 21 '25 18:01 JamesWrigley

If this is merged I will also upstream another hang fix: https://github.com/JuliaParallel/DistributedNext.jl/pull/17#issuecomment-2573871918

(or upstream it eventually anyway when that PR is merged)

JamesWrigley avatar Jan 21 '25 18:01 JamesWrigley

Btw, should we rename this PR to "Thread safety improvements v2"?

IanButterworth avatar Jan 21 '25 19:01 IanButterworth

What do you mean the title makes perfect sense :grin:

(I kid, feel free to change it if you like)

JamesWrigley avatar Jan 21 '25 19:01 JamesWrigley