Moar threadsafe moar better
This is a rebased version of #4, it should be ready to merge. Fixes #73.
(made after discussing with @jpsamaroo)
CC @vchuravy, @vtjnash
I tracked the test failures down to: https://github.com/JuliaLang/julia/pull/53326
The workers are loading the builtin version of Distributed from the sysimg while the master is using the development version, and because we changed the definition of the Worker struct in this PR we get a serialization error when doing a remotecall to another worker because the master and worker are using different types.
I think this should be fixed in Base, made a PR here: https://github.com/JuliaLang/julia/pull/54571 That should be merged before this one.
Codecov Report
Attention: Patch coverage is 84.74576% with 9 lines in your changes missing coverage. Please review.
Project coverage is 79.50%. Comparing base (
3b9e4fd) to head (90041ca). Report is 9 commits behind head on master.
| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/cluster.jl | 86.95% | 6 Missing :warning: |
| src/managers.jl | 50.00% | 1 Missing :warning: |
| src/messages.jl | 0.00% | 1 Missing :warning: |
| src/process_messages.jl | 85.71% | 1 Missing :warning: |
Additional details and impacted files
@@ Coverage Diff @@
## master #101 +/- ##
==========================================
+ Coverage 79.24% 79.50% +0.25%
==========================================
Files 10 10
Lines 1913 1922 +9
==========================================
+ Hits 1516 1528 +12
+ Misses 397 394 -3
:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.
Alrighty, after a force push to trigger CI with latest nightly we are back in the green :partying_face: I think this is ready to be merged now.
Will this be compatible to or backported, if necessary, to the next Julia LTS?
I'll leave that for someone more qualified to properly answer, but FWIW if 1.11 is chosen as the next LTS then it'll be possible to upgrade Distributed now that it's an excised stdlib :octopus:
One other thing I noticed is that this should probably be using Threads.@spawn everywhere instead of @async, but I'll leave that for another PR.
I believe @jblaschke was going to have a look at it. In the meantime I see the branch is out of date, so I'll rebase it.
@JamesWrigley @jonas-schulze I'll be working on this this week. Just getting back up to speed after long travel...
Are there any updates on this? @JamesWrigley @JBlaschke
Not from me, still need someone to review it. I believe the hesitation to merge comes from Distributed being used to run the Julia tests, so it's quite critical that this works properly.
But in the meantime you can ] dev https://github.com/JamesWrigley/Distributed.jl.git#jps/threadsafe_workerstate on 1.11 to use this branch.
Since we are making things threadsafe I would look at all @async uses because we want to deprecate it.
Ok, I replaced all uses of @async with Threads.@spawn in 7cac33f. But there are still a couple places where Base.@async_unwrap is used and I can't find an easy analogue for Threads.@spawn, I'll see if I can write a replacement.
After thinking about it for a bit, I can't come up with a decent replacement short of basically reimplementing Threads.@spawn. We could replace @async_unwrap with Threads.@spawn, but it would change the type of any exceptions thrown from whatever they currently are to a TaskFailedException, which is technically breaking.
I'd suggest keeping it in for now, we can add support for unwrapping exceptions to Threads.@spawn and switch in a future release.
Wee progress update for those following this, after some discussion with @vchuravy on Slack I decided to:
- Try to run the Julia tests in the Distributed.jl CI since that's what was failing last time.
- Attempt to audit the code for correctness.
Which I'll get to at... some point.
Wee update on this, the Julia tests 'failing' last time means they were hanging. In DistributedNext I've fixed a couple of hangs related to lingering tasks, perhaps that will solve the hangs.
About the code audit, if someone from the core team commits to reviewing this then I can try to do the audit before the 1.12 release if that's desired.
I'm happy to review (in addition to the core devs)
Seems like there's a small conflict
(should be fixed now)
Apart from my suggestions above, this seems like it's a big improvement, and given it's lingered a bit waiting for reviews, it may be reasonable to just go ahead and fix any issues in the 1.12 pre-release process
I agree :) And as I said almost a year ago when I first started pestering people about it, I'm willing to debug any issues in Julia CI that arise.
If this is merged I will also upstream another hang fix: https://github.com/JuliaParallel/DistributedNext.jl/pull/17#issuecomment-2573871918
(or upstream it eventually anyway when that PR is merged)
Btw, should we rename this PR to "Thread safety improvements v2"?
What do you mean the title makes perfect sense :grin:
(I kid, feel free to change it if you like)