DataFrames.jl
DataFrames.jl copied to clipboard
Support multithreading in groupreduce
Keep the default to a single thread until we find a reliable way of predicting a reasonably optimal number of threads.
Note that after https://github.com/JuliaData/DataFrames.jl/pull/2481 method signatures will change so this PR will have merge conflicts.
I've rebased against master. Should be good for a review now. I just hope CI really used 2 threads but it's not visible in the logs AFAICT. EDIT: seeing Codecov, probably not.
EDIT: seeing Codecov, probably not.
@quinnj - could you please have a quick look at the CI set-up? Thank you!
Wait, I just did something completely stupid when setting the environment variable. I've committed a probable fix.
OK that worked!
I've added DataFrames.nthreads
(called to choose the default nthreads
) and DataFrames.nthreads!
to set it, to make it easier to choose the number of threads once for a session, as requested by Jan.
I was just thinking about nthreads!
sounding a little weird because it's not a verb. Given that it's a global setting, would it make sense to create a struct that holds on to all global settings? Then, the user can configure it via setproperty!
. Just a thought...
I was just thinking about
nthreads!
sounding a little weird because it's not a verb. Given that it's a global setting, would it make sense to create a struct that holds on to all global settings? Then, the user can configure it viasetproperty!
. Just a thought...
Yes, I also thought about setnthreads!
. That's an eternal question about naming conventions for setters in Julia. But note that we have Random.seed!
already in the stdlib, which is quite similar.
Adding a struct for global settings sounds overkill at this point since it's not clear we're going to add more settings in the future. I'd rather keep this minimal, so that it can be extended if needed in the future, e.g. using Preferences.jl for storage across sessions.
I'd rather keep this minimal
I am OK with the current design as it is an implementation detail. Though probably having a mutable struct
instead of Ref
is more natural (but we can always change it in the future).
Regarding the naming - I am OK with nthreads!
, but in general I do not have a super strong opinion here.
The Ref
is an implementation detail, but we have to decide whether DataFrames.nthreads
and DataFrames.nthreads!
are part of the public API. If we're hesitant we could avoid mentioning these in the docs and only use them for benchmarking for now (like DataFrames.precompile
).
I am OK to have them as a part of the public API.
@StefanKarpinski - if Julia Base were to add an option to change number of available threads dynamically during the Julia session, would Threads.nthreads!
would be a name for this operation?
This can be closed given the other approach we took? Or we keep it open for the future?
This PR is complementary with the other threaded grouping PRs, since without it optimized reductions are not multithreaded. We should check whether we can identify conditions under which multithreading is faster than single threading.
I have totally forgotten about this PR. I thought you have merged it :). So what is the status here - are you working on the condition or I should review it as is?
Thank you!
Given that we now automatically spawn as many tasks as there are threads in other places, requiring users to set NTHREADS
for this function may be a bit weird. I guess we should try to automatically compute a reasonable number of threads -- but that turned out to be quite tricky.
equiring users to set
NTHREADS
for this function may be a bit weird
I agree
but that turned out to be quite tricky.
Is it possible to give some safe condition though? (I do not remember the details of our earlier discussion)
Is it possible to give some safe condition though? (I do not remember the details of our earlier discussion)
Probably. I need to look at the data again. What's difficult is that there are several dimensions to cross: number of groups, number of rows, number of threads. And IIRC the overhead due to threading is larger when Julia was started with many threads.
Finally, there's always the possibility that all CPUs are already busy with another threaded operation, so that spawning multiple tasks is counter-productive. Maybe that's not a serious issue.
Should I mark it as 1.0, or keep it 1.x and we will merge it when we are comfortable?
I think this doesn't have to happen before 1.0.
@nalimilan - what do we do with this PR? Close it (as it is likely outdated)?
In general - for the future we should rather add a kwarg allowing to choose if users wants to use multithreading c.f. https://github.com/JuliaData/DataFrames.jl/issues/2992.
bump
@nalimilan - what do you think we should do with this PR?
I'm not sure. There are two aspects to address:
- If we decide to add keyword arguments to choose the number of threads (https://github.com/JuliaData/DataFrames.jl/issues/2992, https://github.com/JuliaData/DataFrames.jl/issues/2988), it would make sense to (rebase and) merge this.
- Otherwise, we would need to find conditions under which using multiple threads is always faster. I haven't retried.
@nalimilan - following our discussion I recommend to do what is written in https://github.com/JuliaData/DataFrames.jl/issues/2988#issuecomment-1046664288.
We would then implement this proposed change. There are the following things to be done to implement this change:
- implement the feature of this PR
- implement enabling/disabling threading of what you propose in this PR
- implement enabling/disabling threading in general in
_combine
(across transformation operations) - implement
ntasks
keyword argument in all relevant functions
@nalimilan - can you please comment which parts of this list you would be willing to have a look at (I understand that 1, 2, and 4 must be covered in this PR, but 3 can be done in a separate follow-up PR that I can do if you prefer after this PR is finished but before 1.4 release)
Additional maintenance tasks:
- add tests for all cases
- create and run benchmarks
- update docstrings
- update documentation
- update NEWS.md
OK I'll rebase this PR, but I'd rather leave point 3 for another one.
Sure. Thank you!
(probably starting a new branch and making proper changes + force push rather than rebase would be easier, as I am not sure by how much we have changed the sources you touched)
@nalimilan - what are your plans with this PR 😄? (I am asking because it is getting so old that there is a risk that it is better to re-implement it from scratch if we want to have it given the recent changes)
Now that we have the threads
keyword argument, we could allow passing an integer to enable multithreaded groupreduce
. But it would be kind of weird that the default threads=true
wouldn't enable multithreading here, while e.g. threads=2
would. Also threads=2
wouldn't actually limit the number of tasks to 2 as the existing code doesn't really allow doing that (nor should it). And anyway many people wouldn't benefit from it as they will probably expect threading to be automatic, just like it is in other cases.
So it would probably be worth experimenting more to find conditions under which we can be sure starting multiple tasks is faster.
OK - so let us keep it open for now.
I converted this PR to draft to avoid accidental merging.