DataFrames.jl Support multithreading in groupreduce

Support multithreading in groupreduce

Open nalimilan opened this issue 3 years ago • 30 comments

Keep the default to a single thread until we find a reliable way of predicting a reasonably optimal number of threads.

Oct 17 '20 21:10 nalimilan

Note that after https://github.com/JuliaData/DataFrames.jl/pull/2481 method signatures will change so this PR will have merge conflicts.

Oct 17 '20 22:10 bkamins

I've rebased against master. Should be good for a review now. I just hope CI really used 2 threads but it's not visible in the logs AFAICT. EDIT: seeing Codecov, probably not.

Nov 25 '20 10:11 nalimilan

EDIT: seeing Codecov, probably not.

@quinnj - could you please have a quick look at the CI set-up? Thank you!

Nov 25 '20 10:11 bkamins

Wait, I just did something completely stupid when setting the environment variable. I've committed a probable fix.

Nov 25 '20 10:11 nalimilan

OK that worked!

Nov 25 '20 10:11 nalimilan

I've added DataFrames.nthreads (called to choose the default nthreads) and DataFrames.nthreads! to set it, to make it easier to choose the number of threads once for a session, as requested by Jan.

Nov 25 '20 21:11 nalimilan

I was just thinking about nthreads! sounding a little weird because it's not a verb. Given that it's a global setting, would it make sense to create a struct that holds on to all global settings? Then, the user can configure it via setproperty!. Just a thought...

Nov 25 '20 22:11 tk3369

I was just thinking about nthreads! sounding a little weird because it's not a verb. Given that it's a global setting, would it make sense to create a struct that holds on to all global settings? Then, the user can configure it via setproperty!. Just a thought...

Yes, I also thought about setnthreads!. That's an eternal question about naming conventions for setters in Julia. But note that we have Random.seed! already in the stdlib, which is quite similar.

Adding a struct for global settings sounds overkill at this point since it's not clear we're going to add more settings in the future. I'd rather keep this minimal, so that it can be extended if needed in the future, e.g. using Preferences.jl for storage across sessions.

Nov 26 '20 09:11 nalimilan

I'd rather keep this minimal

I am OK with the current design as it is an implementation detail. Though probably having a mutable struct instead of Ref is more natural (but we can always change it in the future).

Regarding the naming - I am OK with nthreads!, but in general I do not have a super strong opinion here.

Nov 26 '20 12:11 bkamins

The Ref is an implementation detail, but we have to decide whether DataFrames.nthreads and DataFrames.nthreads! are part of the public API. If we're hesitant we could avoid mentioning these in the docs and only use them for benchmarking for now (like DataFrames.precompile).

Nov 26 '20 12:11 nalimilan

I am OK to have them as a part of the public API.

@StefanKarpinski - if Julia Base were to add an option to change number of available threads dynamically during the Julia session, would Threads.nthreads! would be a name for this operation?

Nov 26 '20 12:11 bkamins

This can be closed given the other approach we took? Or we keep it open for the future?

Jan 24 '21 18:01 bkamins

This PR is complementary with the other threaded grouping PRs, since without it optimized reductions are not multithreaded. We should check whether we can identify conditions under which multithreading is faster than single threading.

Mar 06 '21 12:03 nalimilan

I have totally forgotten about this PR. I thought you have merged it :). So what is the status here - are you working on the condition or I should review it as is?

Thank you!

Mar 06 '21 17:03 bkamins

Given that we now automatically spawn as many tasks as there are threads in other places, requiring users to set NTHREADS for this function may be a bit weird. I guess we should try to automatically compute a reasonable number of threads -- but that turned out to be quite tricky.

Mar 06 '21 19:03 nalimilan

equiring users to set NTHREADS for this function may be a bit weird

I agree

but that turned out to be quite tricky.

Is it possible to give some safe condition though? (I do not remember the details of our earlier discussion)

Mar 06 '21 19:03 bkamins

Is it possible to give some safe condition though? (I do not remember the details of our earlier discussion)

Probably. I need to look at the data again. What's difficult is that there are several dimensions to cross: number of groups, number of rows, number of threads. And IIRC the overhead due to threading is larger when Julia was started with many threads.

Finally, there's always the possibility that all CPUs are already busy with another threaded operation, so that spawning multiple tasks is counter-productive. Maybe that's not a serious issue.

Mar 08 '21 09:03 nalimilan

Should I mark it as 1.0, or keep it 1.x and we will merge it when we are comfortable?

Mar 08 '21 09:03 bkamins

I think this doesn't have to happen before 1.0.

Mar 08 '21 20:03 nalimilan

@nalimilan - what do we do with this PR? Close it (as it is likely outdated)?

In general - for the future we should rather add a kwarg allowing to choose if users wants to use multithreading c.f. https://github.com/JuliaData/DataFrames.jl/issues/2992.

Jan 31 '22 10:01 bkamins

bump

@nalimilan - what do you think we should do with this PR?

Feb 20 '22 12:02 bkamins

I'm not sure. There are two aspects to address:

If we decide to add keyword arguments to choose the number of threads (https://github.com/JuliaData/DataFrames.jl/issues/2992, https://github.com/JuliaData/DataFrames.jl/issues/2988), it would make sense to (rebase and) merge this.
Otherwise, we would need to find conditions under which using multiple threads is always faster. I haven't retried.

Feb 20 '22 22:02 nalimilan

@nalimilan - following our discussion I recommend to do what is written in https://github.com/JuliaData/DataFrames.jl/issues/2988#issuecomment-1046664288.

We would then implement this proposed change. There are the following things to be done to implement this change:

implement the feature of this PR
implement enabling/disabling threading of what you propose in this PR
implement enabling/disabling threading in general in _combine (across transformation operations)
implement ntasks keyword argument in all relevant functions

@nalimilan - can you please comment which parts of this list you would be willing to have a look at (I understand that 1, 2, and 4 must be covered in this PR, but 3 can be done in a separate follow-up PR that I can do if you prefer after this PR is finished but before 1.4 release)

Additional maintenance tasks:

add tests for all cases
create and run benchmarks
update docstrings
update documentation
update NEWS.md

Feb 21 '22 09:02 bkamins

OK I'll rebase this PR, but I'd rather leave point 3 for another one.

Feb 23 '22 08:02 nalimilan

Sure. Thank you!

Feb 23 '22 10:02 bkamins

(probably starting a new branch and making proper changes + force push rather than rebase would be easier, as I am not sure by how much we have changed the sources you touched)

Feb 23 '22 10:02 bkamins

@nalimilan - what are your plans with this PR 😄? (I am asking because it is getting so old that there is a risk that it is better to re-implement it from scratch if we want to have it given the recent changes)

Jun 12 '22 18:06 bkamins

Now that we have the threads keyword argument, we could allow passing an integer to enable multithreaded groupreduce. But it would be kind of weird that the default threads=true wouldn't enable multithreading here, while e.g. threads=2 would. Also threads=2 wouldn't actually limit the number of tasks to 2 as the existing code doesn't really allow doing that (nor should it). And anyway many people wouldn't benefit from it as they will probably expect threading to be automatic, just like it is in other cases.

So it would probably be worth experimenting more to find conditions under which we can be sure starting multiple tasks is faster.

Jun 14 '22 07:06 nalimilan

OK - so let us keep it open for now.

Jun 14 '22 08:06 bkamins

I converted this PR to draft to avoid accidental merging.

Dec 02 '22 10:12 bkamins

DataFrames.jl DataFrames.jl copied to clipboard

Support multithreading in groupreduce

DataFrames.jl
DataFrames.jl copied to clipboard