DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

Support multithreading in groupreduce

Open nalimilan opened this issue 3 years ago • 30 comments

Keep the default to a single thread until we find a reliable way of predicting a reasonably optimal number of threads.

nalimilan avatar Oct 17 '20 21:10 nalimilan

Note that after https://github.com/JuliaData/DataFrames.jl/pull/2481 method signatures will change so this PR will have merge conflicts.

bkamins avatar Oct 17 '20 22:10 bkamins

I've rebased against master. Should be good for a review now. I just hope CI really used 2 threads but it's not visible in the logs AFAICT. EDIT: seeing Codecov, probably not.

nalimilan avatar Nov 25 '20 10:11 nalimilan

EDIT: seeing Codecov, probably not.

@quinnj - could you please have a quick look at the CI set-up? Thank you!

bkamins avatar Nov 25 '20 10:11 bkamins

Wait, I just did something completely stupid when setting the environment variable. I've committed a probable fix.

nalimilan avatar Nov 25 '20 10:11 nalimilan

OK that worked!

nalimilan avatar Nov 25 '20 10:11 nalimilan

I've added DataFrames.nthreads (called to choose the default nthreads) and DataFrames.nthreads! to set it, to make it easier to choose the number of threads once for a session, as requested by Jan.

nalimilan avatar Nov 25 '20 21:11 nalimilan

I was just thinking about nthreads! sounding a little weird because it's not a verb. Given that it's a global setting, would it make sense to create a struct that holds on to all global settings? Then, the user can configure it via setproperty!. Just a thought...

tk3369 avatar Nov 25 '20 22:11 tk3369

I was just thinking about nthreads! sounding a little weird because it's not a verb. Given that it's a global setting, would it make sense to create a struct that holds on to all global settings? Then, the user can configure it via setproperty!. Just a thought...

Yes, I also thought about setnthreads!. That's an eternal question about naming conventions for setters in Julia. But note that we have Random.seed! already in the stdlib, which is quite similar.

Adding a struct for global settings sounds overkill at this point since it's not clear we're going to add more settings in the future. I'd rather keep this minimal, so that it can be extended if needed in the future, e.g. using Preferences.jl for storage across sessions.

nalimilan avatar Nov 26 '20 09:11 nalimilan

I'd rather keep this minimal

I am OK with the current design as it is an implementation detail. Though probably having a mutable struct instead of Ref is more natural (but we can always change it in the future).

Regarding the naming - I am OK with nthreads!, but in general I do not have a super strong opinion here.

bkamins avatar Nov 26 '20 12:11 bkamins

The Ref is an implementation detail, but we have to decide whether DataFrames.nthreads and DataFrames.nthreads! are part of the public API. If we're hesitant we could avoid mentioning these in the docs and only use them for benchmarking for now (like DataFrames.precompile).

nalimilan avatar Nov 26 '20 12:11 nalimilan

I am OK to have them as a part of the public API.

@StefanKarpinski - if Julia Base were to add an option to change number of available threads dynamically during the Julia session, would Threads.nthreads! would be a name for this operation?

bkamins avatar Nov 26 '20 12:11 bkamins

This can be closed given the other approach we took? Or we keep it open for the future?

bkamins avatar Jan 24 '21 18:01 bkamins

This PR is complementary with the other threaded grouping PRs, since without it optimized reductions are not multithreaded. We should check whether we can identify conditions under which multithreading is faster than single threading.

nalimilan avatar Mar 06 '21 12:03 nalimilan

I have totally forgotten about this PR. I thought you have merged it :). So what is the status here - are you working on the condition or I should review it as is?

Thank you!

bkamins avatar Mar 06 '21 17:03 bkamins

Given that we now automatically spawn as many tasks as there are threads in other places, requiring users to set NTHREADS for this function may be a bit weird. I guess we should try to automatically compute a reasonable number of threads -- but that turned out to be quite tricky.

nalimilan avatar Mar 06 '21 19:03 nalimilan

equiring users to set NTHREADS for this function may be a bit weird

I agree

but that turned out to be quite tricky.

Is it possible to give some safe condition though? (I do not remember the details of our earlier discussion)

bkamins avatar Mar 06 '21 19:03 bkamins

Is it possible to give some safe condition though? (I do not remember the details of our earlier discussion)

Probably. I need to look at the data again. What's difficult is that there are several dimensions to cross: number of groups, number of rows, number of threads. And IIRC the overhead due to threading is larger when Julia was started with many threads.

Finally, there's always the possibility that all CPUs are already busy with another threaded operation, so that spawning multiple tasks is counter-productive. Maybe that's not a serious issue.

nalimilan avatar Mar 08 '21 09:03 nalimilan

Should I mark it as 1.0, or keep it 1.x and we will merge it when we are comfortable?

bkamins avatar Mar 08 '21 09:03 bkamins

I think this doesn't have to happen before 1.0.

nalimilan avatar Mar 08 '21 20:03 nalimilan

@nalimilan - what do we do with this PR? Close it (as it is likely outdated)?

In general - for the future we should rather add a kwarg allowing to choose if users wants to use multithreading c.f. https://github.com/JuliaData/DataFrames.jl/issues/2992.

bkamins avatar Jan 31 '22 10:01 bkamins

bump

@nalimilan - what do you think we should do with this PR?

bkamins avatar Feb 20 '22 12:02 bkamins

I'm not sure. There are two aspects to address:

  • If we decide to add keyword arguments to choose the number of threads (https://github.com/JuliaData/DataFrames.jl/issues/2992, https://github.com/JuliaData/DataFrames.jl/issues/2988), it would make sense to (rebase and) merge this.
  • Otherwise, we would need to find conditions under which using multiple threads is always faster. I haven't retried.

nalimilan avatar Feb 20 '22 22:02 nalimilan

@nalimilan - following our discussion I recommend to do what is written in https://github.com/JuliaData/DataFrames.jl/issues/2988#issuecomment-1046664288.

We would then implement this proposed change. There are the following things to be done to implement this change:

  1. implement the feature of this PR
  2. implement enabling/disabling threading of what you propose in this PR
  3. implement enabling/disabling threading in general in _combine (across transformation operations)
  4. implement ntasks keyword argument in all relevant functions

@nalimilan - can you please comment which parts of this list you would be willing to have a look at (I understand that 1, 2, and 4 must be covered in this PR, but 3 can be done in a separate follow-up PR that I can do if you prefer after this PR is finished but before 1.4 release)

Additional maintenance tasks:

  • add tests for all cases
  • create and run benchmarks
  • update docstrings
  • update documentation
  • update NEWS.md

bkamins avatar Feb 21 '22 09:02 bkamins

OK I'll rebase this PR, but I'd rather leave point 3 for another one.

nalimilan avatar Feb 23 '22 08:02 nalimilan

Sure. Thank you!

bkamins avatar Feb 23 '22 10:02 bkamins

(probably starting a new branch and making proper changes + force push rather than rebase would be easier, as I am not sure by how much we have changed the sources you touched)

bkamins avatar Feb 23 '22 10:02 bkamins

@nalimilan - what are your plans with this PR 😄? (I am asking because it is getting so old that there is a risk that it is better to re-implement it from scratch if we want to have it given the recent changes)

bkamins avatar Jun 12 '22 18:06 bkamins

Now that we have the threads keyword argument, we could allow passing an integer to enable multithreaded groupreduce. But it would be kind of weird that the default threads=true wouldn't enable multithreading here, while e.g. threads=2 would. Also threads=2 wouldn't actually limit the number of tasks to 2 as the existing code doesn't really allow doing that (nor should it). And anyway many people wouldn't benefit from it as they will probably expect threading to be automatic, just like it is in other cases.

So it would probably be worth experimenting more to find conditions under which we can be sure starting multiple tasks is faster.

nalimilan avatar Jun 14 '22 07:06 nalimilan

OK - so let us keep it open for now.

bkamins avatar Jun 14 '22 08:06 bkamins

I converted this PR to draft to avoid accidental merging.

bkamins avatar Dec 02 '22 10:12 bkamins