[df] Improve user interface in DistrRDF
Add a set of functions that replace the way the user interacts with the C++ parts of a distributed, fully pythonic RDF analysis, making this interaction more natural, clearer and less error prone than the previous method. The new functions are:
- DeclareHeaders
- DeclareSharedLibs
- DeclareFiles
- DeclareCppCode
The functions work with both Spark and Dask backends.
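As a rough sketch of the intended usage (the attachment point of the new helpers under `ROOT.RDF.Experimental.Distributed`, their single-string arguments, and the file/tree names are assumptions for illustration only, not confirmed by this PR), a Dask-based analysis could look like:

```python
import ROOT
from dask.distributed import Client, LocalCluster

# Distributed RDataFrame on the Dask backend, using a local cluster as an example
client = Client(LocalCluster(n_workers=2))
RDataFrame = ROOT.RDF.Experimental.Distributed.Dask.RDataFrame
df = RDataFrame("mytree", "myfile.root", daskclient=client)

# Hypothetical calls to the new helpers: make local headers, shared libraries,
# auxiliary files and C++ code available on every worker
Distributed = ROOT.RDF.Experimental.Distributed
Distributed.DeclareHeaders("myanalysis.h")
Distributed.DeclareSharedLibs("libMyAnalysis.so")
Distributed.DeclareFiles("calibration.txt")
Distributed.DeclareCppCode("float scale(float x) { return 2.f * x; }")

# The declared code can then be used in the usual RDataFrame calls
h = df.Define("scaled_pt", "scale(pt)").Histo1D("scaled_pt")
h.Draw()
```

The point of the sketch is that everything stays on the Python side: the user calls one function per kind of C++ dependency instead of dealing with the interpreter mechanics on the workers directly.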
The tests are introduced in roottest PR: https://github.com/root-project/roottest/pull/1177
Test Results
18 files, 18 suites, 4d 7h 34m 25s ⏱️
2 678 tests: 2 678 ✅, 0 💤, 0 ❌
46 484 runs: 46 484 ✅, 0 💤, 0 ❌
Results for commit a0112de3.
- An important part of this PR is deciding the naming scheme for the new user-facing API. I see two main ways: being pedantic or being coherent. Pedantic would mean that the function for headers should be `IncludeHeaders`, the one for shared libraries probably should be `LoadSharedLibs`, etc. Instead, if we want to be coherent and also hint at the fact that this is a tool for distributed execution, we could decide to name everything `Distribute*`, so that it is already clear from the name of the function that the code/header/file will be uploaded to the workers somehow.
Personally, I would go for the `Distribute*` approach, but we can discuss this next week in person.
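For illustration only (these are just the candidate names mentioned above; none of them exists under the namespace shown here, and `DistributeCppCode` is an extrapolation of the `Distribute*` idea), the two schemes would compare roughly as:

```python
import ROOT

# "Pedantic" scheme: each name mirrors the underlying C++ action
ROOT.RDF.Distributed.IncludeHeaders("myanalysis.h")
ROOT.RDF.Distributed.LoadSharedLibs("libMyAnalysis.so")
ROOT.RDF.Distributed.DeclareCppCode("float scale(float x) { return 2.f * x; }")

# "Coherent" Distribute* scheme: each name says what happens on the cluster
ROOT.RDF.Distributed.DistributeHeaders("myanalysis.h")
ROOT.RDF.Distributed.DistributeSharedLibs("libMyAnalysis.so")
ROOT.RDF.Distributed.DistributeCppCode("float scale(float x) { return 2.f * x; }")
```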
- I believe we need a section in the docs in `RDataFrame.cxx` describing these functions, with examples of usage. Ideally, what I would like to see is something like a transition help guide, just to give an example:

```python
# If you do this in your local RDataFrame script
ROOT.gInterpreter.Declare("my_code")
df.Define(...)

# Do this in distributed RDF
ROOT.RDF.Distributed.DistributeCode("my_code")
df.Define(...)
```
Sure, I will do that next.
I implemented all your comments, except for the leftover debug statements, which I kept on purpose: there are a few things we still need to debug (see my comments in the roottest PR, especially about the multiple declarations).