root icon indicating copy to clipboard operation
root copied to clipboard

[df] Improve user interface in DistrRDF

Open martamaja10 opened this issue 1 year ago • 2 comments

Add a number of functions that could replace the way user interacts with the C++ elements of the distributed, fully pythonic RDF analysis, so that this is more natural and clear as well as less error prone compared to the previous method. The new functions are:

  • DeclareHeaders
  • DeclareSharedLibs
  • DeclareFiles
  • DeclareCppCode

The functions work with both Spark and Dask backends.

The tests are introduced in roottest PR: https://github.com/root-project/roottest/pull/1177

martamaja10 avatar Aug 26 '24 08:08 martamaja10

Test Results

    18 files      18 suites   4d 7h 34m 25s ⏱️  2 678 tests  2 678 ✅ 0 💤 0 ❌ 46 484 runs  46 484 ✅ 0 💤 0 ❌

Results for commit a0112de3.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Aug 26 '24 10:08 github-actions[bot]

  • An important part of this PR is deciding the naming scheme for the new user-facing API. I see two main ways: being pedantic or being coherent. Pedantic would mean that the function for headers should be IncludeHeaders, the one for shared libraries probably should be LoadSharedLibs etc. Instead if we want to be coherent and also hint at the fact that this is a tool for distributed execution, we could decide to name everything Distribute* so that it is already clear in the name of the function that the code/header/file will be uploaded to the workers somehow.

Personally, I would go for the Distribute* approach but we can discuss this next week in person.

  • I believe we need a section in the docs in RDataFrame.cxx describing these functions, with examples of usage. Ideally what I would like to see is something like a transition help guide, just to give an example
# If you do this in your local RDataFrame script
ROOT.gInterpreter.Declare("my_code")
df.Define(...)

# Do this in distributed RDF
ROOT.RDF.Distributed.DistributeCode("my_code")
df.Define(...)

sure, I will do that next.

I implemented all your comments (but the leftover debug statements - on purpose, as there are a few things we still need to debug - see my comments in the roottest PR - especially multiple declarations).

martamaja10 avatar Aug 29 '24 14:08 martamaja10