[DF] Add initial implementation for snapshotting to RNTuple

Open enirolf opened this issue 8 months ago • 1 comments

This PR adds a first iteration of snapshotting to RNTuple from an RDataFrame. It uses the existing Snapshot interface, with one addition to RSnapshotOptions: kOutputFormat. This option can be set to write to TTree, to write to RNTuple, or to take the default choice. The table below shows which output format option produces each combination of input and output format:

| | From TTree | From RNTuple | From other DS |
|---|---|---|---|
| To TTree | ESnapshotOutputFormat::kDefault | ESnapshotOutputFormat::kTTree | ESnapshotOutputFormat::kDefault |
| To RNTuple | Not yet possible; will be added in a follow-up using functionality from RNTupleImporter | ESnapshotOutputFormat::kDefault | ESnapshotOutputFormat::kRNTuple |
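
The table can be read as a small decision function. The following is only a sketch of that reading, not the code in this PR: the enums `OutputFormat`, `SourceKind` and `Written` and the function `ResolveOutput` are hypothetical stand-ins. It also assumes that explicitly requesting the format the input already has is accepted, which the table does not state.

```cpp
#include <cassert>
#include <optional>

// Hypothetical stand-ins for this sketch; they are not ROOT's actual types.
enum class OutputFormat { kDefault, kTTree, kRNTuple };
enum class SourceKind { kTTree, kRNTuple, kOther };
enum class Written { kTTree, kRNTuple };

// Returns the format that would be written, or std::nullopt for the
// TTree -> RNTuple combination that the table lists as not yet possible.
std::optional<Written> ResolveOutput(SourceKind source, OutputFormat requested)
{
   switch (requested) {
   case OutputFormat::kTTree:
      return Written::kTTree;
   case OutputFormat::kRNTuple:
      if (source == SourceKind::kTTree)
         return std::nullopt; // not yet supported
      return Written::kRNTuple;
   case OutputFormat::kDefault:
      // RNTuple input stays RNTuple; TTree and other data sources go to TTree.
      return source == SourceKind::kRNTuple ? Written::kRNTuple : Written::kTTree;
   }
   assert(false && "unhandled output format");
   return std::nullopt;
}
```

In this reading, the default keeps the input format where one exists and falls back to TTree for any other data source.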

Implementation

As mentioned, the existing Snapshot interface is used. A new SnapshotRNTupleHelper has been created to handle the creation and writing of the RNTuple, akin to the existing SnapshotHelper (which has been renamed to SnapshotTTreeHelper for consistency).

RLoopManager data source initialization (rev bbf221f)

The snapshot action creates a new loop manager which manages the snapshotted data set. The loop manager gets initialized before the actual snapshotting takes place. Originally, the pointer to the data source owned by the loop manager was marked as const. Because the RNTuple's data source has to be created after the loop manager, for this PR the const qualifier has been dropped.
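
To illustrate why the qualifier gets in the way, here is a minimal sketch with hypothetical stand-in types (`DataSource`, `LoopManager`, `SetDataSource` and `GetDataSource` are illustrative names, and holding the source in a `std::unique_ptr` is an assumption of the sketch). A const pointer member can only be set while the object is constructed, so it cannot receive a data source that is created afterwards.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Hypothetical stand-in for a data source; not ROOT's RDataSource.
struct DataSource {
   std::string fLabel;
};

// Hypothetical stand-in for the loop manager.
class LoopManager {
   // If this member were declared `const std::unique_ptr<DataSource>`, it could
   // only be set in a constructor initializer list and the assignment in
   // SetDataSource below would not compile.
   std::unique_ptr<DataSource> fDataSource;

public:
   LoopManager() = default;

   // Attach a data source that was created after the loop manager itself.
   void SetDataSource(std::unique_ptr<DataSource> source)
   {
      assert(source && "expected a non-null data source");
      fDataSource = std::move(source);
   }

   const DataSource *GetDataSource() const { return fDataSource.get(); }
};
```

The trade-off is that the compiler no longer guarantees the member stays fixed after construction, so that invariant has to be kept by convention.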

Move ROOT::RDF::Experimental::FromRNTuple (rev 0a29b02)

For snapshotting RNTuples, we need to include the header file for RNTupleDS in ActionHelpers.hxx. To avoid dependency conflicts related to including ROOT/RDataFrame.hxx, the free FromRNTuple functions have been moved to a separate header.

Current limitations and follow-ups

This PR adds the minimal functionality for (single-threaded) snapshotting to RNTuple. A number of follow-ups are foreseen:

RNTuple write options

No RNTuple-specific write options have been added to RSnapshotOptions yet, except for the compression settings, which were already present as an option. Adding (a subset of) the other RNTupleWriteOptions is trivial.

Default compression settings

RSnapshotOptions' default compression setting is 101 (Zlib). However, RNTuple's default compression setting is 505 (zstd). We could change the default compression setting to kInherit and decide which settings to use according to the target data format (unless explicitly set by the user, of course).
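
ROOT encodes a compression setting as algorithm * 100 + level, so 101 is ZLIB at level 1 and 505 is ZSTD at level 5. The proposal could be sketched roughly as follows. This is not the planned implementation: `Target`, `kInheritSetting`, `MakeSetting` and `ResolveCompression` are hypothetical names, and the sentinel value -1 is an assumption of the sketch.

```cpp
#include <cassert>

// Hypothetical names for this sketch; only the values 101 and 505 come from the text.
enum class Target { kTTree, kRNTuple };

const int kInheritSetting = -1;  // assumed sentinel meaning "not set by the user"
const int kTTreeDefault = 101;   // ZLIB, level 1
const int kRNTupleDefault = 505; // ZSTD, level 5

// Pack an algorithm index and a level into a single setting.
int MakeSetting(int algorithm, int level)
{
   assert(level >= 0 && level < 100);
   return algorithm * 100 + level;
}

// An explicit user setting wins; otherwise pick the default of the target format.
int ResolveCompression(int requested, Target target)
{
   if (requested != kInheritSetting)
      return requested;
   return target == Target::kRNTuple ? kRNTupleDefault : kTTreeDefault;
}
```

With this shape, an explicitly requested setting is never overridden, and the format-specific default only applies when the option is left untouched.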

Multithreaded snapshotting

This PR only adds single-threaded RNTuple snapshotting. Multithreaded (and parallel) snapshotting will be addressed in a follow-up PR.

Tests

Corresponding roottest PR: https://github.com/root-project/roottest/pull/1178

Tests for Windows have been disabled because of permission-denied errors that occur when trying to recreate TFiles that are still open. The regular snapshot tests are also disabled on Windows, presumably for the same reason.

enirolf · Jun 04 '24 16:06