root
root copied to clipboard
[ntuple] Add initial in-memory index prototype
This PR adds (a first version of) the RNTupleIndex
, which is an in-memory structure that maps RNTuple field values (or combinations thereof) to an entry index in the RNTuple for which the index was built. Currently, the index only resides in memory and thus has to be (re)build each time. RNTupleIndex
will be used by the RNTupleProcessor
to enable dataset joins and will be as transparent as possible to users. Currently, no public interface is foreseen.
At this point, no persistification is foreseen, but this might be added in the future. The implementation of the RNTupleIndex
in this PR is hash-based. An implementation that is vector-based (but with the same interface) will also be considered.
The idea is to benchmark and evaluate both implementations (and potentially more). Based on the results we can decide which one to actually use (or alternatively make multiple implementations available if they show clear tradeoffs in different use cases).
Starting build on ROOT-performance-centos8-multicore
/soversion
, ROOT-ubuntu2204
/nortcxxmod
, ROOT-ubuntu2004
/python3
, mac12arm
/cxx20
, windows10
/default
How to customize builds
Build failed on ROOT-ubuntu2204/nortcxxmod. Running on root-ubuntu-2204-3.cern.ch:/home/sftnight/build/workspace/root-pullrequests-build See console output.
Failing tests:
Build failed on mac12arm/cxx20. Running on 194.12.161.128:/Users/sftnight/build/workspace/root-pullrequests-build See console output.
Errors:
- [2024-04-03T09:41:19.032Z] FAILED: tree/ntuple/CMakeFiles/ROOTNTuple.dir/v7/src/RField.cxx.o
- [2024-04-03T09:41:19.641Z] /Users/sftnight/build/workspace/root-pullrequests-build/root/tree/ntuple/v7/src/RField.cxx:1076:47: error: return type of out-of-line definition of 'ROOT::Experimental::RFieldBase::GetIndexRepresentation' differs from that in the declaration
Warnings:
- [2024-04-03T09:41:19.641Z] /Users/sftnight/build/workspace/root-pullrequests-build/root/tree/ntuple/v7/inc/ROOT/RNTupleIndex.hxx:76:68: warning: unused parameter 'lowerBound' [-Wunused-parameter]
- [2024-04-03T09:41:20.407Z] /Users/sftnight/build/workspace/root-pullrequests-build/root/tree/ntuple/v7/inc/ROOT/RNTupleIndex.hxx:76:68: warning: unused parameter 'lowerBound' [-Wunused-parameter]
Build failed on windows10/default. Running on null:C:\build\workspace\root-pullrequests-build See console output.
Errors:
- [2024-04-03T09:31:03.191Z] C:\build\workspace\root-pullrequests-build\root\tree\ntuple\v7\src\RField.cxx(1077,1): error C2556: 'uint64_t ROOT::Experimental::RFieldBase::GetIndexRepresentation(void *)': overloaded function differs only by return type from 'ROOT::Experimental::NTupleIndexValue_t ROOT::Experimental::RFieldBase::GetIndexRepresentation(void *)' [C:\build\workspace\root-pullrequests-build\build\tree\ntuple\ROOTNTuple.vcxproj]
- [2024-04-03T09:31:03.192Z] C:\build\workspace\root-pullrequests-build\root\tree\ntuple\v7\src\RField.cxx(1076,47): error C2371: 'ROOT::Experimental::RFieldBase::GetIndexRepresentation': redefinition; different basic types [C:\build\workspace\root-pullrequests-build\build\tree\ntuple\ROOTNTuple.vcxproj]
Test Results
14 files 14 suites 3d 13h 36m 59s :stopwatch: 2 697 tests 2 697 :white_check_mark: 0 :zzz: 0 :x: 35 511 runs 35 511 :white_check_mark: 0 :zzz: 0 :x:
Results for commit 68e90b34.
:recycle: This comment has been updated with latest results.
Build failed on ROOT-ubuntu2004/python3. Running on root-ubuntu-2004-1.cern.ch:/home/sftnight/build/workspace/root-pullrequests-build See console output.
While thinking about collisions and the index storage, one thought that crossed my mind is to template the index based on the index field type(s). I'm not sure if that's a good idea, one of the immediate question being "do we want RNTupleIndex<std::string>
and RNTupleIndex<std::uint64_t>
to be different types?" But still maybe something to consider, it might simplify the field value storage (if needed) and the entire hashing business...
While thinking about collisions and the index storage, one thought that crossed my mind is to template the index based on the index field type(s). I'm not sure if that's a good idea, one of the immediate question being "do we want
RNTupleIndex<std::string>
andRNTupleIndex<std::uint64_t>
to be different types?" But still maybe something to consider, it might simplify the field value storage (if needed) and the entire hashing business...
This is how I initially implemented it. Indeed it makes the index itself much more straightforward. However, when I started prototyping the actual join/unaligned friends it really didn't work without making that interface overly complicated so in the end I opted for a non-templated version. Perhaps there's some template trickery to still make it play nice with the foreseen interface (or allow for a slightly different but still simple enough interface), I will think about it for a bit.
Not sure if it applies here, but a common "template trickery" is to have a type-punned base class that provides enough functionality to cater for the hot paths during the event loop. We already use this in RNTuple, for example with RFieldBase
that can be conveniently passed around if you only need its interface functions.
After going back and forth between different resolutions, I've decided to for now only allow integral-type fields as index fields (thus dropping floating-point types and strings), and storing their values as std::uint64_t
. For now, we don't have any concrete examples where a non-integral field is desired to be used for indexing, so to stay within the scope of "initial prototype", I think this approach should suffice.
While it will solve the potential index collisions, storing the separate values will incur some memory overhead, but preliminary benchmarks indicate it should be manageable (to be followed-up, though).
I've decided to for now only allow integral-type fields as index fields (thus dropping floating-point types
Good. Having a floating point 'as is' as the index would be a disaster due to the instability of equality. (One could envision having a floating pointer index only the user provide a transformation from floating point to integer).