root icon indicating copy to clipboard operation
root copied to clipboard

[ntuple] Add initial in-memory index prototype

Open enirolf opened this issue 10 months ago • 9 comments

This PR adds (a first version of) the RNTupleIndex, which is an in-memory structure that maps RNTuple field values (or combinations thereof) to an entry index in the RNTuple for which the index was built. Currently, the index only resides in memory and thus has to be (re)build each time. RNTupleIndex will be used by the RNTupleProcessor to enable dataset joins and will be as transparent as possible to users. Currently, no public interface is foreseen.

At this point, no persistification is foreseen, but this might be added in the future. The implementation of the RNTupleIndex in this PR is hash-based. An implementation that is vector-based (but with the same interface) will also be considered.

The idea is to benchmark and evaluate both implementations (and potentially more). Based on the results we can decide which one to actually use (or alternatively make multiple implementations available if they show clear tradeoffs in different use cases).

enirolf avatar Apr 03 '24 08:04 enirolf

Starting build on ROOT-performance-centos8-multicore/soversion, ROOT-ubuntu2204/nortcxxmod, ROOT-ubuntu2004/python3, mac12arm/cxx20, windows10/default How to customize builds

phsft-bot avatar Apr 03 '24 08:04 phsft-bot

Build failed on ROOT-ubuntu2204/nortcxxmod. Running on root-ubuntu-2204-3.cern.ch:/home/sftnight/build/workspace/root-pullrequests-build See console output.

Failing tests:

phsft-bot avatar Apr 03 '24 09:04 phsft-bot

Build failed on mac12arm/cxx20. Running on 194.12.161.128:/Users/sftnight/build/workspace/root-pullrequests-build See console output.

Errors:

  • [2024-04-03T09:41:19.032Z] FAILED: tree/ntuple/CMakeFiles/ROOTNTuple.dir/v7/src/RField.cxx.o
  • [2024-04-03T09:41:19.641Z] /Users/sftnight/build/workspace/root-pullrequests-build/root/tree/ntuple/v7/src/RField.cxx:1076:47: error: return type of out-of-line definition of 'ROOT::Experimental::RFieldBase::GetIndexRepresentation' differs from that in the declaration

Warnings:

  • [2024-04-03T09:41:19.641Z] /Users/sftnight/build/workspace/root-pullrequests-build/root/tree/ntuple/v7/inc/ROOT/RNTupleIndex.hxx:76:68: warning: unused parameter 'lowerBound' [-Wunused-parameter]
  • [2024-04-03T09:41:20.407Z] /Users/sftnight/build/workspace/root-pullrequests-build/root/tree/ntuple/v7/inc/ROOT/RNTupleIndex.hxx:76:68: warning: unused parameter 'lowerBound' [-Wunused-parameter]

phsft-bot avatar Apr 03 '24 09:04 phsft-bot

Build failed on windows10/default. Running on null:C:\build\workspace\root-pullrequests-build See console output.

Errors:

  • [2024-04-03T09:31:03.191Z] C:\build\workspace\root-pullrequests-build\root\tree\ntuple\v7\src\RField.cxx(1077,1): error C2556: 'uint64_t ROOT::Experimental::RFieldBase::GetIndexRepresentation(void *)': overloaded function differs only by return type from 'ROOT::Experimental::NTupleIndexValue_t ROOT::Experimental::RFieldBase::GetIndexRepresentation(void *)' [C:\build\workspace\root-pullrequests-build\build\tree\ntuple\ROOTNTuple.vcxproj]
  • [2024-04-03T09:31:03.192Z] C:\build\workspace\root-pullrequests-build\root\tree\ntuple\v7\src\RField.cxx(1076,47): error C2371: 'ROOT::Experimental::RFieldBase::GetIndexRepresentation': redefinition; different basic types [C:\build\workspace\root-pullrequests-build\build\tree\ntuple\ROOTNTuple.vcxproj]

phsft-bot avatar Apr 03 '24 09:04 phsft-bot

Test Results

    14 files      14 suites   3d 13h 36m 59s :stopwatch:  2 697 tests  2 697 :white_check_mark: 0 :zzz: 0 :x: 35 511 runs  35 511 :white_check_mark: 0 :zzz: 0 :x:

Results for commit 68e90b34.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar Apr 03 '24 09:04 github-actions[bot]

Build failed on ROOT-ubuntu2004/python3. Running on root-ubuntu-2004-1.cern.ch:/home/sftnight/build/workspace/root-pullrequests-build See console output.

phsft-bot avatar Apr 03 '24 14:04 phsft-bot

While thinking about collisions and the index storage, one thought that crossed my mind is to template the index based on the index field type(s). I'm not sure if that's a good idea, one of the immediate question being "do we want RNTupleIndex<std::string> and RNTupleIndex<std::uint64_t> to be different types?" But still maybe something to consider, it might simplify the field value storage (if needed) and the entire hashing business...

hahnjo avatar Jul 22 '24 13:07 hahnjo

While thinking about collisions and the index storage, one thought that crossed my mind is to template the index based on the index field type(s). I'm not sure if that's a good idea, one of the immediate question being "do we want RNTupleIndex<std::string> and RNTupleIndex<std::uint64_t> to be different types?" But still maybe something to consider, it might simplify the field value storage (if needed) and the entire hashing business...

This is how I initially implemented it. Indeed it makes the index itself much more straightforward. However, when I started prototyping the actual join/unaligned friends it really didn't work without making that interface overly complicated so in the end I opted for a non-templated version. Perhaps there's some template trickery to still make it play nice with the foreseen interface (or allow for a slightly different but still simple enough interface), I will think about it for a bit.

enirolf avatar Aug 13 '24 07:08 enirolf

Not sure if it applies here, but a common "template trickery" is to have a type-punned base class that provides enough functionality to cater for the hot paths during the event loop. We already use this in RNTuple, for example with RFieldBase that can be conveniently passed around if you only need its interface functions.

hahnjo avatar Aug 14 '24 11:08 hahnjo

After going back and forth between different resolutions, I've decided to for now only allow integral-type fields as index fields (thus dropping floating-point types and strings), and storing their values as std::uint64_t. For now, we don't have any concrete examples where a non-integral field is desired to be used for indexing, so to stay within the scope of "initial prototype", I think this approach should suffice.

While it will solve the potential index collisions, storing the separate values will incur some memory overhead, but preliminary benchmarks indicate it should be manageable (to be followed-up, though).

enirolf avatar Sep 10 '24 15:09 enirolf

I've decided to for now only allow integral-type fields as index fields (thus dropping floating-point types

Good. Having a floating point 'as is' as the index would be a disaster due to the instability of equality. (One could envision having a floating pointer index only the user provide a transformation from floating point to integer).

pcanal avatar Sep 20 '24 21:09 pcanal