root icon indicating copy to clipboard operation
root copied to clipboard

[tree] Error out if entry is not found in friend tree

Open vepadulano opened this issue 1 year ago • 2 comments

This is to kickstart the discussion related to https://github.com/root-project/root/issues/7713

For the issue reproducer, this is the output with this patch

0 0
10 10
Error in <TTree::GetEntry>: Entry 2 not found in friend tree 'friend'
20 10

From here we can decide to go in many directions but I would like to get opinions first.

vepadulano avatar May 24 '24 15:05 vepadulano

The CI is confirming a suspect I had, that is in our tests using friend trees we have cases of entries not being found. Those will have to be investigated to understand if it's intended or if it's a bug

vepadulano avatar May 24 '24 17:05 vepadulano

Test Results

    14 files      14 suites   3d 2h 59m 32s :stopwatch:  2 704 tests  2 704 :white_check_mark: 0 :zzz: 0 :x: 35 594 runs  35 594 :white_check_mark: 0 :zzz: 0 :x:

Results for commit f744d44b.

:recycle: This comment has been updated with latest results.

github-actions[bot] avatar May 24 '24 18:05 github-actions[bot]

This PR is now based on https://github.com/root-project/root/pull/16162 and https://github.com/root-project/root/pull/16165

The changes are divided in two commits: those related to TTreeReader alone and those related to RDataFrame.

Tests are missing for the RDF-related changes, will be added depending on the first round of review

vepadulano avatar Sep 01 '24 13:09 vepadulano

(similar to TTree::Draw's $Alt operation)

This reminds me ... what about the case where the column is a collection? (In particular $Alt was implemented to fill-in collection within an entry (missing 'instances' more than missing entries)).

pcanal avatar Sep 07 '24 18:09 pcanal

This reminds me ... what about the case where the column is a collection? (In particular $Alt was implemented to fill-in collection within an entry (missing 'instances' more than missing entries)).

Could you provide an example of how this can happen? AFAIU, it is not possible to read an entry from a branch of type say std::vector<int> where some entries of the vector are available and some not (how would that even make sense?).

vepadulano avatar Sep 07 '24 21:09 vepadulano

@pcanal Tutorials and tests were added, passing on all platforms (windows failure is about the file used in the tutorial being stuck by another process). Let me know what you think, thanks!

vepadulano avatar Sep 12 '24 04:09 vepadulano

Could you provide an example of how this can happen? AFAIU, it is not possible to read an entry from a branch of type say std::vector where some entries of the vector are available and some not (how would that even make sense?).

It is happening when combining 2 columns from 2 distincts collection:

tree->Draw("vec_1.px + Alt$(vec_2.delta_px, 0)");

where vec_1.size() happens to be different from vec_2.sizeI() but we still want to plot the data for all element of vec_1

A completely related recent post on the forum: https://root-forum.cern.ch/t/dealing-with-columns-of-different-lengths-in-rdf/61642/2

pcanal avatar Sep 12 '24 17:09 pcanal

(windows failure is about the file used in the tutorial being stuck by another process).

However I am not seeing quickly the reason for it but if it happened here, it will still happen after merging, so we do not need to get to the bottom of it (i.e. fix/work-around it)

pcanal avatar Sep 12 '24 17:09 pcanal

Could you provide an example of how this can happen? AFAIU, it is not possible to read an entry from a branch of type say std::vector where some entries of the vector are available and some not (how would that even make sense?).

It is happening when combining 2 columns from 2 distincts collection:

tree->Draw("vec_1.px + Alt$(vec_2.delta_px, 0)");

where vec_1.size() happens to be different from vec_2.sizeI() but we still want to plot the data for all element of vec_1

A completely related recent post on the forum: https://root-forum.cern.ch/t/dealing-with-columns-of-different-lengths-in-rdf/61642/2

This use case is fairly simple with RDataFrame and has been supported for a very long time, also it is really unrelated to this PR. Here is an example of how to deal with such cases:

#include <ROOT/RDataFrame.hxx>
#include <ROOT/RVec.hxx>
#include <TFile.h>
#include <TTree.h>

struct Dataset
{
    constexpr static auto filename{"myfile.root"};
    constexpr static auto treename{"mytree"};
    Dataset()
    {
        TFile f{filename, "recreate"};
        TTree t{treename, treename};

        std::vector<float> vec1{1.1f, 2.2f, 3.3f, 4.4f, 5.5f};
        std::vector<float> vec2{6.6f, 7.7f};

        t.Branch("vec1", &vec1);
        t.Branch("vec2", &vec2);
        t.Fill();
        t.Write();
    }

    ~Dataset()
    {
        std::remove(filename);
    }
};

int main()
{
    Dataset dataset;
    ROOT::RDataFrame df{dataset.treename, dataset.filename};
    auto display = df.Define("vec3", [](const ROOT::RVecF &vec1, const ROOT::RVecF &vec2)
                             { return vec1 + ROOT::VecOps::Take(vec2, vec1.size(), 10.f); }, {"vec1", "vec2"})
                       .Display<ROOT::RVecF, ROOT::RVecF, ROOT::RVecF>({"vec1", "vec2", "vec3"});
    display->Print();
}

vepadulano avatar Sep 16 '24 07:09 vepadulano

However I am not seeing quickly the reason for it but if it happened here, it will still happen after merging, so we do not need to get to the bottom of it (i.e. fix/work-around it)

I will check this with Bertrand, but meanwhile we can go ahead with the review irrespective of the Windows failure

vepadulano avatar Sep 16 '24 07:09 vepadulano

This use case is fairly simple with RDataFrame and has been supported for a very long time, also it is really unrelated to this PR. Here is an example of how to deal with such cases: ....

Not easy to read/parse but fair enough.

pcanal avatar Sep 16 '24 17:09 pcanal

Alt$ vs RVec::Take with back fill value also it is really unrelated to this PR.

It is related to this PR as the 2 features needs to work together. (I.e. case where the user relied on RVec::Take with back fill value and there is some times where that column is also completely missing.

pcanal avatar Sep 16 '24 17:09 pcanal

Not easy to read/parse but fair enough.

I gave a fully compiled C++ reproducer, clearly the syntax can get as simple as

df.Define("vec3", "vec1 + Take(vec2, vec1.size(), 99)").Display({"vec3"});

Getting quite close to TTree::Draw's compactness. Although, for the sake of completeness, let me state that it is personally not my preferred way.

vepadulano avatar Sep 16 '24 20:09 vepadulano

Alt$ vs RVec::Take with back fill value also it is really unrelated to this PR.

It is related to this PR as the 2 features needs to work together. (I.e. case where the user relied on RVec::Take with back fill value and there is some times where that column is also completely missing.

The usage of ROOT::VecOps::Take to create an RVec filled with elements of an existing vector or a default user-provided value is a completely unrelated feature w.r.t. this PR.

But, reading this comment of yours leads me to thinking what you are looking for is a unit test that makes sure the handling of missing values works with vector-like input branches, including providing default values via DefaultFor for those branches. Some test like the following example

#include <ROOT/RDataFrame.hxx>
#include <ROOT/RVec.hxx>
#include <TFile.h>
#include <TTree.h>

struct Dataset
{
    constexpr static auto filename_1{"myfile_1.root"};
    constexpr static auto filename_2{"myfile_2.root"};
    constexpr static auto treename{"mytree"};
    Dataset()
    {
        {
            TFile f{filename_1, "recreate"};
            TTree t{treename, treename};

            std::vector<float> vec1{1.1f, 2.2f, 3.3f, 4.4f, 5.5f};
            std::vector<float> vec2{6.6f, 7.7f};

            t.Branch("vec1", &vec1);
            t.Branch("vec2", &vec2);
            t.Fill();
            t.Write();
        }

        {
            TFile f{filename_2, "recreate"};
            TTree t{treename, treename};

            std::vector<float> vec1{1.1f, 2.2f, 3.3f, 4.4f, 5.5f};

            t.Branch("vec1", &vec1);
            t.Fill();
            t.Write();
        }
    }

    ~Dataset()
    {
        std::remove(filename_1);
        std::remove(filename_2);
    }
};

int main()
{
    Dataset dataset;
    ROOT::RDataFrame df{dataset.treename, {dataset.filename_1, dataset.filename_2}};
    auto display = df
                       .DefaultFor("vec2", ROOT::RVecF{100.f, 200.f, 300.f})
                       .Define("vec3", [](const ROOT::RVecF &vec1, const ROOT::RVecF &vec2)
                               { return vec1 + ROOT::VecOps::Take(vec2, vec1.size(), 10.f); }, {"vec1", "vec2"})
                       .Display<ROOT::RVecF, ROOT::RVecF, ROOT::RVecF>({"vec1", "vec2", "vec3"});
    display->Print();
}

If this is what you are after, I agree with this idea and I will add it to the unit testing suite of this PR.

vepadulano avatar Sep 16 '24 21:09 vepadulano

If this is what you are after, I agree with this idea and I will add it to the unit testing suite of this PR.

Yes, this is exactly the kind of example I was looking for. Thanks.

pcanal avatar Sep 16 '24 22:09 pcanal

@martamaja10 @pcanal @hageboeck All comments addressed, ready for another round of review 😄

vepadulano avatar Sep 20 '24 13:09 vepadulano

@martamaja10 @pcanal @hageboeck @dpiparo The latest commits includes the final naming scheme for the API extension:

  • DefaultValueFor(colname, defaultval): lets the user provide one default value for the current entry of the input column, in case the value is missing.
  • FilterAvailable(colname): works in the same way as the traditional Filter operation, where the "expression" is "is the value available?". If so, the entry is kept, if not, it is discarded.
  • FilterMissing(colname): works in the same way as the traditional Filter operation, where the "expression" is "is the value missing?". If so, the entry is kept, if not, it is discarded.

vepadulano avatar Sep 23 '24 15:09 vepadulano

FilterMissing(colname): works in the same way as the traditional Filter operation, where the "expression" is "is the value missing?". If so, the entry is kept, if not, it is discarded.

If I read that correctly, it keeps only the entries for which that column has no entries, right? (i.e. as a consequence there should be no use of that column in the graph otherwise there will be an exception thrown, right?)

pcanal avatar Sep 23 '24 16:09 pcanal

If I read that correctly, it keeps only the entries for which that column has no entries, right? (i.e. as a consequence there should be not use of that column in the graph otherwise there will be an exception thrown, right?)

Yes and yes.

vepadulano avatar Sep 23 '24 16:09 vepadulano

Thanks Vincenzo! It all looks good to me - just one quick question, since now there is "FilterMissing" as well, should it be added to the tutorials as well or would that be too confusing - just thinking out loud here

martamaja10 avatar Sep 24 '24 07:09 martamaja10

Thanks Vincenzo! It all looks good to me - just one quick question, since now there is "FilterMissing" as well, should it be added to the tutorials as well or would that be too confusing - just thinking out loud here

Yes indeed I added it to all the tutorials in this PR, you can find it in the snippets preceded by the Example 3 comments.

vepadulano avatar Sep 24 '24 10:09 vepadulano

Yes indeed I added it to all the tutorials in this PR, you can find it in the snippets preceded by the Example 3 comments.

I must have missed it then, looks good this way! Approving the PR :)

martamaja10 avatar Sep 24 '24 10:09 martamaja10

Hi @vepadulano,

Should these three node types also be added in DistRDF/Operation.py ?

If yes, I can make a PR for that a bit like the Alias one

gpetruc avatar Dec 06 '24 09:12 gpetruc