[tree] Error out if entry is not found in friend tree
This is to kickstart the discussion related to https://github.com/root-project/root/issues/7713
For the issue reproducer, this is the output with this patch
0 0
10 10
Error in <TTree::GetEntry>: Entry 2 not found in friend tree 'friend'
20 10
From here we can decide to go in many directions but I would like to get opinions first.
The CI is confirming a suspect I had, that is in our tests using friend trees we have cases of entries not being found. Those will have to be investigated to understand if it's intended or if it's a bug
Test Results
14 files 14 suites 3d 2h 59m 32s :stopwatch: 2 704 tests 2 704 :white_check_mark: 0 :zzz: 0 :x: 35 594 runs 35 594 :white_check_mark: 0 :zzz: 0 :x:
Results for commit f744d44b.
:recycle: This comment has been updated with latest results.
This PR is now based on https://github.com/root-project/root/pull/16162 and https://github.com/root-project/root/pull/16165
The changes are divided in two commits: those related to TTreeReader alone and those related to RDataFrame.
Tests are missing for the RDF-related changes, will be added depending on the first round of review
(similar to
TTree::Draw's$Altoperation)
This reminds me ... what about the case where the column is a collection? (In particular $Alt was implemented to fill-in collection within an entry (missing 'instances' more than missing entries)).
This reminds me ... what about the case where the column is a collection? (In particular $Alt was implemented to fill-in collection within an entry (missing 'instances' more than missing entries)).
Could you provide an example of how this can happen? AFAIU, it is not possible to read an entry from a branch of type say std::vector<int> where some entries of the vector are available and some not (how would that even make sense?).
@pcanal Tutorials and tests were added, passing on all platforms (windows failure is about the file used in the tutorial being stuck by another process). Let me know what you think, thanks!
Could you provide an example of how this can happen? AFAIU, it is not possible to read an entry from a branch of type say std::vector
where some entries of the vector are available and some not (how would that even make sense?).
It is happening when combining 2 columns from 2 distincts collection:
tree->Draw("vec_1.px + Alt$(vec_2.delta_px, 0)");
where vec_1.size() happens to be different from vec_2.sizeI() but we still want to plot the data for all element of vec_1
A completely related recent post on the forum: https://root-forum.cern.ch/t/dealing-with-columns-of-different-lengths-in-rdf/61642/2
(windows failure is about the file used in the tutorial being stuck by another process).
However I am not seeing quickly the reason for it but if it happened here, it will still happen after merging, so we do not need to get to the bottom of it (i.e. fix/work-around it)
Could you provide an example of how this can happen? AFAIU, it is not possible to read an entry from a branch of type say std::vector where some entries of the vector are available and some not (how would that even make sense?).
It is happening when combining 2 columns from 2 distincts collection:
tree->Draw("vec_1.px + Alt$(vec_2.delta_px, 0)");where
vec_1.size()happens to be different fromvec_2.sizeI()but we still want to plot the data for all element ofvec_1A completely related recent post on the forum: https://root-forum.cern.ch/t/dealing-with-columns-of-different-lengths-in-rdf/61642/2
This use case is fairly simple with RDataFrame and has been supported for a very long time, also it is really unrelated to this PR. Here is an example of how to deal with such cases:
#include <ROOT/RDataFrame.hxx>
#include <ROOT/RVec.hxx>
#include <TFile.h>
#include <TTree.h>
struct Dataset
{
constexpr static auto filename{"myfile.root"};
constexpr static auto treename{"mytree"};
Dataset()
{
TFile f{filename, "recreate"};
TTree t{treename, treename};
std::vector<float> vec1{1.1f, 2.2f, 3.3f, 4.4f, 5.5f};
std::vector<float> vec2{6.6f, 7.7f};
t.Branch("vec1", &vec1);
t.Branch("vec2", &vec2);
t.Fill();
t.Write();
}
~Dataset()
{
std::remove(filename);
}
};
int main()
{
Dataset dataset;
ROOT::RDataFrame df{dataset.treename, dataset.filename};
auto display = df.Define("vec3", [](const ROOT::RVecF &vec1, const ROOT::RVecF &vec2)
{ return vec1 + ROOT::VecOps::Take(vec2, vec1.size(), 10.f); }, {"vec1", "vec2"})
.Display<ROOT::RVecF, ROOT::RVecF, ROOT::RVecF>({"vec1", "vec2", "vec3"});
display->Print();
}
However I am not seeing quickly the reason for it but if it happened here, it will still happen after merging, so we do not need to get to the bottom of it (i.e. fix/work-around it)
I will check this with Bertrand, but meanwhile we can go ahead with the review irrespective of the Windows failure
This use case is fairly simple with RDataFrame and has been supported for a very long time, also it is really unrelated to this PR. Here is an example of how to deal with such cases: ....
Not easy to read/parse but fair enough.
Alt$vsRVec::Take with back fill valuealso it is really unrelated to this PR.
It is related to this PR as the 2 features needs to work together. (I.e. case where the user relied on RVec::Take with back fill value and there is some times where that column is also completely missing.
Not easy to read/parse but fair enough.
I gave a fully compiled C++ reproducer, clearly the syntax can get as simple as
df.Define("vec3", "vec1 + Take(vec2, vec1.size(), 99)").Display({"vec3"});
Getting quite close to TTree::Draw's compactness. Although, for the sake of completeness, let me state that it is personally not my preferred way.
Alt$vsRVec::Take with back fill valuealso it is really unrelated to this PR.It is related to this PR as the 2 features needs to work together. (I.e. case where the user relied on
RVec::Take with back fill valueand there is some times where that column is also completely missing.
The usage of ROOT::VecOps::Take to create an RVec filled with elements of an existing vector or a default user-provided value is a completely unrelated feature w.r.t. this PR.
But, reading this comment of yours leads me to thinking what you are looking for is a unit test that makes sure the handling of missing values works with vector-like input branches, including providing default values via DefaultFor for those branches. Some test like the following example
#include <ROOT/RDataFrame.hxx>
#include <ROOT/RVec.hxx>
#include <TFile.h>
#include <TTree.h>
struct Dataset
{
constexpr static auto filename_1{"myfile_1.root"};
constexpr static auto filename_2{"myfile_2.root"};
constexpr static auto treename{"mytree"};
Dataset()
{
{
TFile f{filename_1, "recreate"};
TTree t{treename, treename};
std::vector<float> vec1{1.1f, 2.2f, 3.3f, 4.4f, 5.5f};
std::vector<float> vec2{6.6f, 7.7f};
t.Branch("vec1", &vec1);
t.Branch("vec2", &vec2);
t.Fill();
t.Write();
}
{
TFile f{filename_2, "recreate"};
TTree t{treename, treename};
std::vector<float> vec1{1.1f, 2.2f, 3.3f, 4.4f, 5.5f};
t.Branch("vec1", &vec1);
t.Fill();
t.Write();
}
}
~Dataset()
{
std::remove(filename_1);
std::remove(filename_2);
}
};
int main()
{
Dataset dataset;
ROOT::RDataFrame df{dataset.treename, {dataset.filename_1, dataset.filename_2}};
auto display = df
.DefaultFor("vec2", ROOT::RVecF{100.f, 200.f, 300.f})
.Define("vec3", [](const ROOT::RVecF &vec1, const ROOT::RVecF &vec2)
{ return vec1 + ROOT::VecOps::Take(vec2, vec1.size(), 10.f); }, {"vec1", "vec2"})
.Display<ROOT::RVecF, ROOT::RVecF, ROOT::RVecF>({"vec1", "vec2", "vec3"});
display->Print();
}
If this is what you are after, I agree with this idea and I will add it to the unit testing suite of this PR.
If this is what you are after, I agree with this idea and I will add it to the unit testing suite of this PR.
Yes, this is exactly the kind of example I was looking for. Thanks.
@martamaja10 @pcanal @hageboeck All comments addressed, ready for another round of review 😄
@martamaja10 @pcanal @hageboeck @dpiparo The latest commits includes the final naming scheme for the API extension:
DefaultValueFor(colname, defaultval): lets the user provide one default value for the current entry of the input column, in case the value is missing.FilterAvailable(colname): works in the same way as the traditionalFilteroperation, where the "expression" is "is the value available?". If so, the entry is kept, if not, it is discarded.FilterMissing(colname): works in the same way as the traditionalFilteroperation, where the "expression" is "is the value missing?". If so, the entry is kept, if not, it is discarded.
FilterMissing(colname): works in the same way as the traditional Filter operation, where the "expression" is "is the value missing?". If so, the entry is kept, if not, it is discarded.
If I read that correctly, it keeps only the entries for which that column has no entries, right? (i.e. as a consequence there should be no use of that column in the graph otherwise there will be an exception thrown, right?)
If I read that correctly, it keeps only the entries for which that column has no entries, right? (i.e. as a consequence there should be not use of that column in the graph otherwise there will be an exception thrown, right?)
Yes and yes.
Thanks Vincenzo! It all looks good to me - just one quick question, since now there is "FilterMissing" as well, should it be added to the tutorials as well or would that be too confusing - just thinking out loud here
Thanks Vincenzo! It all looks good to me - just one quick question, since now there is "FilterMissing" as well, should it be added to the tutorials as well or would that be too confusing - just thinking out loud here
Yes indeed I added it to all the tutorials in this PR, you can find it in the snippets preceded by the Example 3 comments.
Yes indeed I added it to all the tutorials in this PR, you can find it in the snippets preceded by the
Example 3comments.
I must have missed it then, looks good this way! Approving the PR :)
Hi @vepadulano,
Should these three node types also be added in DistRDF/Operation.py ?
If yes, I can make a PR for that a bit like the Alias one