Enhanced functionality for RDF Define
This pull request adds functionality to RDF Define in C++ in order to enable new features via pythonization.
New features in C++:
- more flexibility in template arguments, so that argument and return types can be specified independently of the callable type. Types no longer have to match exactly, and implicit conversions (between float and double, etc.) are possible. Specifying the Define result type independently of the inferred function return type may also facilitate future optimizations related to suppressing dynamic memory allocation.
- support for callables with overloaded operator() as long as argument types are explicitly specified
- reduced copying/moving of callable
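As a rough illustration of the first two points, the sketch below shows how fixing the argument and return types independently of the callable both selects one overload of operator() and permits implicit conversions. It assumes C++17. `Squarer` and `TypedCall` are hypothetical names made up for this example; this is not the Define template machinery from the PR.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical callable with an overloaded operator(), similar in spirit
// to the Callable class used in the examples of this description.
struct Squarer {
    float operator()(float x) const { return x * x; }
    double operator()(double x) const { return x * x; }
};

// Hypothetical helper (an assumption, not the RDF implementation):
// Ret and Args are given explicitly instead of being deduced from F.
template <typename Ret, typename... Args>
struct TypedCall {
    template <typename F>
    static Ret invoke(F &f, Args... args) {
        // Inputs are converted to Args... at the call boundary, so the call
        // below sees exactly those types and picks a single overload.
        // The result is then converted to the requested return type.
        return static_cast<Ret>(f(args...));
    }
};

// With a float argument the float overload is selected, and likewise for double.
static_assert(std::is_same_v<
    decltype(std::declval<Squarer &>()(std::declval<float &>())), float>);
static_assert(std::is_same_v<
    decltype(std::declval<Squarer &>()(std::declval<double &>())), double>);
```

For example, `TypedCall<double, float>::invoke(sq, 3)` converts the int argument to float, calls the float overload, and returns the result as a double.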
New features from pythonization of Define:
- support for free functions, static class member functions, and bound class member functions, with or without additional template arguments; overloaded functions are handled transparently from PyROOT.
- support for expression strings, and a complete lambda expression passed as a string
- automatic inference of types from the graph (which don't have to exactly match the callable argument types as long as implicit conversion is available)
- automatic inference of column names from callables in case column names are not explicitly provided (but only when available and unambiguous)
- all variations are jitted with fully templated types for maximum possible inlining
Given the following definitions in C++:
float squared(float x) { return x*x; }
double squared(double x) { return x*x; }
template<typename T>
T squared(T x) { return x*x; }
float squared2(float x) { return x*x; }
double squared2(double y) { return y*y; }
class Callable {
public:
float operator() (float x) { return x*x; }
double operator() (double x) { return x*x; }
template<typename T>
T operator() (T x) { return x*x; }
float squared(float x) { return x*x; }
double squared(double x) { return x*x; }
template<typename T>
T squared(T x) { return x*x; }
static float mul(float x, float y) { return x*y; }
};
then in Python one can do:
d = ROOT.ROOT.RDataFrame(chain)
#overload resolved by type of column
d = d.Define("overloadedFreePtsq", ROOT.squared, ["Muon_pt"])
#works as long as there is a column "x" with type implicitly convertible to float or double
d = d.Define("overloadedFreePtsq", ROOT.squared)
#will fail because argument name determination from overloads is ambiguous
d = d.Define("overloadedFreePtsq", ROOT.squared2)
#templated free function
d = d.Define("templatedFreeNmuonsSq", ROOT.squared["int"], ["nMuons"])
#static class member
d = d.Define("staticPtsq", ROOT.Callable.mul, ["Muon_pt", "Muon_pt"])
#class member
d = d.Define("overloadedMemberPtsq", ROOT.Callable().squared, ["Muon_pt"])
#templated class member
d = d.Define("templatedMemberNmuonsSq", ROOT.Callable().squared["int"], ["nMuons"])
#overloaded operator()
d = d.Define("overloadedCallPtsq", ROOT.Callable(), ["Muon_pt"])
#string expression
d = d.Define("lambdaPtsq", "Muon_pt*Muon_pt")
#complete lambda expression (direct jitting without parsing)
d = d.Define("lambdaPtsq", "[](float x) { return x*x; }", ["Muon_pt"])
#complete lambda expression with inferred column names and argument types (direct jitting without parsing, argument names determined from cling after jitting)
d = d.Define("lambdaAutoPtsq", "[](auto Muon_pt) { return Muon_pt*Muon_pt; }")
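The rule for inferring column names from overloaded callables (squared works, squared2 fails) can be sketched roughly as follows. This is only a model of the decision, assuming C++17: in the PR the argument names are extracted through cling, whereas here they are passed in directly, and `ParamNames` and `inferColumns` are hypothetical names.

```cpp
#include <cassert>
#include <optional>
#include <string>
#include <vector>

// Parameter names of one overload, in declaration order.
using ParamNames = std::vector<std::string>;

// Hypothetical helper: return the parameter names shared by all overloads,
// or std::nullopt when there are no overloads or the overloads disagree
// (in which case the column names have to be given explicitly).
std::optional<ParamNames> inferColumns(const std::vector<ParamNames> &overloads) {
    if (overloads.empty()) {
        return std::nullopt;
    }
    for (const auto &names : overloads) {
        if (names != overloads.front()) {
            return std::nullopt; // ambiguous, e.g. squared2(float x) vs squared2(double y)
        }
    }
    return overloads.front();
}
```

Under this model, all overloads of squared name their parameter x, so the column "x" is used, while squared2 mixes x and y and is rejected as ambiguous.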
A few remaining issues here:
- "Warning: failed - offset calculation between TList and TViewPubFunctions" appears in some cases (related to extracting argument names through the cling machinery).
- cppyy gives limited or obscure errors on template instantiation. In this case the templates are instantiated in code compiled by TClingCallFunc::compile_wrapper(), which apparently has different verbosity than TCling::Declare. (Jitting through this route also does not forcibly disable the null pointer check like Declare does, and there may be other subtleties.) In particular, there is one nasty and fairly common case, where the number of columns passed to Define does not match the number of arguments of the callable, which produces totally incomprehensible errors compared to compiling the same code with gcc (or even with TCling::Declare).
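One possible way to get a readable diagnostic for the column-count mismatch, sketched here under the assumption of C++17 and not taken from the PR, is to test invocability before instantiating the call. `callableWithColumns` is a hypothetical name, and the Callable below is a reduced copy of the class from the examples above.

```cpp
#include <cassert>
#include <type_traits>

// Reduced copy of the example class: overloaded operator() taking one argument.
struct Callable {
    float operator()(float x) const { return x * x; }
    double operator()(double x) const { return x * x; }
};

// Hypothetical pre-check: true when F can be called with one lvalue
// argument per column, of the given column types.
template <typename F, typename... ColTypes>
inline constexpr bool callableWithColumns =
    std::is_invocable_v<F &, ColTypes &...>;
```

A wrapper could evaluate this in a static_assert with an explicit message (for instance naming the expected and received number of columns) before the call itself is instantiated, so that the first error reported is the meaningful one.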
All of the above functionality can be easily extended to Filter once we're happy with the direction.
Can one of the admins verify this patch?
This pull request introduces 4 alerts when merging 5f7bce7adc8c414634c02090c6791c54fff0fc95 into c418ceb17affddc4879f446cf27c93d5eeaa6388 - view on LGTM.com
new alerts:
- 3 for Module is imported with 'import' and 'import from'
- 1 for Unused local variable
This pull request introduces 1 alert when merging e35ace75a55253971049ee8be16e86e63a103533 into 3e4918f960640b5ae09b47e3b7ff5680c891dbff - view on LGTM.com
new alerts:
- 1 for Unused local variable
As this needs a rebase anyway, the easiest way to move it forward is probably to split it in 2 PRs: one for the changes in cling and PyROOT (where we can try to solve the remaining related issues), one for the new RDF features (where we can add the new features to Filter too, and the corresponding tests -- I can help here).
Yes, agreed. I'll have a look at this next week. The tradeoffs between jitting and run time may still need to be discussed further (or maybe this is no longer relevant if/when https://github.com/root-project/cling/issues/443 can be addressed).