uproot5
Allowing `TBranch::filter_name` to process expressions or `TBranch::expressions` to process regex?
I am trying to develop code which will parse a series of branches and use uproot to read those branches. The branches could be passed as a regex pattern or as an exact name or as an expression.
I am looking to combine all the requested branches into one array and pass them to Uproot in one go. The TBranch argument `filter_name` fails to match expressions, and the `expressions` list fails to process regex patterns. Even if I group expressions together and filters together, `expressions` always overrides the `filter_name` argument. Is it possible to allow the `filter_name` argument to process an expression, or for the `expressions` argument to process regex patterns?
I think that combining the functionality of `filter_name` and `expressions` would severely complicate them. They're kept separate because `*` has a very different meaning in regular expression search than it does in mathematical formulas. Even if there's a way to make this two-pass thing work, isn't that going to lead to more confusion?
I have to ask, why is it important for so much of the analysis to be done in a one-line function call? I don't know of a real reason to use `expressions` as mathematical formulas at all, except for the fact that the functionality had to be implemented somewhere to support TTree's built-in aliases, so it might as well be exposed more generally. Why not use `filter_name` to get a set of uncomputed arrays and do array manipulation in Python statements, rather than in a string? It is the same evaluation; there is no performance advantage or anything.
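A minimal sketch of that suggestion: read the matching branches with `filter_name`, then do the arithmetic in ordinary Python statements. The helper `read_and_compute`, the pattern, and the branch names `branch1` and `branch2` are hypothetical; `tree` is assumed to be an Uproot TTree, or anything else with a compatible `arrays` method.

```python
def read_and_compute(tree, pattern="/branch[12]/i"):
    """Read branches matching a filter, then compute in plain Python.

    `tree` is assumed to provide arrays(filter_name=..., library=...)
    and, with library="np", to return a dict keyed by branch name.
    The branch names used below are hypothetical placeholders.
    """
    # One read: every branch whose name matches the filter.
    arrays = tree.arrays(filter_name=pattern, library="np")

    # The "expression" is now an ordinary Python statement, not a string.
    arrays["new_branch"] = (arrays["branch1"] * arrays["branch2"]) / 2
    return arrays
```

Because the computation is a normal statement, slicing or any other array manipulation can be written the same way, without encoding it in a string first.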
Hi @jpivarski,
Thanks a lot for the quick reply.
You make a fair point about the confusion. I’m not entirely sure how regex processing takes place in the Uproot backend, but I would have thought the use of the syntax `/pattern/i` would distinguish between an expression and a regex pattern?
The reason I was looking into this is that we are writing a framework which will parse a configuration file, so the exact expressions are unknown until runtime. I can’t think of a straightforward way (that doesn’t involve writing an interpreter) to break down the expression from the config, read the branches individually, and then perform the calculation on them. There is also the problem of interpreting slicing expressions (e.g. `Branch[:,0]`).
Yes, the slashes at the beginning and end of the string (`/pattern/`, with possible flags like `i`) are the signal that this is a regex filter:
https://github.com/scikit-hep/uproot4/blob/a17aca8cc07ba203e48fe63dc0d5f0b6749de4a1/src/uproot/_util.py#L156-L161
with
https://github.com/scikit-hep/uproot4/blob/a17aca8cc07ba203e48fe63dc0d5f0b6749de4a1/src/uproot/_util.py#L119
but if you need the calculation to go in stages, first a regex, then a computation, that would get messy. Should there be two strings? Should characters like `*` be quoted as `\*` in the regex phase, then interpreted as operators in the computation phase? I'm just having trouble seeing that as making people's lives easier, rather than harder.
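To make the slash convention concrete, here is a rough sketch of how a filter string could be classified as a regex, a glob, or an exact name. It is not the code behind the links above: the helper `make_name_filter`, the accepted flag letters, and the choice to anchor the regex at the start of the name are all assumptions made for illustration.

```python
import fnmatch
import re

# Assumed shape of the convention: /pattern/ plus optional flag letters.
_REGEX_FILTER = re.compile(r"^/(.*)/([imsx]*)$")


def make_name_filter(spec):
    """Return a predicate (name -> bool) for one filter string."""
    match = _REGEX_FILTER.match(spec)
    if match is not None:
        pattern, flag_letters = match.groups()
        flags = 0
        for letter in flag_letters:
            # "i" -> re.I, "m" -> re.M, "s" -> re.S, "x" -> re.X
            flags |= getattr(re, letter.upper())
        regex = re.compile(pattern, flags)
        # Anchored at the start of the name (a choice made for this sketch).
        return lambda name: regex.match(name) is not None
    if any(ch in spec for ch in "*?["):
        # No slashes, but wildcard characters: treat as a glob.
        return lambda name: fnmatch.fnmatchcase(name, spec)
    # Otherwise the string must equal the branch name exactly.
    return lambda name: name == spec
```

Under these assumptions, a string such as `branch1*branch2` is read as a glob that matches a name like `branch1_times_branch2`, not as a product, which is the kind of ambiguity described above.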
And even then, the `expressions` are not very constrained: they can be any Python expression, which is about half of the Python language (the other half being statements). If I'm not misunderstanding which direction you're going in, you could impose constraints in the configuration language and use that to generate Python. If you're trying to go the other way, taking a Python script and turning it into something declarative, that direction is not possible.
So back to the direction I think you're going in: if you're defining a configuration file language, you can add knobs and dials that generate Python code as easily as they could generate Uproot `expressions` strings. After all, they're both the same language, Python. You mention "writing up an interpreter"; it wouldn't need to be all that, since generating the text of Python code from templates is not as much work as building an interpreter. It's more of a source code to source code translation. Also, by "configuration file language," I don't mean writing a parser, I mean interpreting nested elements in YAML: that's pretty common for configuration files now. The main thing is that I don't see how what you need to generate is worse or harder if you're generating Python source code that gets run through `exec` than it would be if you were generating one line of Python source code that I run through `exec`!
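As a small illustration of that source-to-source idea, the sketch below turns a nested configuration element into the text of a Python expression. The configuration layout, the operator table, and the helper `to_source` are invented for this example; a plain dict stands in for what a YAML loader would return.

```python
# Hypothetical nested configuration, as it might look after loading YAML.
config = {
    "new_branch": {
        "op": "div",
        "args": [{"op": "mul", "args": ["branch1", "branch2"]}, 2],
    },
}

# Assumed operator vocabulary of the configuration language.
_OPERATORS = {"add": "+", "sub": "-", "mul": "*", "div": "/"}


def to_source(node):
    """Translate one configuration node into Python expression text."""
    if isinstance(node, dict):
        symbol = _OPERATORS[node["op"]]
        left, right = (to_source(arg) for arg in node["args"])
        return f"({left} {symbol} {right})"
    if isinstance(node, str):
        # Only plain names are accepted here, which keeps the output constrained.
        if not node.isidentifier():
            raise ValueError(f"not a plain branch name: {node!r}")
        return node
    if isinstance(node, (int, float)):
        return repr(node)
    raise TypeError(f"unsupported configuration node: {node!r}")


source = to_source(config["new_branch"])

# Evaluate the generated text against stand-in values for the branches.
value = eval(source, {"__builtins__": {}}, {"branch1": 4.0, "branch2": 3.0})
```

The generated string could equally be evaluated against arrays that were already read, or handed over as an expression string; either way the configuration language, not the string, is what constrains the user.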
It happens here:
https://github.com/scikit-hep/uproot4/blob/a17aca8cc07ba203e48fe63dc0d5f0b6749de4a1/src/uproot/language/python.py#L159-L184
There's some manipulation of the Python AST to recognize `"nested.branch.name"` as a single variable, rather than unpacking a class instance, but other than that, it all goes through Python's `eval` in the end.
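The idea of treating a dotted name as one variable can be sketched with the standard `ast` module. This is an illustration only, not the linked Uproot code: the class `_CollapseDotted`, the helper `evaluate`, and the rewrite into a lookup on a dict called `arrays` are assumptions, and the sketch assumes Python 3.9 or later (for the `ast.Subscript` layout). It treats every name as a branch, so function calls and comprehensions are out of scope.

```python
import ast


def _dotted_name(node):
    """Return 'a.b.c' if node is a pure chain of names/attributes, else None."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = _dotted_name(node.value)
        if base is not None:
            return base + "." + node.attr
    return None


class _CollapseDotted(ast.NodeTransformer):
    """Rewrite x and a.b.c into arrays["x"] and arrays["a.b.c"]."""

    def __init__(self):
        self.names = []  # branch names, in order of first appearance

    def _lookup(self, name, original):
        if name not in self.names:
            self.names.append(name)
        new = ast.Subscript(
            value=ast.Name(id="arrays", ctx=ast.Load()),
            slice=ast.Constant(value=name),
            ctx=ast.Load(),
        )
        return ast.copy_location(new, original)

    def visit_Name(self, node):
        return self._lookup(node.id, node)

    def visit_Attribute(self, node):
        name = _dotted_name(node)
        if name is None:
            # Attribute of a computed value: leave it, but visit its children.
            return self.generic_visit(node)
        return self._lookup(name, node)


def evaluate(expression, arrays):
    """Evaluate an expression string; return (value, branch names used)."""
    tree = ast.parse(expression, mode="eval")
    collapser = _CollapseDotted()
    tree = ast.fix_missing_locations(collapser.visit(tree))
    code = compile(tree, "<expression>", "eval")
    value = eval(code, {"__builtins__": {}}, {"arrays": arrays})
    return value, collapser.names
```

The list of names collected during the rewrite is what tells the caller which branches need to be read before the final `eval`.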
I should mention that there's an ongoing project to add TTree::Draw as a language for `expressions`, possibly as a future default. Interpreting the strings as Python happened to be the easiest to implement, not our highest preference.
Hi @jpivarski
Thanks a lot for your reply. I will look into your suggestions. I wasn't looking for anything particularly complicated -- the goal was simply to allow a user to specify something like
new_branch = Branch(expression = (branch1*branch2)/2)
regex_branch = Branch(filter=pattern)
and in the back-end I would change pattern -> `/pattern/i` to ensure it is explicitly defined as a regex pattern. I had hoped I would be able to pass this together with the expression for the new branch to `uproot.iterate()`, for example, so that I can read both branches requested by the user. Anyway, thanks a lot for the very detailed and helpful answers :)
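One way the two kinds of request could be reconciled before any read, sketched under stated assumptions: expand each filter against the list of available branch names first, then put the matched names and the formulas into one list of expression strings. The `Branch` dataclass mirrors the pseudocode above but is hypothetical, as is the helper `resolve`; the sketch also assumes every matched branch name is itself usable as an expression.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class Branch:
    """Hypothetical user-facing request: exactly one field should be set."""

    expression: Optional[str] = None
    filter: Optional[str] = None


def resolve(requests, available_names):
    """Turn mixed requests into one flat list of expression strings."""
    resolved = []
    for request in requests:
        if request.expression is not None:
            candidates = [request.expression]
        else:
            # Same intent as rewriting pattern -> /pattern/i.
            regex = re.compile(request.filter, re.IGNORECASE)
            candidates = [n for n in available_names if regex.match(n)]
        for candidate in candidates:
            if candidate not in resolved:
                resolved.append(candidate)
    return resolved


names = ["Muon_pt", "Muon_eta", "Jet_pt", "branch1", "branch2"]
requests = [
    Branch(expression="(branch1*branch2)/2"),
    Branch(filter="muon_.*"),
]
expressions = resolve(requests, names)
```

Here `names` stands in for the branch names of the tree being read, and the resulting `expressions` list is meant to be passed as a single `expressions` argument.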
I think I kept this open to be informative, but we can do that with Discussions now, so I'll convert this Issue into a Discussion.