uproot5
uproot5 copied to clipboard
feat: implement TGraph-writing
This is a draft. It needs TGraphErrors, TGraphAsymmErrors, tests of writing to actual files and reading them back with ROOT, and maybe some high-level interface so that we can
output_file["name"] = XYZ
for a Python object XYZ that would be interpreted as a TGraph (or TGraphErrors, or TGraphAsymmErrors). What should that XYZ be? Not raw NumPy arrays, since a tuple of NumPy arrays is already interpreted as a histogram. Does it need to be some kind of uproot.make_TGraph
function? It would be nice to have something more Pythonic.
Should any Pandas DataFrame with columns named x
and y
be interpreted as a TGraph? A Pandas DataFrame would supply a title.
Thoughts, @grzanka?
Should any Pandas DataFrame with columns named
x
andy
be interpreted as a TGraph? A Pandas DataFrame would supply a title.Thoughts, @grzanka?
Another candidate may be a python dataclass with x
and y
fields containing arrays of numbers ? Or numpy recarray ?
As discussed here: https://github.com/scikit-hep/uproot5/discussions/1142#discussioncomment-8590182 we may create confusion by considering an object that can be converted both to TTree and TGraph.
Hello @jpivarski! I am a student of @grzanka, and I would like to join to this issue. I have begun to understand how different objects are transformed so they can be saved to a root file, and I believe I got good grasp of it. I have started to implement prototype of a function to_TGraph(), which, when given two arrays of values (and some other arguments), returns a writable TGraph object. It works, but more things still need to be done (especially TGraphErrors and TGraphAsymErrors support)
def to_TGraph(
fName,
fTitle,
fNpoints,
fX,
fY,
fFunctions, # not used
fHistogram, # not used
fMinimum,
fMaximum,
fLineColor=602,
fLineStyle=1,
fLineWidth=1,
fFillColor=0,
fFillStyle=1001,
fMarkerColor=1,
fMarkerStyle=1,
fMarkerSize=1.0,):
tobject = uproot.models.TObject.Model_TObject.empty()
tnamed = uproot.models.TNamed.Model_TNamed.empty()
tnamed._deeply_writable = True
tnamed._bases.append(tobject)
tnamed._members["fName"] = fName
tnamed._members["fTitle"] = fTitle
tattline = uproot.models.TAtt.Model_TAttLine_v2.empty()
tattline._deeply_writable = True
tattline._members["fLineColor"] = fLineColor
tattline._members["fLineStyle"] = fLineStyle
tattline._members["fLineWidth"] = fLineWidth
tattfill = uproot.models.TAtt.Model_TAttFill_v2.empty()
tattfill._deeply_writable = True
tattfill._members["fFillColor"] = fFillColor
tattfill._members["fFillStyle"] = fFillStyle
tattmarker = uproot.models.TAtt.Model_TAttMarker_v2.empty()
tattmarker._deeply_writable = True
tattmarker._members["fMarkerColor"] = fMarkerColor
tattmarker._members["fMarkerStyle"] = fMarkerStyle
tattmarker._members["fMarkerSize"] = fMarkerSize
tGraph = uproot.models.TGraph.Model_TGraph_v4.empty()
tGraph._bases.append(tnamed)
tGraph._bases.append(tattline)
tGraph._bases.append(tattfill)
tGraph._bases.append(tattmarker)
tGraph._members["fNpoints"] = fNpoints
tGraph._members["fX"] = fX
tGraph._members["fY"] = fY
# fFuncitons - do i need it? probably not because it is array of zero items
# fHistogram - do i need it? probably not because it is a nullptr
tGraph._members["fMinimum"] = fMinimum
tGraph._members["fMaximum"] = fMaximum
return tGraph
Now, the issue is deciding what we want to interpret as a TGraph. After some time of thinking, I came up with a few solutions: (when not explicitly said, all rules that are mentioned for TGraph also apply for TGraphErrors and TGraphAsymErrors)
-
make function that saves to TGraph:
Pros:
- explicit method
- will work with popular datastructure libraries
- can't be confused with anything
- can add some arguments to customize TGraph (like title)
Cons:
- not pythonic
- not within UpRoot convention by asigning using [] (setitem)
-
crate class that will be always interpreted as TGraph (something like TGraphConvertable)
Pros:
- within UpRoot convention by asigning using [] (setitem)
- will work with popular datastructure libraries
- can add some arguments to customize TGraph (like title)
- can't be confused with anything
Cons:
- user has to create specific object
- only this class can be interpreted as TGraph
-
use metadata hiden in some most popular datastructure libraries (eg. numpy.dtype.metadata), when certain flag is set (eg. {"as_TGraph": True}) then object will be interpreted as TGraph
Pros:
- saving to TGraph will work with many popular datastructure libraries
- within UpRoot convention by asigning using [] (setitem)
Cons:
- interfering with internal informations about datastructures
- some of popular datastructure libraries don't have metadata, don't give access to it or access is difficult
- every datastructure librariary have different access to metadata so implementing support for all of them will be tedious
- will require additional function to set this metadata
-
interpret dict/NamedTuple with keys: {'x','y','errors_x_high','errors_x_low','errors_y_high','errors_y_low', ...} as TGraphAsymmErrors, when 'errors_*_low' are set to "None" interpret it as TGraphErrors, when 'errors_*_*' are set to "None" interpret is as TGraph
Pros:
- within UpRoot convention by asigning using [] (setitem)
- it will work with all popular datastructure libraries
- require little effort from user
- can add aditional keys to customize TGraph (eg. {title,...]})
Cons:
- it overlaps with existing TTree/histogram convention
-
point 4. solution with flag "implicit_TGraph_convertion", if set to True point 4. works normally, if set to False no implicit TGraph convertion (require other method, maybe combined with point 1. or/and 2.)) (in my opinion default set to True)
Pros:
- within UpRoot convention by asigning using [] (setitem)
- it will work with all popular datastructure libraries
- require little effort from user
- user can chose between many saving methods
- reduce unintentional overlapping
Cons:
- added hiden complexity
- does not remove overlapping entirely
In my opinion solution number 5. is the best one. It gives the most flexibility with little risk of unintentional conversion.
What do you think? Which solution will suit the best? Or maybe you have your own vision that would be better? I would like to know your opinion!
It can't be a dict because all dicts are already interpreted as TTree (as a set of TBranch names to arrays).
Adding attributes, as in point 5, is somewhat cumbersome—it would have to be extra lines of Python, hard to do interactively—and hidden. You would just need to know the name, and even if you know that there is a special name, you might need to look it up to get the right spelling.
How about a DataFrame? It can be recognized if it has column names like x
, y
, errors_x
, errors_y
, errors_x_high
, errors_x_low
, errors_y_high
, errors_y_low
, with rules for how errors_x_high
and errors_x_low
overrule errors_x
(or raise an error if both are present), and rules for whether to make a TGraph, TGraphErrors, or TGraphAsymmErrors.
DataFrames (just Pandas or all of the new ones, Polars, CuDF, Modin? Can they be recognized in bulk?) are the basic way of describing scatter-point data in Python, and the TGraph* classes are the basic way of describing scatter-point data in ROOT.
Isn't there the same problem with DataFrame as it is with Dictionary? Right now pandas DataFrame is transformed into dictionary and then interpreted as TTree.
https://github.com/scikit-hep/uproot5/blob/0a997cd40e7661850f1d426f596cab615926a854/src/uproot/writing/identify.py#L53-L59
If I understand correctly you suggest to add logic, so whenever there are columns in DataFrame with given names {x, y, errors_x, etc... }
, then recognize DataFrame as TGraph (or TGraphErrors, TGraphAsymmErrors)?
We could catch the DataFrame type before asking if it's a Mapping. This would take away some of the types that are currently being identified as TTrees, but accidentally. There needs to be simple, easy-to-remember criteria for what will become a TTree and what will become a TGraph*, and DataFrame vs other Mapping could be that criteria. Presence of a special attribute is more subtle.
I think that always converting DataFrame into TGraph is a good idea, because usually DataFrame do not have entries of variable size in column (at least in pandas), so they could be rightly interpreted as [x,y] (+errors) coordinates. However wouldn't that change break people existing code that transforms pandas DataFrame into TTree? I would still propose the flag implicit_DataFrame_TGraph_convertion
to give choice to people.
I had forgotten that DataFrame was a major, recommended way to make TTrees (not just an accident of DataFrames being mappings).
Okay.
But still, setting a flag like implicit_DataFrame_TGraph_convertion
on the DataFrame (can't do it to a Python dict) is really obscure. It's hard to memorize; if I type implicitly_DataFrame_TGraph_convertion
, it won't work.
What about this?
f = uproot.recreate("/tmp/stuff.root")
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
f["ttree"] = df
f["tgraph"] = uproot.as_TGraph(df)
Then the user only needs to remember a function named uproot.as_TGraph
, and it might be discoverable by tab-complete. ("What was it? uproot.<tab>
(lots of things), uproot.as_<tab>
, ah, there it is!")
This function could just add the flag, and a DataFrame/namedtuple with that flag is recognized as a TGraph. So it's possible to do both.
Okay there is slight misconception with the flag. What i mean was something like this:
implicit_DataFrame_TGraph_convertion = True
def set_implicit_DataFrame_TGraph_convertion(value: bool):
if not isinstance(value, bool):
raise ValueError("Flag must be boolean")
global implicit_DataFrame_TGraph_convertion
implicit_DataFrame_TGraph_convertion = value
and then in the add_to_directory function
if implicit_DataFrame_TGraph_convertion and uproot._util.is_dataframe(obj):
if uproot._util.from_module(obj, "pandas"):
import pandas
[...]
So it is up to the user what behaviour he wants, implicit DataFrame conversion to TGraph or not. But let's move on.
Now I will refer to my 2. proposition point.
f = uproot.recreate("/tmp/stuff.root")
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
f["ttree"] = df
f["tgraph"] = uproot.as_TGraph(df)
uproot.as_TGraph(df)
would return something like TGraphConvertable
and then this object would be written to file.
On a second thought, there is something like this uproot.models.TGraph.Model_TGraph_v4
https://github.com/scikit-hep/uproot5/blob/0a997cd40e7661850f1d426f596cab615926a854/src/uproot/models/TGraph.py#L29-L344
So uproot.as_TGraph(df)
would just return uproot.models.TGraph.Model_TGraph_v4
and this object would be written to file. Moreover this solution won't require any changes in identify.py because uproot.models.TGraph.Model_TGraph_v4
is Model object, and for those there is already logic in identify.py . At least i think so. I need to doulbe check it but i think this would work.
I didn't realize that you were suggesting a global flag, but we shouldn't do that for multiple reasons. (What if the user wants to write TTrees and TGraphs? What if the user is not using Uproot directly, but through a library and that library changes this state? What if the user is using a library of a library, and one level does it and another doesn't know that it has happened?)
But having uproot.as_TGraph
create a uproot.models.TGraph.Model_TGraph_v4
is a direct way to do it, if you're not interested in the per-instance flag.
Hey @jpivarski! I implemented to_TGraph()
function. I also wrote test for this feature. What should be my next step? Should I create pull request to this branch (jpivarski/start-implementation-of-TGraph-writing) or directly to main?
To simplify the git topology, please open a PR for inclusion into main
.
Is your branch derived from this branch? If so, then I can close this PR in favor of yours and @grzanka's commits should still be in it, so that the final PR will count as being co-authored by the two of you.
Yes I made branch derived from this one