uproot5 feat: implement TGraph-writing

feat: implement TGraph-writing

Open jpivarski opened this issue 6 months ago • 2 comments

This is a draft. It needs TGraphErrors, TGraphAsymmErrors, tests of writing to actual files and reading them back with ROOT, and maybe some high-level interface so that we can

output_file["name"] = XYZ

for a Python object XYZ that would be interpreted as a TGraph (or TGraphErrors, or TGraphAsymmErrors). What should that XYZ be? Not raw NumPy arrays, since a tuple of NumPy arrays is already interpreted as a histogram. Does it need to be some kind of uproot.make_TGraph function? It would be nice to have something more Pythonic.

Should any Pandas DataFrame with columns named x and y be interpreted as a TGraph? A Pandas DataFrame would supply a title.

Thoughts, @grzanka?

Feb 23 '24 18:02 jpivarski

Should any Pandas DataFrame with columns named x and y be interpreted as a TGraph? A Pandas DataFrame would supply a title.

Thoughts, @grzanka?

Another candidate may be a python dataclass with x and y fields containing arrays of numbers ? Or numpy recarray ?

Feb 24 '24 11:02 grzanka

As discussed here: https://github.com/scikit-hep/uproot5/discussions/1142#discussioncomment-8590182 we may create confusion by considering an object that can be converted both to TTree and TGraph.

Feb 27 '24 09:02 grzanka

Hello @jpivarski! I am a student of @grzanka, and I would like to join to this issue. I have begun to understand how different objects are transformed so they can be saved to a root file, and I believe I got good grasp of it. I have started to implement prototype of a function to_TGraph(), which, when given two arrays of values (and some other arguments), returns a writable TGraph object. It works, but more things still need to be done (especially TGraphErrors and TGraphAsymErrors support)

def to_TGraph(
    fName,
    fTitle,

    fNpoints,
    fX,
    fY,
    fFunctions,  # not used
    fHistogram,  # not used
    fMinimum, 
    fMaximum,

    fLineColor=602,
    fLineStyle=1,
    fLineWidth=1,
    fFillColor=0,
    fFillStyle=1001,
    fMarkerColor=1,
    fMarkerStyle=1,
    fMarkerSize=1.0,):
    
    tobject = uproot.models.TObject.Model_TObject.empty()

    tnamed = uproot.models.TNamed.Model_TNamed.empty()
    tnamed._deeply_writable = True
    tnamed._bases.append(tobject)
    tnamed._members["fName"] = fName
    tnamed._members["fTitle"] = fTitle

    tattline = uproot.models.TAtt.Model_TAttLine_v2.empty()
    tattline._deeply_writable = True
    tattline._members["fLineColor"] = fLineColor
    tattline._members["fLineStyle"] = fLineStyle
    tattline._members["fLineWidth"] = fLineWidth

    tattfill = uproot.models.TAtt.Model_TAttFill_v2.empty()
    tattfill._deeply_writable = True
    tattfill._members["fFillColor"] = fFillColor
    tattfill._members["fFillStyle"] = fFillStyle

    tattmarker = uproot.models.TAtt.Model_TAttMarker_v2.empty()
    tattmarker._deeply_writable = True
    tattmarker._members["fMarkerColor"] = fMarkerColor
    tattmarker._members["fMarkerStyle"] = fMarkerStyle
    tattmarker._members["fMarkerSize"] = fMarkerSize


    tGraph = uproot.models.TGraph.Model_TGraph_v4.empty()

    tGraph._bases.append(tnamed)
    tGraph._bases.append(tattline)
    tGraph._bases.append(tattfill)
    tGraph._bases.append(tattmarker)
    

    tGraph._members["fNpoints"] = fNpoints
    tGraph._members["fX"] = fX
    tGraph._members["fY"] = fY
    # fFuncitons - do i need it? probably not because it is array of zero items
    # fHistogram - do i need it? probably not because it is a nullptr
    tGraph._members["fMinimum"] = fMinimum
    tGraph._members["fMaximum"] = fMaximum

    return tGraph

Now, the issue is deciding what we want to interpret as a TGraph. After some time of thinking, I came up with a few solutions: (when not explicitly said, all rules that are mentioned for TGraph also apply for TGraphErrors and TGraphAsymErrors)

make function that saves to TGraph:

Pros:
- explicit method
- will work with popular datastructure libraries
- can't be confused with anything
- can add some arguments to customize TGraph (like title)
Cons:
- not pythonic
- not within UpRoot convention by asigning using [] (setitem)
crate class that will be always interpreted as TGraph (something like TGraphConvertable)

Pros:
- within UpRoot convention by asigning using [] (setitem)
- will work with popular datastructure libraries
- can add some arguments to customize TGraph (like title)
- can't be confused with anything
Cons:
- user has to create specific object
- only this class can be interpreted as TGraph
use metadata hiden in some most popular datastructure libraries (eg. numpy.dtype.metadata), when certain flag is set (eg. {"as_TGraph": True}) then object will be interpreted as TGraph

Pros:
- saving to TGraph will work with many popular datastructure libraries
- within UpRoot convention by asigning using [] (setitem)
Cons:
- interfering with internal informations about datastructures
- some of popular datastructure libraries don't have metadata, don't give access to it or access is difficult
- every datastructure librariary have different access to metadata so implementing support for all of them will be tedious
- will require additional function to set this metadata
interpret dict/NamedTuple with keys: {'x','y','errors_x_high','errors_x_low','errors_y_high','errors_y_low', ...} as TGraphAsymmErrors, when 'errors_*_low' are set to "None" interpret it as TGraphErrors, when 'errors_*_*' are set to "None" interpret is as TGraph

Pros:
- within UpRoot convention by asigning using [] (setitem)
- it will work with all popular datastructure libraries
- require little effort from user
- can add aditional keys to customize TGraph (eg. {title,...]})
Cons:
- it overlaps with existing TTree/histogram convention
point 4. solution with flag "implicit_TGraph_convertion", if set to True point 4. works normally, if set to False no implicit TGraph convertion (require other method, maybe combined with point 1. or/and 2.)) (in my opinion default set to True)

Pros:
- within UpRoot convention by asigning using [] (setitem)
- it will work with all popular datastructure libraries
- require little effort from user
- user can chose between many saving methods
- reduce unintentional overlapping
Cons:
- added hiden complexity
- does not remove overlapping entirely

In my opinion solution number 5. is the best one. It gives the most flexibility with little risk of unintentional conversion.

What do you think? Which solution will suit the best? Or maybe you have your own vision that would be better? I would like to know your opinion!

Jul 23 '24 13:07 Pepesob

It can't be a dict because all dicts are already interpreted as TTree (as a set of TBranch names to arrays).

Adding attributes, as in point 5, is somewhat cumbersome—it would have to be extra lines of Python, hard to do interactively—and hidden. You would just need to know the name, and even if you know that there is a special name, you might need to look it up to get the right spelling.

How about a DataFrame? It can be recognized if it has column names like x, y, errors_x, errors_y, errors_x_high, errors_x_low, errors_y_high, errors_y_low, with rules for how errors_x_high and errors_x_low overrule errors_x (or raise an error if both are present), and rules for whether to make a TGraph, TGraphErrors, or TGraphAsymmErrors.

DataFrames (just Pandas or all of the new ones, Polars, CuDF, Modin? Can they be recognized in bulk?) are the basic way of describing scatter-point data in Python, and the TGraph* classes are the basic way of describing scatter-point data in ROOT.

Jul 23 '24 13:07 jpivarski

Isn't there the same problem with DataFrame as it is with Dictionary? Right now pandas DataFrame is transformed into dictionary and then interpreted as TTree. https://github.com/scikit-hep/uproot5/blob/0a997cd40e7661850f1d426f596cab615926a854/src/uproot/writing/identify.py#L53-L59 If I understand correctly you suggest to add logic, so whenever there are columns in DataFrame with given names {x, y, errors_x, etc... }, then recognize DataFrame as TGraph (or TGraphErrors, TGraphAsymmErrors)?

Jul 23 '24 14:07 Pepesob

We could catch the DataFrame type before asking if it's a Mapping. This would take away some of the types that are currently being identified as TTrees, but accidentally. There needs to be simple, easy-to-remember criteria for what will become a TTree and what will become a TGraph*, and DataFrame vs other Mapping could be that criteria. Presence of a special attribute is more subtle.

Jul 23 '24 14:07 jpivarski

I think that always converting DataFrame into TGraph is a good idea, because usually DataFrame do not have entries of variable size in column (at least in pandas), so they could be rightly interpreted as [x,y] (+errors) coordinates. However wouldn't that change break people existing code that transforms pandas DataFrame into TTree? I would still propose the flag implicit_DataFrame_TGraph_convertion to give choice to people.

Jul 23 '24 14:07 Pepesob

I had forgotten that DataFrame was a major, recommended way to make TTrees (not just an accident of DataFrames being mappings).

Okay.

But still, setting a flag like implicit_DataFrame_TGraph_convertion on the DataFrame (can't do it to a Python dict) is really obscure. It's hard to memorize; if I type implicitly_DataFrame_TGraph_convertion, it won't work.

What about this?

f = uproot.recreate("/tmp/stuff.root")
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
f["ttree"] = df
f["tgraph"] = uproot.as_TGraph(df)

Then the user only needs to remember a function named uproot.as_TGraph, and it might be discoverable by tab-complete. ("What was it? uproot.<tab> (lots of things), uproot.as_<tab>, ah, there it is!")

This function could just add the flag, and a DataFrame/namedtuple with that flag is recognized as a TGraph. So it's possible to do both.

Jul 23 '24 15:07 jpivarski

Okay there is slight misconception with the flag. What i mean was something like this:

implicit_DataFrame_TGraph_convertion = True

def set_implicit_DataFrame_TGraph_convertion(value: bool):
    if not isinstance(value, bool):
        raise ValueError("Flag must be boolean")
    
    global implicit_DataFrame_TGraph_convertion
    implicit_DataFrame_TGraph_convertion = value

and then in the add_to_directory function

if implicit_DataFrame_TGraph_convertion and uproot._util.is_dataframe(obj):
    if uproot._util.from_module(obj, "pandas"):
        import pandas
        [...]

So it is up to the user what behaviour he wants, implicit DataFrame conversion to TGraph or not. But let's move on.

Now I will refer to my 2. proposition point.

f = uproot.recreate("/tmp/stuff.root")
df = pd.DataFrame({"x": [1, 2, 3], "y": [4, 5, 6]})
f["ttree"] = df
f["tgraph"] = uproot.as_TGraph(df)

uproot.as_TGraph(df) would return something like TGraphConvertable and then this object would be written to file.

On a second thought, there is something like this uproot.models.TGraph.Model_TGraph_v4 https://github.com/scikit-hep/uproot5/blob/0a997cd40e7661850f1d426f596cab615926a854/src/uproot/models/TGraph.py#L29-L344

So uproot.as_TGraph(df) would just return uproot.models.TGraph.Model_TGraph_v4 and this object would be written to file. Moreover this solution won't require any changes in identify.py because uproot.models.TGraph.Model_TGraph_v4 is Model object, and for those there is already logic in identify.py . At least i think so. I need to doulbe check it but i think this would work.

Jul 23 '24 16:07 Pepesob

I didn't realize that you were suggesting a global flag, but we shouldn't do that for multiple reasons. (What if the user wants to write TTrees and TGraphs? What if the user is not using Uproot directly, but through a library and that library changes this state? What if the user is using a library of a library, and one level does it and another doesn't know that it has happened?)

But having uproot.as_TGraph create a uproot.models.TGraph.Model_TGraph_v4 is a direct way to do it, if you're not interested in the per-instance flag.

Jul 23 '24 16:07 jpivarski

Hey @jpivarski! I implemented to_TGraph() function. I also wrote test for this feature. What should be my next step? Should I create pull request to this branch (jpivarski/start-implementation-of-TGraph-writing) or directly to main?

Jul 24 '24 18:07 Pepesob

To simplify the git topology, please open a PR for inclusion into main.

Is your branch derived from this branch? If so, then I can close this PR in favor of yours and @grzanka's commits should still be in it, so that the final PR will count as being co-authored by the two of you.

Jul 24 '24 18:07 jpivarski

Yes I made branch derived from this one

Jul 24 '24 18:07 Pepesob

uproot5 uproot5 copied to clipboard

feat: implement TGraph-writing

uproot5
uproot5 copied to clipboard