
Database of built-in streamers? (including TMatrixTSym)

Open seophine opened this issue 4 years ago • 17 comments

I am trying to read a TMatrixTSym using uproot version 4 but the all_members attribute gives an empty dictionary. I have checked the root file and it is not empty.

seophine avatar May 13 '21 12:05 seophine

If you include an example file here, I'll try it out.

jpivarski avatar May 13 '21 13:05 jpivarski

correlations.zip

The TMatrixTSym is called: "correlation_matrix"

seophine avatar May 13 '21 14:05 seophine

The problem is that this object is unrecognized, and a dummy (Unknown) object was returned in its place:

>>> rootdir = uproot.open("uproot-issue-359.root")
>>> rootdir["correlation_matrix"]
<Unknown TMatrixTSym<double> at 0x7fa0433e0dc0>
>>> isinstance(rootdir["correlation_matrix"], uproot.model.UnknownClass)
True

The reason for that is that the file doesn't contain any streamers to say how to turn the raw bytes into an object:

>>> [x for x in rootdir.file.streamers if "matrix" in x.lower()]
['TMatrixTBase<double>']

Strangely, there's a streamer for one of TMatrixTSym's superclasses, but not for TMatrixTSym itself.
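The same check can be written as a small helper that reports which class names have no streamer at all. This is only a sketch: `classes_without_streamers` is a hypothetical name, and the plain dict below stands in for `rootdir.file.streamers` (class name → class version → streamer object), so it runs without Uproot or a ROOT file.

```python
def classes_without_streamers(streamers, classnames):
    """Return the class names that have no entry in the streamers mapping."""
    return [name for name in classnames if name not in streamers]


# Stand-in for rootdir.file.streamers; the string replaces a real streamer object.
streamers = {"TMatrixTBase<double>": {5: "<streamer placeholder>"}}

# The object's class and its base class both need streamers to be readable.
needed = ["TMatrixTBase<double>", "TMatrixTSym<double>"]

print(classes_without_streamers(streamers, needed))
```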

This ROOT file is not self-describing; neither Uproot nor ROOT should be able to read the TMatrixTSym object, though ROOT contains a version of all of these classes and might use its built-in TMatrixTSym class definition if the version happens to be the same.

In this case, the class version of TMatrixTSym is 2 and has been since the first appearance of this file, so there is not yet any ROOT version that would fail to read TMatrixTSym specifically, but some classes change more frequently than that.

I don't know how the streamer got dropped from this file but in principle, that's the problem.

You get me wondering if I should make a database of streamers for all versions of every class in ROOT... I'm not sure how I would go about doing that, but it would solve problems like this, since Uproot doesn't (shouldn't!) have a manual reimplementation of every class in the ROOT codebase. Such a database also wouldn't be able to include user-defined classes. If those aren't included in the streamers, no program would be able to read them.

jpivarski avatar May 13 '21 14:05 jpivarski

Hi, I am in the same position as OP, it seems. Trying to use a covariance matrix from a ROOT file of a data release, but the returned object is UnknownClass. Is there any way around this issue? Should I ask the creators of said ROOT file to do something differently?

ast0815 avatar Oct 20 '21 13:10 ast0815

In PyROOT, do

>>> import ROOT
>>> f = ROOT.TFile("path/to/your/file.root")
>>> f.ShowStreamerInfo()

If the class of the object you're trying to access (TMatrixTSym?) has no streamer in the list of streamers that the above should print out, then I'd say that the file is erroneous and we should find out why files are being written without this streamer info. Specifically, it lacks schema evolution, so even if this file can be opened by a specific version of ROOT (because that version of ROOT contains a class definition with the right version), another version of ROOT might not be able to open it.

The title of this issue is about the possibility of adding to Uproot a database of streamer info for all known class-version combinations, which would improve Uproot's ability to read classes that lack streamer info, but it would never be 100%. There may be new versions of ROOT that are unknown to the database, class-version combinations that don't match official releases (because somebody manually compiled ROOT between releases and wrote a file with it), user-defined classes, etc. The only way to completely get it right is to ensure that every class instance in a file has that class's streamer info embedded in the file.

jpivarski avatar Oct 20 '21 13:10 jpivarski

Seems to be the same as OP's file:

>>> f.ShowStreamerInfo()
OBJ: TList	TList	Doubly linked list : 0

StreamerInfo for class: TMatrixTBase<double>, version=5, checksum=0x8b1ac221
  TObject        BASE            offset=  0 type=66 Basic ROOT object   
  Int_t          fNrows          offset=  0 type= 3 number of rows      
  Int_t          fNcols          offset=  0 type= 3 number of columns   
  Int_t          fRowLwb         offset=  0 type= 3 lower bound of the row index
  Int_t          fColLwb         offset=  0 type= 3 lower bound of the col index
  Int_t          fNelems         offset=  0 type= 3 number of elements in matrix
  Int_t          fNrowIndex      offset=  0 type= 3 length of row index array (= fNrows+1) wich is only used for sparse matrices
  double         fTol            offset=  0 type= 8 sqrt(epsilon); epsilon is smallest number number so that  1+epsilon > 1

I asked the people who created the file in question and they do not remember doing anything special. The matrix was probably stored just like you would any other object. Though they said they would try to dig up the script that created the file.

And sorry if this is deviating from the topic of this issue too much. I am happy to open a new one if that would be preferable.

ast0815 avatar Oct 21 '21 16:10 ast0815

This is not getting too far from the thread's original intention.

Putting aside the more general solution of a database of streamers, I did some digging to find out what this class's streamer ought to be. I can't find a way to print it out nicely, as it would be if it were in a file, but on the ROOT prompt,

root [0] auto s = dynamic_cast<TStreamerInfo*>(TClass::GetClass("TMatrixTSym<double>")->GetStreamerInfo())
(TStreamerInfo *) @0x7ffe9ba4eb10
root [1] s->GetNelement()
(int) 2
root [2] s->GetElement(0)->GetName()
(const char *) "TMatrixTBase<double>"
root [3] s->GetElement(0)->GetTypeName()
(const char *) "BASE"
root [4] s->GetElement(1)->GetName()
(const char *) "fElements"
root [5] s->GetElement(1)->GetTypeName()
(const char *) "double*"
root [6] s->GetClassVersion()
(int) 2

TMatrixTSym has two members: (1) the TMatrixTBase, which is in the file, so there's no need to simulate that, and (2) an array of doubles—the actual data. It should be possible to make an Uproot Model by hand (in the src/uproot/models directory) that relies on TMatrixTBase being present.

It might look like Model_TGraph_v4, in that it has one self._bases.append to load the TMatrixTBase as its only base class, and then cursor.array for the fElements. I'm guessing that the length of that array will be fNelems, a member datum of the TMatrixTBase. read_members is the only method that would need a non-trivial implementation; read_member_n, strided_interpretation, awkward_form, and _serialize can all immediately raise exceptions. The name of that class would have to be Model_TMatrixTSym_3c_double_3e__v2. I'm not sure whether the array needs a "speedbump" byte or not; that would have to be determined experimentally.
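The model name above looks like the class name with each run of non-alphanumeric characters replaced by its hex code between underscores, plus a version suffix. The sketch below reproduces that pattern; it is inferred from the single name quoted here, `model_name` is a hypothetical helper, and it is not claimed to be Uproot's own implementation.

```python
import re


def model_name(classname, version=None):
    """Build a Python-safe model name from a C++ class name (assumed scheme)."""

    def hexify(match):
        # e.g. "<" -> "_3c_", ">" -> "_3e_"
        return "_" + match.group(0).encode("utf-8").hex() + "_"

    name = "Model_" + re.sub(r"[^a-zA-Z0-9]+", hexify, classname)
    if version is None:
        return name
    return f"{name}_v{version}"


print(model_name("TMatrixTSym<double>", 2))
```

The doubled underscore before `v2` falls out naturally: one underscore closes the hex code for `>` and one starts the version suffix.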

This would be much easier if the streamer info were just encoded in the file. Actually, let me try to see if it can be done in some semi-automatic way, from PyROOT. I need to recompile ROOT because I upgraded Python, though...

jpivarski avatar Oct 21 '21 17:10 jpivarski

I'm going to give you a technique that may become a feature someday. As it turns out, we can get streamer info from an active ROOT process:

>>> import ROOT
>>> import uproot
>>> streamer_bytes = uproot.pyroot.pyroot_to_buffer(
...     ROOT.TClass.GetClass("TMatrixTSym<double>").GetStreamerInfo()
... )
>>> streamer_bytes
array([ 64,   0,   1, 148, 255, 255, 255, 255,  84,  83, 116, 114, 101,
        97, 109, 101, 114,  73, 110, 102, 111,   0,  64,   0,   1, 126,
         0,   9,  64,   0,   0,  33,   0,   1,   0,   1,   0,   0,   0,
         0,   3,   1,   0,   0,  19,  84,  77,  97, 116, 114, 105, 120,
        84,  83, 121, 109,  60, 100, 111, 117,  98, 108, 101,  62,   0,
       200, 115, 115,  45,   0,   0,   0,   2,  64,   0,   1,  75, 255,
       255, 255, 255,  84,  79,  98, 106,  65, 114, 114,  97, 121,   0,
        64,   0,   1,  57,   0,   3,   0,   1,   0,   0,   0,   0,   2,
         0,   0,   0,   0,   0,   0,   0,   2,   0,   0,   0,   0,  64,
         0,   0, 113, 255, 255, 255, 255,  84,  83, 116, 114, 101,  97,
       109, 101, 114,  66,  97, 115, 101,   0,  64,   0,   0,  91,   0,
         3,  64,   0,   0,  81,   0,   4,  64,   0,   0,  34,   0,   1,
         0,   1,   0,   0,   0,   0,   3,   0,   0,   0,  20,  84,  77,
        97, 116, 114, 105, 120,  84,  66,  97, 115, 101,  60, 100, 111,
       117,  98, 108, 101,  62,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
       139,  26, 194,  33,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   4,  66,  65,  83,  69,   0,   0,   0,   5,  64,
         0,   0, 171, 255, 255, 255, 255,  84,  83, 116, 114, 101,  97,
       109, 101, 114,  66,  97, 115, 105,  99,  80, 111, 105, 110, 116,
       101, 114,   0,  64,   0,   0, 141,   0,   2,  64,   0,   0, 102,
         0,   4,  64,   0,   0,  52,   0,   1,   0,   1,   0,   0,   0,
         0,   3,   0,   0,   0,   9, 102,  69, 108, 101, 109, 101, 110,
       116, 115,  29,  91, 102,  78, 101, 108, 101, 109, 115,  93,  32,
       101, 108, 101, 109, 101, 110, 116, 115,  32, 116, 104, 101, 109,
       115, 101, 108, 118, 101, 115,   0,   0,   0,  48,   0,   0,   0,
         8,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
         0,   0,   0,   7, 100, 111, 117,  98, 108, 101,  42,   0,   0,
         0,   5,   7, 102,  78, 101, 108, 101, 109, 115,  20,  84,  77,
        97, 116, 114, 105, 120,  84,  66,  97, 115, 101,  60, 100, 111,
       117,  98, 108, 101,  62], dtype=uint8)
>>> chunk = uproot.source.chunk.Chunk.wrap(None, streamer_bytes)
>>> cursor = uproot.source.cursor.Cursor(0)
>>> fake_file = uproot.writing._cascade._ReadForUpdate("<none>", None)   # need a class_named method
>>> uproot_streamer = uproot.deserialization.read_object_any(chunk, cursor, {}, fake_file, None, None)
>>> uproot_streamer
<TStreamerInfo for TMatrixTSym<double> version 2 at 0x7f0b70254c40>
>>> uproot_streamer.show()
TMatrixTSym<double> (v2): TMatrixTBase<double> (v5)
    fElements: double* (TStreamerBasicPointer)

Even if you don't have access to ROOT and the final reading process in the same Python, this uproot_streamer can be pickled:

>>> import pickle
>>> pickle.loads(pickle.dumps(uproot_streamer)).show()
TMatrixTSym<double> (v2): TMatrixTBase<double> (v5)
    fElements: double* (TStreamerBasicPointer)

To use it, you'd want to put it into the file that you're reading a TMatrixTSym from:

>>> real_file = uproot.open("...")
>>> real_file.file.streamers
{'TNamed': {1: <TStreamerInfo for TNamed version 1 at 0x7f0b6fdf47f0>},
 'TObject': {1: <TStreamerInfo for TObject version 1 at 0x7f0b6fde7af0>},
 ...
}

The following is untested, because I don't have a file with a TMatrixTSym object in it:

>>> real_file.file.streamers[uproot_streamer.name] = {uproot_streamer.class_version: uproot_streamer}

(It's a dict of dicts: class name → class version → streamer object. If the class name already exists with the wrong version, you'd want to add this version to its dict, not replace the whole dict.)
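A minimal sketch of the "add, don't replace" update, using plain dicts and placeholder strings in place of real streamer objects. `add_streamer` is a hypothetical helper, not part of Uproot.

```python
def add_streamer(streamers, name, version, streamer):
    """Register one class version without discarding other versions of the class."""
    streamers.setdefault(name, {})[version] = streamer


# Stand-in for real_file.file.streamers: class name -> class version -> streamer.
streamers = {"TNamed": {1: "<TNamed v1>"}}

# New class name: a fresh inner dict is created.
add_streamer(streamers, "TMatrixTSym<double>", 2, "<TMatrixTSym v2>")

# Existing class name: the new version is added alongside version 1.
add_streamer(streamers, "TNamed", 2, "<TNamed v2>")

print(streamers)
```

With a real streamer, the call would presumably be `add_streamer(real_file.file.streamers, uproot_streamer.name, uproot_streamer.class_version, uproot_streamer)`, which is as untested as the one-liner above.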

Then you ought to be able to read that TMatrixTSym object, because when the file is serving up the data and checking its streamers to determine how to interpret it, it should find this streamer. Fingers crossed!

Incidentally, if we put the TMatrixTSym Model directly into Uproot, then this is the code we would have had to write:

>>> print(uproot_streamer.class_code())
class Model_TMatrixTSym_3c_double_3e__v2(uproot.model.VersionedModel):
    def read_members(self, chunk, cursor, context, file):
        if self.is_memberwise:
            raise NotImplementedError(
                "memberwise serialization of {0}\nin file {1}".format(type(self).__name__, self.file.file_path)
            )
        self._bases.append(c('TMatrixTBase<double>', 5).read(chunk, cursor, context, file, self._file, self._parent, concrete=self.concrete))
        tmp = self._dtype0
        if context.get('speedbump', True):
            cursor.skip(1)
        self._members['fElements'] = cursor.array(chunk, self.member('fNelems'), tmp, context)
...

Getting the streamer from ROOT saved us the trouble! (Indeed, there is a speedbump byte, and fNelems is the length of the array.)

jpivarski avatar Oct 21 '21 18:10 jpivarski

I tried the method you suggested, but I get an error:

---------------------------------------------------------------------------
DeserializationError                      Traceback (most recent call last)
~/.local/lib/python3.7/site-packages/uproot/reading.py in get(self)
   2484             try:
-> 2485                 out = cls.read(chunk, cursor, context, self._file, selffile, parent)
   2486 

~/.local/lib/python3.7/site-packages/uproot/model.py in read(cls, chunk, cursor, context, file, selffile, parent, concrete)
   1310             versioned_cls.read(
-> 1311                 chunk, cursor, context, file, selffile, parent, concrete=concrete
   1312             ),

~/.local/lib/python3.7/site-packages/uproot/model.py in read(cls, chunk, cursor, context, file, selffile, parent, concrete)
    821 
--> 822             self.read_members(chunk, cursor, context, file)
    823 

<dynamic> in read_members(self, chunk, cursor, context, file)

~/.local/lib/python3.7/site-packages/uproot/source/cursor.py in array(self, chunk, length, dtype, context, move)
    326             self._index = stop
--> 327         return numpy.frombuffer(chunk.get(start, stop, self, context), dtype=dtype)
    328 

~/.local/lib/python3.7/site-packages/uproot/source/chunk.py in get(self, start, stop, cursor, context)
    401                 context,
--> 402                 self._source.file_path,
    403             )

DeserializationError: while reading

    TMatrixTSym<double> version 5 as uproot.dynamic.Model_TMatrixTSym_3c_double_3e__v2 (48 bytes)
        (base): <TMatrixTBase<double> (version 5) at 0x7f914b6266d8>
Base classes for TMatrixTSym<double>: (TMatrixTBase<double>)
Members for TMatrixTSym<double>: fElements

attempting to get bytes 51:440926259
outside expected range 0:3528 for this Chunk
in file DataRelease/covmatrix_noreg.root
in object /covmatrixCbin;1

During handling of the above exception, another exception occurred:

DeserializationError                      Traceback (most recent call last)
<ipython-input-5-1e5bf2aa6a37> in <module>
      2 F = up.open("DataRelease/covmatrix_noreg.root")
      3 fix_streamer(F)
----> 4 cov_unfolded = F["covmatrixCbin"].to_numpy()[0]
      5 cor_unfolded = cov_unfolded / (
      6     np.sqrt(np.diag(cov_unfolded))[:, None] * np.sqrt(np.diag(cov_unfolded))[None, :]

~/.local/lib/python3.7/site-packages/uproot/reading.py in __getitem__(self, where)
   2080 
   2081         else:
-> 2082             return self.key(where).get()
   2083 
   2084     @property

~/.local/lib/python3.7/site-packages/uproot/reading.py in get(self)
   2509                 context = {"breadcrumbs": (), "TKey": self}
   2510 
-> 2511                 out = cls.read(chunk, cursor, context, self._file, selffile, parent)
   2512 
   2513         if self._fClassName not in must_be_attached:

~/.local/lib/python3.7/site-packages/uproot/model.py in read(cls, chunk, cursor, context, file, selffile, parent, concrete)
   1309         return cls.postprocess(
   1310             versioned_cls.read(
-> 1311                 chunk, cursor, context, file, selffile, parent, concrete=concrete
   1312             ),
   1313             chunk,

~/.local/lib/python3.7/site-packages/uproot/model.py in read(cls, chunk, cursor, context, file, selffile, parent, concrete)
    820             )
    821 
--> 822             self.read_members(chunk, cursor, context, file)
    823 
    824             self.hook_after_read_members(

<dynamic> in read_members(self, chunk, cursor, context, file)

~/.local/lib/python3.7/site-packages/uproot/source/cursor.py in array(self, chunk, length, dtype, context, move)
    325         if move:
    326             self._index = stop
--> 327         return numpy.frombuffer(chunk.get(start, stop, self, context), dtype=dtype)
    328 
    329     _u1 = numpy.dtype("u1")

~/.local/lib/python3.7/site-packages/uproot/source/chunk.py in get(self, start, stop, cursor, context)
    400                 cursor.copy(),
    401                 context,
--> 402                 self._source.file_path,
    403             )
    404 

DeserializationError: while reading

    TMatrixTSym<double> version 5 as uproot.dynamic.Model_TMatrixTSym_3c_double_3e__v2 (48 bytes)
        (base): <TMatrixTBase<double> (version 5) at 0x7f914b62ef60>
Base classes for TMatrixTSym<double>: (TMatrixTBase<double>)
Members for TMatrixTSym<double>: fElements

attempting to get bytes 51:440926259
outside expected range 0:3528 for this Chunk
in file DataRelease/covmatrix_noreg.root
in object /covmatrixCbin;1

I attached the ROOT file I am trying to read the matrix from, in case you want to test this yourself: covmatrix_noreg.zip

ast0815 avatar Oct 22 '21 13:10 ast0815

It successfully read the TMatrixTBase and then failed trying to get the fElements, thinking that it needed to read bytes 51 through 440926259 to get it—that upper bound is clearly bogus (unless you have 55115776 double-precision numbers in fElements!). Since the upper bound comes from the TMatrixTBase (fNelems), maybe that was filled and TMatrixTBase has the right number of bytes, but the value is wrong.
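The arithmetic behind that sanity check, using only the numbers in the error message and assuming 8 bytes per double:

```python
start, stop = 51, 440926259   # byte range the reader attempted to fetch
chunk_size = 3528             # "expected range 0:3528 for this Chunk"
itemsize = 8                  # bytes per double-precision number

requested = (stop - start) // itemsize
available = (chunk_size - start) // itemsize

print(requested)   # number of doubles implied by the bogus fNelems
print(available)   # most doubles that could fit in the rest of the chunk
```

The requested count exceeds what the chunk can hold by several orders of magnitude, which is why the length value itself is suspect rather than the data.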

So the automatically generated streamer is not right either, even when we manage to get it directly from ROOT. I'd have to look at the file directly. I'm looking at the file you attached.

jpivarski avatar Oct 22 '21 14:10 jpivarski

I've started #484, which can read the matrices in the file you sent, but it's very weird: it doesn't have a separate header for the TMatrixTSym, as opposed to the TMatrixTBase, which is what got it off-track and garbled the fNelems, causing it to read past the end of the object. However, even when that number is not garbled, it's not the number of serialized values: it's equal to N**2, rather than N*(N + 1)/2. And then those elements are not counted in the number of bytes: the number of bytes corresponds to the TMatrixTBase, rather than the TMatrixTSym.
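To make the size mismatch concrete, here is a small sketch comparing the two counts for a few matrix sizes. The function names are hypothetical; the formulas are the ones quoted in the comment above.

```python
def full_count(n):
    """Number of elements in a full N x N matrix (the value fNelems reports)."""
    return n * n


def packed_count(n):
    """Number of elements on and to one side of the diagonal."""
    return n * (n + 1) // 2


for n in (1, 2, 3, 10):
    print(n, full_count(n), packed_count(n))
```

For any N greater than 1 the two differ, so using fNelems as the array length would request more values than were serialized.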

Your file was written with ROOT 5, which makes me a little wary, but the serialization of these classes hasn't been changed in 16 years. Maybe that's early enough that they don't follow standard conventions for streamers—it seems to be a "custom streamer," and having the streamer in the file doesn't help. (Maybe that's why it was not included in the first place...)

I'd feel more confident if we could test more cases, especially non-symmetric subclasses of TMatrixTBase. Do you have any of those?

jpivarski avatar Oct 22 '21 16:10 jpivarski

The relevant subclasses are TMatrixT, TMatrixTSparse, and TMatrixTSym, and I can guess that they can be specialized with any numerical type, but I think in practice, only double is ever used. PR #484 should work for you, but I'd be more comfortable if we could include tests of all three subclasses. I think this whole system is based on custom streamers.

jpivarski avatar Oct 22 '21 16:10 jpivarski

I'm putting that PR into draft mode so that I don't accidentally merge it. We'd need to be more certain that this is the right thing to do before including it.

jpivarski avatar Oct 27 '21 21:10 jpivarski

Would adding some tests for TMatrixT, TMatrixTSparse, and TMatrixTSym be enough for that? Maybe both with ROOT 5 and ROOT 6 files? Or is the concern more about the principle of the matter?

ast0815 avatar Oct 28 '21 09:10 ast0815

It's not an in-principle thing: just an example of each of the subclasses would do. The changes I had to make for the symmetric matrix were arbitrary—assuming that no changes are needed for the other cases would likely be wrong, and fixing one but not the others would really muddle things.

Also I don't think it has to be both ROOT 5 and 6—the classes look old. I don't think they were changed recently, even on the timescale of the ROOT 5 → 6 transition.

jpivarski avatar Oct 28 '21 11:10 jpivarski

I just tried to load the matrix with the PR branch of uproot 4, and it seems to work as expected.

Though now I have the problem that the resulting object does not have a to_numpy method. How do I actually access the matrix elements?

ast0815 avatar Nov 30 '21 18:11 ast0815

The raw values are in obj.member("fElements"), but these haven't been corrected for the symmetric-matrix packing (only the diagonal and above—or maybe below—are stored in that 1-dimensional array). A to_numpy method that presents it as a 2-dimensional matrix would be useful; it would have to be written as a member of the

https://github.com/scikit-hep/uproot4/blob/73f494980f310756d40a736030f0c96f7610f016/src/uproot/models/TMatrixT.py#L18

class, and similar for the TMatrixT and TMatrixTSparse cases.
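Until such a method exists, the packed values can be expanded by hand. The sketch below is not the Uproot implementation: `unpack_symmetric` is a hypothetical helper, and it assumes the values are stored row by row, either the diagonal and above (`upper=True`) or the diagonal and below (`upper=False`). Which of the two applies to a given file would have to be checked against known values. It uses plain lists so that it runs without NumPy.

```python
def unpack_symmetric(packed, n, upper=True):
    """Expand N*(N+1)/2 packed values into a full symmetric N x N matrix."""
    expected = n * (n + 1) // 2
    if len(packed) != expected:
        raise ValueError(f"expected {expected} values for n={n}, got {len(packed)}")

    full = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        # Columns covered by row i in the assumed packing order.
        cols = range(i, n) if upper else range(0, i + 1)
        for j in cols:
            full[i][j] = packed[k]
            full[j][i] = packed[k]  # mirror across the diagonal
            k += 1
    return full


packed = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]   # made-up values for a 3 x 3 matrix
for row in unpack_symmetric(packed, 3):
    print(row)
```

In practice `packed` would come from `obj.member("fElements")` and `n` from the base class's `fNrows`; the result can be passed to `numpy.array` if a NumPy matrix is needed.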

jpivarski avatar Nov 30 '21 18:11 jpivarski