Database of built-in streamers? (including TMatrixTSym)
I am trying to read a TMatrixTSym using uproot version 4 but the all_members attribute gives an empty dictionary. I have checked the root file and it is not empty.
If you include an example file here, I'll try it out.
The problem is that this object is unrecognized, and a dummy (Unknown) object was returned in its place:
>>> rootdir = uproot.open("uproot-issue-359.root")
>>> rootdir["correlation_matrix"]
<Unknown TMatrixTSym<double> at 0x7fa0433e0dc0>
>>> isinstance(rootdir["correlation_matrix"], uproot.model.UnknownClass)
True
The reason for that is that the file doesn't contain any streamers to say how to turn the raw bytes into an object:
>>> [x for x in rootdir.file.streamers if "matrix" in x.lower()]
['TMatrixTBase<double>']
Strangely, there's a streamer for one of TMatrixTSym's superclasses, but not for TMatrixTSym itself.
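To make that check a little more systematic, here is a minimal sketch that works on a plain dict shaped like the streamers table (class name → {version: streamer}). The helper name missing_streamers and the placeholder values are made up for illustration; they are not part of Uproot.

```python
def missing_streamers(streamers, needed):
    """Return the class names in `needed` that have no entry in `streamers`."""
    return [name for name in needed if name not in streamers]


# Stand-in for the file's streamer table; the value is a placeholder string,
# not a real TStreamerInfo object.
streamers = {"TMatrixTBase<double>": {5: "<streamer placeholder>"}}

# The object's own class plus its base class both need a streamer.
needed = ["TMatrixTBase<double>", "TMatrixTSym<double>"]
print(missing_streamers(streamers, needed))
```

With the table from this file, only the base class is covered, so the subclass shows up as missing.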
This ROOT file is not self-describing; neither Uproot nor ROOT should be able to read the TMatrixTSym object, though ROOT contains a version of all of these classes and might use its built-in TMatrixTSym class definition if the version happens to be the same.
In this case, the class version of TMatrixTSym is 2 and has been since the first appearance of this file, so there is not yet any ROOT version that would fail to read TMatrixTSym specifically, but some classes change more frequently than that.
I don't know how the streamer got dropped from this file but in principle, that's the problem.
You get me wondering if I should make a database of streamers for all versions of every class in ROOT... I'm not sure how I would go about doing that, but it would solve problems like this, since Uproot doesn't (shouldn't!) have a manual reimplementation of every class in the ROOT codebase. Such a database also wouldn't be able to include user-defined classes. If those aren't included in the streamers, no program would be able to read them.
Hi, I am in the same position as OP, it seems. Trying to use a covariance matrix from a ROOT file of a data release, but the returned object is UnknownClass. Is there any way around this issue? Should I ask the creators of said ROOT file to do something differently?
In PyROOT, do
>>> import ROOT
>>> f = ROOT.TFile("path/to/your/file.root")
>>> f.ShowStreamerInfo()
If the class of the object you're trying to access (TMatrixTSym?) has no streamer in the list of streamers that the above should print out, then I'd say that the file is erroneous and we should find out why files are being written without this streamer info. Specifically, it lacks schema evolution, so even if this file can be opened by a specific version of ROOT (because that version of ROOT contains a class definition with the right version), another version of ROOT might not be able to open it.
The title of this issue is about the possibility of adding to Uproot a database of streamer info for all known class-version combinations, which would improve Uproot's ability to read classes that lack streamer info, but it would never be 100%. There may be new versions of ROOT that are unknown to the database, class-version combinations that don't match official releases (because somebody manually compiled ROOT between releases and wrote a file with it), user-defined classes, etc. The only way to completely get it right is to ensure that every class instance in a file has that class's streamer info embedded in the file.
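A rough sketch of how such a database could be consulted, assuming it is only a fallback behind the streamers embedded in the file. Nothing like this exists in Uproot as far as this thread goes; find_streamer, builtin_db, and the placeholder strings are hypothetical.

```python
def find_streamer(class_name, version, file_streamers, builtin_db):
    """Prefer the streamer embedded in the file, then a built-in table, else None."""
    for table in (file_streamers, builtin_db):
        streamer = table.get(class_name, {}).get(version)
        if streamer is not None:
            return streamer
    return None


# Both tables are dicts of dicts: class name -> class version -> streamer.
file_streamers = {"TMatrixTBase<double>": {5: "from file"}}
builtin_db = {"TMatrixTSym<double>": {2: "from database"}}

print(find_streamer("TMatrixTBase<double>", 5, file_streamers, builtin_db))
print(find_streamer("TMatrixTSym<double>", 2, file_streamers, builtin_db))
# A user-defined class is in neither table, so nothing can be returned for it.
print(find_streamer("MyUserClass", 1, file_streamers, builtin_db))
```

The last lookup is the case a database can never cover: a class (or class version) that no official release knows about.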
Seems to be the same as OP's file:
>>> f.ShowStreamerInfo()
OBJ: TList TList Doubly linked list : 0
StreamerInfo for class: TMatrixTBase<double>, version=5, checksum=0x8b1ac221
TObject BASE offset= 0 type=66 Basic ROOT object
Int_t fNrows offset= 0 type= 3 number of rows
Int_t fNcols offset= 0 type= 3 number of columns
Int_t fRowLwb offset= 0 type= 3 lower bound of the row index
Int_t fColLwb offset= 0 type= 3 lower bound of the col index
Int_t fNelems offset= 0 type= 3 number of elements in matrix
Int_t fNrowIndex offset= 0 type= 3 length of row index array (= fNrows+1) wich is only used for sparse matrices
double fTol offset= 0 type= 8 sqrt(epsilon); epsilon is smallest number number so that 1+epsilon > 1
I asked the people who created the file in question and they do not remember doing anything special. The matrix was probably stored just like you would any other object. Though they said they would try to dig up the script that created the file.
And sorry if this is deviating from the topic of this issue too much. I am happy to open a new one if that would be preferable.
This is not getting too far from the thread's original intention.
Putting aside the more general solution of a database of streamers, I did some digging to find out what this class's streamer ought to be. I can't find a way to print it out nicely, as it would be if it were in a file, but on the ROOT prompt,
root [0] auto s = dynamic_cast<TStreamerInfo*>(TClass::GetClass("TMatrixTSym<double>")->GetStreamerInfo())
(TStreamerInfo *) @0x7ffe9ba4eb10
root [1] s->GetNelement()
(int) 2
root [2] s->GetElement(0)->GetName()
(const char *) "TMatrixTBase<double>"
root [3] s->GetElement(0)->GetTypeName()
(const char *) "BASE"
root [4] s->GetElement(1)->GetName()
(const char *) "fElements"
root [5] s->GetElement(1)->GetTypeName()
(const char *) "double*"
root [6] s->GetClassVersion()
(int) 2
A hand-written model for TMatrixTSym might look like Model_TGraph_v4, in that it has one self._bases.append to load the TMatrixTBase as its only base class, and then cursor.array for the fElements. I'm guessing that the length of that array will be fNelems, a member datum of the TMatrixTBase. read_members is the only method that would need a non-trivial implementation; read_member_n, strided_interpretation, awkward_form, and _serialize can all immediately raise exceptions. The name of that class would have to be Model_TMatrixTSym_3c_double_3e__v2. I'm not sure whether the array needs a "speedbump" byte or not; that would have to be determined experimentally.
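The model name follows a visible pattern: characters that can't appear in a Python identifier are replaced by their hex codes between underscores (< is 0x3c, > is 0x3e), and the class version is appended. Below is a sketch that reproduces the two names mentioned here; encode_model_name is an illustrative helper, and the exact rule Uproot uses for other characters may differ.

```python
import re


def encode_model_name(classname, version):
    """Build a Python-safe model class name from a C++ class name and version."""
    def to_hex(match):
        # Replace each run of non-alphanumeric characters with _<hex bytes>_.
        return "_" + match.group(0).encode("utf-8").hex() + "_"

    return "Model_" + re.sub(r"[^a-zA-Z0-9]+", to_hex, classname) + "_v" + str(version)


print(encode_model_name("TGraph", 4))
print(encode_model_name("TMatrixTSym<double>", 2))
```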
This would be much easier if the streamer info were just encoded in the file. Actually, let me try to see if it can be done in some semi-automatic way, from PyROOT. I need to recompile ROOT because I upgraded Python, though...
I'm going to give you a technique that may become a feature someday. As it turns out, we can get streamer info from an active ROOT process:
>>> import ROOT
>>> import uproot
>>> streamer_bytes = uproot.pyroot.pyroot_to_buffer(
... ROOT.TClass.GetClass("TMatrixTSym<double>").GetStreamerInfo()
... )
>>> streamer_bytes
array([ 64, 0, 1, 148, 255, 255, 255, 255, 84, 83, 116, 114, 101,
97, 109, 101, 114, 73, 110, 102, 111, 0, 64, 0, 1, 126,
0, 9, 64, 0, 0, 33, 0, 1, 0, 1, 0, 0, 0,
0, 3, 1, 0, 0, 19, 84, 77, 97, 116, 114, 105, 120,
84, 83, 121, 109, 60, 100, 111, 117, 98, 108, 101, 62, 0,
200, 115, 115, 45, 0, 0, 0, 2, 64, 0, 1, 75, 255,
255, 255, 255, 84, 79, 98, 106, 65, 114, 114, 97, 121, 0,
64, 0, 1, 57, 0, 3, 0, 1, 0, 0, 0, 0, 2,
0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 64,
0, 0, 113, 255, 255, 255, 255, 84, 83, 116, 114, 101, 97,
109, 101, 114, 66, 97, 115, 101, 0, 64, 0, 0, 91, 0,
3, 64, 0, 0, 81, 0, 4, 64, 0, 0, 34, 0, 1,
0, 1, 0, 0, 0, 0, 3, 0, 0, 0, 20, 84, 77,
97, 116, 114, 105, 120, 84, 66, 97, 115, 101, 60, 100, 111,
117, 98, 108, 101, 62, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
139, 26, 194, 33, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 4, 66, 65, 83, 69, 0, 0, 0, 5, 64,
0, 0, 171, 255, 255, 255, 255, 84, 83, 116, 114, 101, 97,
109, 101, 114, 66, 97, 115, 105, 99, 80, 111, 105, 110, 116,
101, 114, 0, 64, 0, 0, 141, 0, 2, 64, 0, 0, 102,
0, 4, 64, 0, 0, 52, 0, 1, 0, 1, 0, 0, 0,
0, 3, 0, 0, 0, 9, 102, 69, 108, 101, 109, 101, 110,
116, 115, 29, 91, 102, 78, 101, 108, 101, 109, 115, 93, 32,
101, 108, 101, 109, 101, 110, 116, 115, 32, 116, 104, 101, 109,
115, 101, 108, 118, 101, 115, 0, 0, 0, 48, 0, 0, 0,
8, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,
0, 0, 0, 7, 100, 111, 117, 98, 108, 101, 42, 0, 0,
0, 5, 7, 102, 78, 101, 108, 101, 109, 115, 20, 84, 77,
97, 116, 114, 105, 120, 84, 66, 97, 115, 101, 60, 100, 111,
117, 98, 108, 101, 62], dtype=uint8)
>>> chunk = uproot.source.chunk.Chunk.wrap(None, streamer_bytes)
>>> cursor = uproot.source.cursor.Cursor(0)
>>> fake_file = uproot.writing._cascade._ReadForUpdate("<none>", None) # need a class_named method
>>> uproot_streamer = uproot.deserialization.read_object_any(chunk, cursor, {}, fake_file, None, None)
>>> uproot_streamer
<TStreamerInfo for TMatrixTSym<double> version 2 at 0x7f0b70254c40>
>>> uproot_streamer.show()
TMatrixTSym<double> (v2): TMatrixTBase<double> (v5)
fElements: double* (TStreamerBasicPointer)
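That byte array isn't opaque, by the way. As I understand ROOT's object framing, it starts with a 4-byte big-endian byte count carrying a 0x40000000 flag, then a 0xFFFFFFFF "new class" tag, then the null-terminated class name. A minimal sketch that decodes only the first 22 bytes shown above (the constant names are mine; this is not how Uproot parses it internally):

```python
import struct

K_BYTE_COUNT_MASK = 0x40000000  # flag bit marking the word as a byte count
K_NEW_CLASS_TAG = 0xFFFFFFFF    # tag announcing that a class name follows

# First 22 bytes of streamer_bytes, copied from the output above.
header = bytes([64, 0, 1, 148, 255, 255, 255, 255,
                84, 83, 116, 114, 101, 97, 109, 101, 114, 73, 110, 102, 111, 0])

raw_count, tag = struct.unpack(">II", header[:8])
num_bytes = raw_count & ~K_BYTE_COUNT_MASK

# The class name runs from byte 8 up to the first null byte.
class_name = header[8:header.index(0, 8)].decode("ascii")

print(num_bytes, hex(tag), class_name)
```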
Even if you don't have access to ROOT and the final reading process in the same Python, this uproot_streamer can be pickled:
>>> import pickle
>>> pickle.loads(pickle.dumps(uproot_streamer)).show()
TMatrixTSym<double> (v2): TMatrixTBase<double> (v5)
fElements: double* (TStreamerBasicPointer)
To use it, you'd want to put it into the file that you're reading a TMatrixTSym from:
>>> real_file = uproot.open("...")
>>> real_file.file.streamers
{'TNamed': {1: <TStreamerInfo for TNamed version 1 at 0x7f0b6fdf47f0>},
'TObject': {1: <TStreamerInfo for TObject version 1 at 0x7f0b6fde7af0>},
...
}
The following is untested, because I don't have a file with a TMatrixTSym object in it:
>>> real_file.file.streamers[uproot_streamer.name] = {uproot_streamer.class_version: uproot_streamer}
(It's a dict of dicts: class name → class version → streamer object. If the class name already exists with the wrong version, you'd want to add this version to its dict, not replace the whole dict.)
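A small sketch of the "add, don't replace" version, using a plain dict and placeholder strings instead of real streamer objects. add_streamer is a hypothetical helper, not an Uproot function.

```python
def add_streamer(streamers, name, version, streamer):
    """Insert one class-version entry without discarding other versions of that class."""
    streamers.setdefault(name, {})[version] = streamer
    return streamers


# Pretend the file already has some other version of the class
# (version 1 here is made up for the example).
streamers = {"TMatrixTSym<double>": {1: "existing v1 streamer"}}

add_streamer(streamers, "TMatrixTSym<double>", 2, "new v2 streamer")
print(sorted(streamers["TMatrixTSym<double>"]))
```

With the real objects, the arguments would be uproot_streamer.name, uproot_streamer.class_version, and uproot_streamer.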
Then you ought to be able to read that TMatrixTSym object, because when the file is serving up the data and checking its streamers to determine how to interpret it, it should find this streamer. Fingers crossed!
Incidentally, if we put the TMatrixTSym Model directly into Uproot, then this is the code we would have had to write:
>>> print(uproot_streamer.class_code())
class Model_TMatrixTSym_3c_double_3e__v2(uproot.model.VersionedModel):
def read_members(self, chunk, cursor, context, file):
if self.is_memberwise:
raise NotImplementedError(
"memberwise serialization of {0}\nin file {1}".format(type(self).__name__, self.file.file_path)
)
self._bases.append(c('TMatrixTBase<double>', 5).read(chunk, cursor, context, file, self._file, self._parent, concrete=self.concrete))
tmp = self._dtype0
if context.get('speedbump', True):
cursor.skip(1)
self._members['fElements'] = cursor.array(chunk, self.member('fNelems'), tmp, context)
...
Getting the streamer from ROOT saved us the trouble! (Indeed, there is a speedbump byte, and fNelems is the length of the array.)
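For anyone curious what the last two lines of that generated code amount to, here is a standalone sketch with the struct module: skip one speedbump byte, then read fNelems big-endian doubles. The function name and the toy payload are made up; the real code goes through Uproot's Cursor and Chunk.

```python
import struct


def read_speedbump_array(buffer, offset, count, speedbump=True):
    """Read `count` big-endian float64 values, optionally skipping one leading byte."""
    if speedbump:
        offset += 1  # the extra byte ROOT puts in front of the array
    stop = offset + 8 * count
    values = struct.unpack(">" + str(count) + "d", buffer[offset:stop])
    return list(values), stop


# Toy payload: one speedbump byte followed by three doubles.
payload = b"\x01" + struct.pack(">3d", 1.0, 2.5, -4.0)

values, end = read_speedbump_array(payload, 0, 3)
print(values, end)
```

Note that the array length comes entirely from the caller, which is why a wrong fNelems sends the read far outside the buffer.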
I tried the method you suggested, but I get an error:
---------------------------------------------------------------------------
DeserializationError Traceback (most recent call last)
~/.local/lib/python3.7/site-packages/uproot/reading.py in get(self)
2484 try:
-> 2485 out = cls.read(chunk, cursor, context, self._file, selffile, parent)
2486
~/.local/lib/python3.7/site-packages/uproot/model.py in read(cls, chunk, cursor, context, file, selffile, parent, concrete)
1310 versioned_cls.read(
-> 1311 chunk, cursor, context, file, selffile, parent, concrete=concrete
1312 ),
~/.local/lib/python3.7/site-packages/uproot/model.py in read(cls, chunk, cursor, context, file, selffile, parent, concrete)
821
--> 822 self.read_members(chunk, cursor, context, file)
823
<dynamic> in read_members(self, chunk, cursor, context, file)
~/.local/lib/python3.7/site-packages/uproot/source/cursor.py in array(self, chunk, length, dtype, context, move)
326 self._index = stop
--> 327 return numpy.frombuffer(chunk.get(start, stop, self, context), dtype=dtype)
328
~/.local/lib/python3.7/site-packages/uproot/source/chunk.py in get(self, start, stop, cursor, context)
401 context,
--> 402 self._source.file_path,
403 )
DeserializationError: while reading
TMatrixTSym<double> version 5 as uproot.dynamic.Model_TMatrixTSym_3c_double_3e__v2 (48 bytes)
(base): <TMatrixTBase<double> (version 5) at 0x7f914b6266d8>
Base classes for TMatrixTSym<double>: (TMatrixTBase<double>)
Members for TMatrixTSym<double>: fElements
attempting to get bytes 51:440926259
outside expected range 0:3528 for this Chunk
in file DataRelease/covmatrix_noreg.root
in object /covmatrixCbin;1
During handling of the above exception, another exception occurred:
DeserializationError Traceback (most recent call last)
<ipython-input-5-1e5bf2aa6a37> in <module>
2 F = up.open("DataRelease/covmatrix_noreg.root")
3 fix_streamer(F)
----> 4 cov_unfolded = F["covmatrixCbin"].to_numpy()[0]
5 cor_unfolded = cov_unfolded / (
6 np.sqrt(np.diag(cov_unfolded))[:, None] * np.sqrt(np.diag(cov_unfolded))[None, :]
~/.local/lib/python3.7/site-packages/uproot/reading.py in __getitem__(self, where)
2080
2081 else:
-> 2082 return self.key(where).get()
2083
2084 @property
~/.local/lib/python3.7/site-packages/uproot/reading.py in get(self)
2509 context = {"breadcrumbs": (), "TKey": self}
2510
-> 2511 out = cls.read(chunk, cursor, context, self._file, selffile, parent)
2512
2513 if self._fClassName not in must_be_attached:
~/.local/lib/python3.7/site-packages/uproot/model.py in read(cls, chunk, cursor, context, file, selffile, parent, concrete)
1309 return cls.postprocess(
1310 versioned_cls.read(
-> 1311 chunk, cursor, context, file, selffile, parent, concrete=concrete
1312 ),
1313 chunk,
~/.local/lib/python3.7/site-packages/uproot/model.py in read(cls, chunk, cursor, context, file, selffile, parent, concrete)
820 )
821
--> 822 self.read_members(chunk, cursor, context, file)
823
824 self.hook_after_read_members(
<dynamic> in read_members(self, chunk, cursor, context, file)
~/.local/lib/python3.7/site-packages/uproot/source/cursor.py in array(self, chunk, length, dtype, context, move)
325 if move:
326 self._index = stop
--> 327 return numpy.frombuffer(chunk.get(start, stop, self, context), dtype=dtype)
328
329 _u1 = numpy.dtype("u1")
~/.local/lib/python3.7/site-packages/uproot/source/chunk.py in get(self, start, stop, cursor, context)
400 cursor.copy(),
401 context,
--> 402 self._source.file_path,
403 )
404
DeserializationError: while reading
TMatrixTSym<double> version 5 as uproot.dynamic.Model_TMatrixTSym_3c_double_3e__v2 (48 bytes)
(base): <TMatrixTBase<double> (version 5) at 0x7f914b62ef60>
Base classes for TMatrixTSym<double>: (TMatrixTBase<double>)
Members for TMatrixTSym<double>: fElements
attempting to get bytes 51:440926259
outside expected range 0:3528 for this Chunk
in file DataRelease/covmatrix_noreg.root
in object /covmatrixCbin;1
I attached the ROOT file I am trying to read the matrix from, in case you want to test this yourself: covmatrix_noreg.zip
It successfully read the TMatrixTBase and then failed trying to get the fElements, thinking that it needed to read bytes 51 through 440926259 to get it—that upper bound is clearly bogus (unless you have 55115776 double-precision numbers in fElements!). Since the upper bound comes from the TMatrixTBase (fNelems), maybe that was filled and TMatrixTBase has the right number of bytes, but the value is wrong.
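The arithmetic behind that number, using only the values from the error message:

```python
# Values copied from the DeserializationError above.
start, stop = 51, 440926259
chunk_size = 3528

# Each double-precision number takes 8 bytes.
n_doubles = (stop - start) // 8
print(n_doubles)          # 55115776
print(stop > chunk_size)  # True: the request runs far past the end of the Chunk
```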
So the automatically generated streamer is not right either, even when we manage to get it directly from ROOT. I'd have to look at the file directly. I'm looking at the file you attached.
I've started #484, which can read the matrices in the file you sent, but it's very weird: it doesn't have a separate header for the TMatrixTSym, as opposed to the TMatrixTBase, which is what got it off-track and garbled the fNelems, causing it to read past the end of the object. However, even when that number is not garbled, it's not the number of serialized values: it's equal to N**2, rather than N*(N + 1)/2. And those elements are not counted in the number of bytes: the number of bytes corresponds to the TMatrixTBase, rather than the TMatrixTSym.
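To make the two counts concrete, a short sketch comparing N**2 with N*(N + 1)/2 and recovering N from the packed length. Both function names are illustrative only.

```python
import math


def packed_length(n):
    """Number of values in one triangle (diagonal included) of an n x n matrix."""
    return n * (n + 1) // 2


def dimension_from_packed_length(length):
    """Invert packed_length; raise if `length` is not a triangular number."""
    n = (math.isqrt(8 * length + 1) - 1) // 2
    if packed_length(n) != length:
        raise ValueError("not a valid packed length: " + str(length))
    return n


n = 3
print(n ** 2, packed_length(n))
print(dimension_from_packed_length(6))
```

So using fNelems (N**2) as the array length asks for nearly twice as many values as were actually written.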
Your file was written with ROOT 5, which makes me a little wary, but the serialization of these classes hasn't been changed in 16 years. Maybe that's early enough that they don't follow standard conventions for streamers—it seems to be a "custom streamer," and having the streamer in the file doesn't help. (Maybe that's why it was not included in the first place...)
I'd feel more confident if we could test more cases, especially non-symmetric subclasses of TMatrixTBase. Do you have any of those?
The relevant subclasses are TMatrixT, TMatrixTSparse, and TMatrixTSym, and I can guess that they can be specialized with any numerical type, but I think in practice, only double is ever used. PR #484 should work for you, but I'd be more comfortable if we could include tests of all three subclasses. I think this whole system is based on custom streamers.
I'm putting that PR into draft mode so that I don't accidentally merge it. We'd need to be more certain that this is the right thing to do before including it.
Would adding some tests for TMatrixT, TMatrixTSparse, and TMatrixTSym be enough for that? Maybe both with ROOT 5 and ROOT 6 files? Or is the concern more about the principle of the matter?
It's not an in-principle thing: just an example of each of the subclasses would do. The changes I had to make for the symmetric matrix were arbitrary—assuming that no changes are needed for the other cases would likely be wrong, and fixing one but not the others would really muddle things.
Also I don't think it has to be both ROOT 5 and 6—the classes look old. I don't think they were changed recently, even on the timescale of the ROOT 5 → 6 transition.
I just tried to load the matrix with the PR branch of uproot 4, and it seems to work as expected.
Though now I have the problem that the resulting object does not have a to_numpy method. How do I actually access the matrix elements?
The raw values are in obj.member("fElements"), but these haven't been corrected for the symmetric-matrix packing (only the diagonal and above, or maybe below, are stored in that 1-dimensional array). A to_numpy method that presents it as a 2-dimensional matrix would be useful; it would have to be written as a member of the class defined at
https://github.com/scikit-hep/uproot4/blob/73f494980f310756d40a736030f0c96f7610f016/src/uproot/models/TMatrixT.py#L18
and similarly for the TMatrixT and TMatrixTSparse cases.
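Until something like that exists, the unpacking can be done by hand. Here is a sketch in plain Python, assuming the 1-dimensional array holds one triangle row by row. Whether ROOT stores the upper or the lower triangle is exactly the open question above, so the upper flag is a guess you would have to check against known values. unpack_symmetric is not an Uproot function.

```python
def unpack_symmetric(packed, n, upper=True):
    """Expand a row-by-row packed triangle into a full n x n symmetric matrix."""
    if len(packed) != n * (n + 1) // 2:
        raise ValueError("expected n*(n+1)/2 packed values")
    full = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        # Upper triangle: columns i..n-1; lower triangle: columns 0..i.
        cols = range(i, n) if upper else range(0, i + 1)
        for j in cols:
            full[i][j] = full[j][i] = packed[k]
            k += 1
    return full


packed = [1, 2, 3, 4, 5, 6]
print(unpack_symmetric(packed, 3, upper=True))
print(unpack_symmetric(packed, 3, upper=False))
```

The input would be something like list(obj.member("fElements")), with n taken from the fNrows member of the base class.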