Incorrect writing of jagged arrays into a TTree

Open vpec0 opened this issue 4 months ago • 23 comments

When a jagged array is written into a TTree in an output ROOT file and the array is larger than the TBasket, a segmentation violation occurs when the data are read in a ROOT session. Uproot stores an incorrect number of baskets, which possibly leads to the issue in ROOT.

Simple example:

import awkward as ak
import numpy as np
import uproot as ur

of=ur.recreate('test.root')
of['tree']={'wfm':ak.from_regular(ak.Array(np.arange(32000*5*6,dtype='float').reshape(32000,5,6)),axis=-2)}
of['tree'].num_baskets # this is equal to 1
of.close()

After opening in ROOT (note the size of the basket and number of baskets for branch wfm vs the size of the branch):

$ root test.root
root [0] 
Attaching file test.root as _file0...
(TFile *) 0x7f9e8df4a730
root [1] tree->Print()
******************************************************************************
*Tree    :tree      :                                                        *
*Entries :    32000 : Total =         7937482 bytes  File  Size =    1410225 *
*        :          : Tree compression factor =   5.63                       *
******************************************************************************
*Br    0 :nwfm      : nwfm/I                                                 *
*Entries :    32000 : Total  Size=     128552 bytes  File Size  =        729 *
*Baskets :        1 : Basket Size=      32000 bytes  Compression= 175.68     *
*............................................................................*
*Br    1 :wfm       : wfm[nwfm][6]/D                                         *
*Entries :    32000 : Total  Size=    7808656 bytes  File Size  =    1408155 *
*Baskets :        1 : Basket Size=      32000 bytes  Compression=   5.54     *
*............................................................................*
root [2] tree->Draw("wfm")
Info in <TCanvas::MakeDefCanvas>:  created default TCanvas with name c1
Warning in <TBasket::ReadBasketBuffers>: basket:wfm has fNevBuf=32000 but fEntryOffset=0, pos=22484, len=7808078, fNbytes=1408155, fObjlen=7808008, trying to repair

 *** Break *** segmentation violation
...

I note here that if the stored array is not jagged, ROOT gives no segmentation violation even though the size of the branch is larger than a single basket and the number of baskets for the branch is still 1.

vpec0 avatar Aug 29 '25 13:08 vpec0

To add to this, the data cannot be retrieved even with Uproot.

>>> with ur.open('test.root:tree') as t :
...     d = t.arrays()
...
Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/uproot/interpretation/numerical.py", line 359, in basket_array
    output = data.view(dtype).reshape((-1, *shape))
ValueError: cannot reshape array of size 20000 into shape (6)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.13/site-packages/uproot/behaviors/TBranch.py", line 889, in arrays
    _ranges_or_baskets_to_arrays(
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~^
        self,
        ^^^^^
    ...<9 lines>...
        interp_options,
        ^^^^^^^^^^^^^^^
    )
    ^
  File "/usr/local/lib/python3.13/site-packages/uproot/behaviors/TBranch.py", line 3204, in _ranges_or_baskets_to_arrays
    uproot.source.futures.delayed_raise(*obj)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^^^^
  File "/usr/local/lib/python3.13/site-packages/uproot/source/futures.py", line 38, in delayed_raise
    raise exception_value.with_traceback(traceback)
  File "/usr/local/lib/python3.13/site-packages/uproot/behaviors/TBranch.py", line 3140, in basket_to_array
    basket_array = interpretation.basket_array(
        basket.data,
    ...<6 lines>...
        interp_options,
    )
  File "/usr/local/lib/python3.13/site-packages/uproot/interpretation/jagged.py", line 180, in basket_array
    content = self._content.basket_array(
        data, None, basket, branch, context, cursor_offset, library, options
    )
  File "/usr/local/lib/python3.13/site-packages/uproot/interpretation/numerical.py", line 361, in basket_array
                raise ValueError(
    ...<3 lines>...
                ) from err
ValueError: basket 0 in tree/branch /tree;1:wfm has the wrong number of bytes (160000) for interpretation AsDtype("('>f8', (6,))")
in file test.root
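
The numbers in the two error messages are consistent with each other, which a few lines of arithmetic can confirm. This is only a sketch of the byte accounting: the constants are read off the traceback and the reproducer above, and it makes no claim about why the basket ended up with this size.

```python
ITEM_SIZE = 8          # bytes per big-endian float64 (">f8")
INNER_LENGTH = 6       # fixed inner dimension in AsDtype("('>f8', (6,))")

basket_bytes = 160000  # byte count reported in the second ValueError
num_items = basket_bytes // ITEM_SIZE   # the "array of size 20000" in the first ValueError
leftover = num_items % INNER_LENGTH     # non-zero, so reshape((-1, 6)) cannot succeed

# Payload that the full array of shape (32000, 5, 6) would occupy.
expected_bytes = 32000 * 5 * 6 * ITEM_SIZE

print(num_items, leftover, expected_bytes)
```

The basket content seen by the interpretation is therefore far smaller than the 7680000 bytes of float64 data in the array, and it is not a whole number of 6-element rows.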

vpec0 avatar Sep 04 '25 10:09 vpec0

Thank you for reporting this, @vpec0! I'll look into it.

ariostas avatar Sep 04 '25 13:09 ariostas

As far as I know, Uproot has no mechanism for splitting an array into multiple baskets, so it makes sense that it always writes a single basket for the branch.

It doesn't seem like the issue is related to the size of the array being larger than what the TBasket can hold, because even a small array fails to be written correctly.

>>> of=ur.recreate('test.root')
... of['tree']={'wfm': ak.from_regular(ak.Array(np.arange(32*5*6,dtype='float').reshape(-1,5,6)))}
... print(of['tree'].num_baskets) # this is equal to 1
... of.close()
... with ur.open('test.root:tree') as t:
...     d = t.arrays()

So it looks like the issue happens when it is a jagged array of regular arrays. I'm surprised that this wasn't tested.

I'm not very familiar with the TTree writing parts, so it might take me some time to figure it out.

In the meantime, I'd invite you to switch to using RNTuples instead of a TTree. We intend to fully support RNTuples, as opposed to the limited support that we offer for TTrees. Here's how you can do it.

>>> of=ur.recreate('test.root')
... data = {'wfm': ak.from_regular(ak.Array(np.arange(32*5*6,dtype='float').reshape(-1,5,6)))}
... of.mkrntuple("rntuple", data)
... of.close()
... with ur.open('test.root:rntuple') as t:
...     d = t.arrays()

ariostas avatar Sep 04 '25 13:09 ariostas

> As far as I know, Uproot has no mechanism for splitting an array into multiple baskets, so it makes sense that it always writes a single basket for the branch.

By the way, there have been cases in analysis code where having this mechanism in Uproot would have been very beneficial, instead of looping over chunks of the array and appending to the tree to write multiple TBaskets manually.
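
For reference, the manual chunking loop mentioned here might look like the sketch below. It relies on the documented behaviour that each `extend` call writes one TBasket per branch. The helper names `iter_chunks` and `write_in_chunks` are hypothetical, and the sketch assumes flat NumPy arrays of equal length; it does not address the jagged-of-regular case reported in this issue.

```python
def iter_chunks(num_entries, chunk_size):
    """Yield (start, stop) pairs that together cover range(num_entries)."""
    for start in range(0, num_entries, chunk_size):
        yield start, min(start + chunk_size, num_entries)


def write_in_chunks(path, tree_name, arrays, chunk_size):
    """Write a dict of equal-length 1D NumPy arrays, one extend call per chunk."""
    import uproot  # imported lazily so iter_chunks is usable without Uproot

    num_entries = len(next(iter(arrays.values())))
    with uproot.recreate(path) as fout:
        # Declare the branches up front from the arrays' dtypes.
        fout.mktree(tree_name, {name: arr.dtype for name, arr in arrays.items()})
        for start, stop in iter_chunks(num_entries, chunk_size):
            # Each extend call appends one basket to every branch.
            fout[tree_name].extend(
                {name: arr[start:stop] for name, arr in arrays.items()}
            )
```

A call such as `write_in_chunks("chunked.root", "tree", {"x": np.arange(10.0)}, chunk_size=4)` would issue three `extend` calls, covering entries 0-4, 4-8, and 8-10.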

ikrommyd avatar Sep 04 '25 14:09 ikrommyd

> In the meantime, I'd invite you to switch to using RNTuples instead of a TTree. We intend to fully support RNTuples, as opposed to the limited support that we offer for TTrees.

I don't mind using RNTuples. However, my analysis has nested records, for example:

print(data.typestr)
23 * {run: uint32, event: uint32, hasT0: uint16, end_xyz: 3 * float32, t0: float32, flash_time: var * float32, avg_wfm: var * float64, peak: var * {idx: int64, prominence: float64, ipr: float64}, wfm: var * 1024 * float64}

mkrntuple cannot write that data for me:

...
  File "/usr/local/lib/python3.13/site-packages/uproot/writing/_cascadentuple.py", line 496, in _build_field_col_records
    raise NotImplementedError(f"Form type {type(akform)} cannot be written yet")
NotImplementedError: Form type <class 'awkward.forms.indexedform.IndexedForm'> cannot be written yet

Or is there a workaround?

vpec0 avatar Sep 04 '25 14:09 vpec0

@ikrommyd yeah, that's definitely a useful feature and I'm planning to add it for RNTuples.

@vpec0 Sorry about that. It currently supports nested records, but I'm still missing a few Awkward form types. The one you point out is pretty easy to implement, so I'll aim to add it by the next release.

ariostas avatar Sep 04 '25 15:09 ariostas

@vpec0 I opened #1493 to add support for IndexedArrays. We'll have a release of Uproot next week, so it should work by then.

ariostas avatar Sep 04 '25 18:09 ariostas

Hi @vpec0, could you please check the new release and let us know whether https://github.com/scikit-hep/uproot5/pull/1493 fixes it? Thanks!

ianna avatar Sep 08 '25 07:09 ianna

I have tested the workaround that uses RNTuple with the latest Uproot version, 5.6.5. I still get an error when attempting to write the data:

    print(data)
    outf.mkrntuple(tree_name,data)

gives the following error:

    outf.mkrntuple(tree_name,data)
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/uproot/writing/writable.py", line 1376, in mkrntuple
    ntuple.extend(ak_data)
    ~~~~~~~~~~~~~^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/uproot/writing/writable.py", line 2180, in extend
    self._cascading.extend(self._file, self._file.sink, data)
    ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.13/site-packages/uproot/writing/_cascadentuple.py", line 971, in extend
    content = data_buffers[f"{next_barekey}-data"]
              ~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^
KeyError: 'node8-data'
Command exited with non-zero status 1

@ianna I am assuming nothing has been done on the TTree side, so there is no point in testing that, right?

vpec0 avatar Sep 08 '25 09:09 vpec0

@vpec0 nothing has been done on the TTree side yet.

Would you be able to pickle the data you're trying to save and attach it here so I can debug it? I must have missed something while implementing IndexedArrays.

ariostas avatar Sep 08 '25 13:09 ariostas

Never mind, I know what's wrong. I'll work on fixing it.

ariostas avatar Sep 08 '25 13:09 ariostas

Good, because I could not reproduce the error after I had pickled the array.

vpec0 avatar Sep 08 '25 13:09 vpec0

@vpec0 could you try using #1496 to see if it finally works?

ariostas avatar Sep 09 '25 19:09 ariostas

I think I have tried #1496 by pip-installing it in a venv:

python -m venv venv
. venv/bin/activate
pip install git+https://github.com/scikit-hep/uproot5.git@ariostas/fix_index_arrays
...

After running my script that tries to save the RNTuple, I still get an error.

    outf.mkrntuple(tree_name,ak.Array(data))
    ~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../venv/lib/python3.13/site-packages/uproot/writing/writable.py", line 1371, in mkrntuple
    ntuple.extend(ak_form_or_data)
    ~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^
  File ".../venv/lib/python3.13/site-packages/uproot/writing/writable.py", line 2180, in extend
    self._cascading.extend(self._file, self._file.sink, data)
    ~~~~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File ".../venv/lib/python3.13/site-packages/uproot/writing/_cascadentuple.py", line 984, in extend
    col_data = data_buffers[key]
               ~~~~~~~~~~~~^^^^^
KeyError: 'node14-offsets'
Command exited with non-zero status 1

vpec0 avatar Sep 10 '25 13:09 vpec0

Okay, thanks! I'll keep looking into it.

ariostas avatar Sep 10 '25 13:09 ariostas

Could you send me the output of ak.Array(data).layout.form so I can debug it? I'm not really sure what could be happening.

ariostas avatar Sep 10 '25 13:09 ariostas

Here you go:

{
    "class": "RecordArray",
    "fields": [
	"run",
	"event",
	"hasT0",
	"end_xyz",
	"t0",
	"flash_time",
	"avg_wfm",
	"peak",
	"wfm"
    ],
    "contents": [
	{
	    "class": "IndexedArray",
	    "index": "i64",
	    "content": "uint32"
	},
	{
	    "class": "IndexedArray",
	    "index": "i64",
	    "content": "uint32"
	},
	{
	    "class": "IndexedArray",
	    "index": "i64",
	    "content": "uint16"
	},
	{
	    "class": "IndexedArray",
	    "index": "i64",
	    "content": {
	        "class": "RegularArray",
	        "size": 3,
	        "content": {
	            "class": "IndexedArray",
	            "index": "i64",
	            "content": "float32"
	        }
	    }
	},
	{
	    "class": "IndexedArray",
	    "index": "i64",
	    "content": "float32"
	},
	{
	    "class": "IndexedArray",
	    "index": "i64",
	    "content": {
	        "class": "ListOffsetArray",
	        "offsets": "i64",
	        "content": {
	            "class": "IndexedArray",
	            "index": "i64",
	            "content": "float32"
	        }
	    }
	},
	{
	    "class": "IndexedArray",
	    "index": "i64",
	    "content": {
	        "class": "ListArray",
	        "starts": "i64",
	        "stops": "i64",
	        "content": "float64"
	    }
	},
	{
	    "class": "ListOffsetArray",
	    "offsets": "i64",
	    "content": {
	        "class": "RecordArray",
	        "fields": [
		    "idx",
		    "prominence",
		    "ipr"
		],
	        "contents": [
		    "int64",
		    "float64",
		    "float64"
		]
	    }
	},
	{
	    "class": "IndexedArray",
	    "index": "i64",
	    "content": {
	        "class": "ListOffsetArray",
	        "offsets": "i64",
	        "content": {
	            "class": "RegularArray",
	            "size": 1024,
	            "content": "float64"
	        }
	    }
	}
    ]
}
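
When debugging an error such as "Form type ... cannot be written yet", it can help to list which node classes a form contains. The sketch below walks a form given as a plain Python dict in the compact notation shown above. `count_form_classes` is a hypothetical helper, the sample dict holds only the `avg_wfm` and `wfm` fields copied from the form above, and it is an assumption that the full dict can be obtained with something like `ak.Array(data).layout.form.to_dict(verbose=False)`.

```python
from collections import Counter


def count_form_classes(form, counts=None):
    """Recursively tally the "class" of every node in a form dict."""
    if counts is None:
        counts = Counter()
    if isinstance(form, str):
        # Compact notation: a bare dtype name such as "float64" is a leaf.
        counts["primitive"] += 1
        return counts
    counts[form["class"]] += 1
    if "content" in form:
        count_form_classes(form["content"], counts)
    for child in form.get("contents", []):
        count_form_classes(child, counts)
    return counts


# Two fields copied from the form posted above.
form = {
    "class": "RecordArray",
    "fields": ["avg_wfm", "wfm"],
    "contents": [
        {"class": "IndexedArray", "index": "i64",
         "content": {"class": "ListArray", "starts": "i64", "stops": "i64",
                     "content": "float64"}},
        {"class": "IndexedArray", "index": "i64",
         "content": {"class": "ListOffsetArray", "offsets": "i64",
                     "content": {"class": "RegularArray", "size": 1024,
                                 "content": "float64"}}},
    ],
}

counts = count_form_classes(form)
print(dict(counts))
```

For this fragment the tally shows two IndexedArray nodes plus one each of ListArray, ListOffsetArray, and RegularArray, so a writer has to handle all of those classes, including the case where they are nested inside one another.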

vpec0 avatar Sep 10 '25 14:09 vpec0

@vpec0 Thanks so much for the help! Could you try again? I think now it should be good, but I'll add more tests to my PR to make sure I didn't miss anything.

ariostas avatar Sep 10 '25 16:09 ariostas

Writing the RNTuple proceeded without any error.

However, attempts to read it back fail:

with ur.open('testfile.root:rntuple_name') as t :
    d = t.arrays()
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Cell In[7], line 1
----> 1 t.arrays()

File .../venv/lib/python3.13/site-packages/uproot/behaviors/RNTuple.py:774, in HasFields.arrays(self, expressions, cut, filter_name, filter_typename, filter_field, aliases, language, entry_start, entry_stop, decompression_executor, array_cache, library, backend, interpreter, ak_add_doc, how, interpretation_executor, filter_branch)
772 entry_start -= cluster_offset
773 entry_stop -= cluster_offset
--> 774 arrays = uproot.extras.awkward().from_buffers(
775     form,
776     cluster_num_entries,
777     container_dict,
778     allow_noncanonical_form=True,
779     backend="cuda" if interpreter == "gpu" and backend == "cuda" else "cpu",
780 )[entry_start:entry_stop]
782 arrays = uproot.extras.awkward().to_backend(arrays, backend=backend)
783 # no longer needed; save memory

File .../venv/lib/python3.13/site-packages/awkward/_dispatch.py:41, in named_high_level_function.<locals>.dispatch(*args, **kwargs)
38 @wraps(func)
39 def dispatch(*args, **kwargs):
40     # NOTE: this decorator assumes that the operation is exposed under `ak.`
---> 41     with OperationErrorContext(name, args, kwargs):
42         gen_or_result = func(*args, **kwargs)
43         if isgenerator(gen_or_result):

File .../venv/lib/python3.13/site-packages/awkward/_errors.py:80, in ErrorContext.__exit__(self, exception_type, exception_value, traceback)
78     self._slate.__dict__.clear()
79     # Handle caught exception
---> 80     raise self.decorate_exception(exception_type, exception_value)
81 else:
82     # Step out of the way so that another ErrorContext can become primary.
83     if self.primary() is self:

File .../venv/lib/python3.13/site-packages/awkward/_dispatch.py:42, in named_high_level_function.<locals>.dispatch(*args, **kwargs)
38 @wraps(func)
39 def dispatch(*args, **kwargs):
40     # NOTE: this decorator assumes that the operation is exposed under `ak.`
41     with OperationErrorContext(name, args, kwargs):
---> 42         gen_or_result = func(*args, **kwargs)
43         if isgenerator(gen_or_result):
44             array_likes = next(gen_or_result)

File .../venv/lib/python3.13/site-packages/awkward/operations/ak_from_buffers.py:110, in from_buffers(form, length, container, buffer_key, backend, byteorder, allow_noncanonical_form, highlevel, behavior, attrs)
29 @high_level_function()
30 def from_buffers(
31     form,
(...)     41     attrs=None,
42 ):
43     """
44     Args:
45         form (#ak.forms.Form or str/dict equivalent): The form of the Awkward
(...)    108     See #ak.to_buffers for examples.
109     """
--> 110     return _impl(
111         form,
112         length,
113         container,
114         buffer_key,
115         backend,
116         byteorder,
117         highlevel,
118         behavior,
119         attrs,
120         allow_noncanonical_form,
121     )

File .../venv/lib/python3.13/site-packages/awkward/operations/ak_from_buffers.py:157, in _impl(form, length, container, buffer_key, backend, byteorder, highlevel, behavior, attrs, simplify)
151     raise TypeError(
152         "'form' argument must be a Form or its Python dict/JSON string representation"
153     )
155 getkey = regularize_buffer_key(buffer_key)
--> 157 out = _reconstitute(
158     form,
159     length,
160     container,
161     getkey,
162     backend,
163     byteorder,
164     simplify,
165     field_path=(),
166     shape_generator=lambda: (length,),
167 )
169 return wrap_layout(out, highlevel=highlevel, attrs=attrs, behavior=behavior)

File .../venv/lib/python3.13/site-packages/awkward/operations/ak_from_buffers.py:620, in _reconstitute(form, length, container, getkey, backend, byteorder, simplify, field_path, shape_generator)
611     return ak.contents.RegularArray(
612         content,
613         form.size,
614         length,
615         parameters=form._parameters,
616     )
618 elif isinstance(form, ak.forms.RecordForm):
619     contents = [
--> 620         _reconstitute(
621             content,
622             length,
623             container,
624             getkey,
625             backend,
626             byteorder,
627             simplify,
628             (*field_path, field),
629             shape_generator,
630         )
631         for content, field in zip(form.contents, form.fields)
632     ]
633     return ak.contents.RecordArray(
634         contents,
635         None if form.is_tuple else form.fields,
(...)    638         backend=backend,
639     )
641 elif isinstance(form, ak.forms.UnionForm):

File .../venv/lib/python3.13/site-packages/awkward/operations/ak_from_buffers.py:572, in _reconstitute(form, length, container, getkey, backend, byteorder, simplify, field_path, shape_generator)
569     else:
570         next_length = _adjust_length(offsets)
--> 572     content = _reconstitute(
573         form.content,
574         next_length,
575         container,
576         getkey,
577         backend,
578         byteorder,
579         simplify,
580         field_path,
581         _shape_generator,
582     )
583     return ak.contents.ListOffsetArray(
584         ak.index.Index(offsets),
585         content,
586         parameters=form._parameters,
587     )
589 elif isinstance(form, ak.forms.RegularForm):

File .../venv/lib/python3.13/site-packages/awkward/operations/ak_from_buffers.py:282, in _reconstitute(form, length, container, getkey, backend, byteorder, simplify, field_path, shape_generator)
279     (length,) = shape_generator()
280     return (_adjust_length(length),)
--> 282 data = _from_buffer(
283     backend.nplike,
284     raw_array,
285     dtype=dtype,
286     count=real_length,
287     byteorder=byteorder,
288     field_path=field_path,
289     shape_generator=_shape_generator,
290 )
291 if form.inner_shape != ():
292     data = backend.nplike.reshape(data, (length, *form.inner_shape))

File .../venv/lib/python3.13/site-packages/awkward/operations/ak_from_buffers.py:242, in _from_buffer(nplike, buffer, dtype, count, byteorder, field_path, shape_generator)
240 if not (isinstance(nplike, Jax) and nplike.is_currently_tracing()):
241     if array.size < count:
--> 242         raise TypeError(
243             f"size of array ({array.size}) is less than size of form ({count})"
244         )
245     return array[:count]
246 else:

TypeError: size of array (115496) is less than size of form (192244)

This error occurred while calling

ak.from_buffers(
RecordForm-instance
int64-instance
{'column-0-data': array([28867, 28867, 28867, 28867, 28867, 28867, 28...
allow_noncanonical_form = True
backend = 'cpu'
)

vpec0 avatar Sep 11 '25 09:09 vpec0

@vpec0 I'm so sorry it still doesn't work. Would it be easy/possible for you to somehow send me your data and code so that I can debug it and we don't have to keep going back and forth? You could attach it here, put it on CERN drive, or email it to me at ariostas[at]princeton.edu, whichever is more convenient for you.

ariostas avatar Sep 11 '25 13:09 ariostas

@ariostas I have sent you via email the path to the data file on CERN's eos and a link to my gitlab.cern.ch repository where the code is.

vpec0 avatar Sep 11 '25 14:09 vpec0

When I run my code against ca079b083, I get errors (the same as above: https://github.com/scikit-hep/uproot5/issues/1490#issuecomment-3279632621). After upgrading to 9ef23fdb1, however, writing to the RNTuple and reading it back works.

vpec0 avatar Sep 15 '25 08:09 vpec0

Thank you for the patience and help, @vpec0! I'm glad that it finally works. At some point I'll take a look at the TTree side, but that's less of a priority for me right now.

ariostas avatar Sep 15 '25 13:09 ariostas