`uproot.dask` is turning TBranches of fixed-size C arrays into Dask arrays with shape `(num_entries,)`, rather than `(num_entries, fixed_size)`
The issue raised in scikit-hep/uproot5#1116 is that @Jailbone's test case creates a TTree of `double[fixed_size]` (one fixed-size array per entry). This should be read as a 2D NumPy array of shape `(num_entries, fixed_size)`, but `uproot.dask` presents it to Dask as having shape `(num_entries,)`. Then, of course, Dask does wrong things with it.
Reproducer:
import uproot
import numpy as np
with uproot.recreate("test.root") as file:
    file["test_tree"] = {"test_branch": np.random.random((100, 10))}
>>> uproot.open("test.root:test_tree").show()
name | typename | interpretation
---------------------+--------------------------+-------------------------------
test_branch | double[10] | AsDtype("('>f8', (10,))")
>>> uproot.open("test.root:test_tree/test_branch").array(library="np").shape
(100, 10)
(Here, `fixed_size` is 10.)
But the lazy Dask array reports one shape before it is computed and another after:
>>> lazy = uproot.dask("test.root:test_tree", library="np")["test_branch"]
>>> lazy.shape
(100,)
>>> lazy.compute().shape
(100, 10)
There's only one place where Uproot creates a `dask.array`; it's here:
https://github.com/scikit-hep/uproot5/blob/724e3775959714274e03b57bd66e850a12508ad2/src/uproot/_dask.py#L459
Should we set the Dask array shape in `chunks`, or is that something else? If we know that the TBranch's Interpretation is `AsDtype` (the only type that can have more than one dimension), we can get the part of the shape beyond the number of entries with `inner_shape`:
>>> uproot.open("test.root:test_tree/test_branch").interpretation
AsDtype("('>f8', (10,))")
>>> uproot.open("test.root:test_tree/test_branch").interpretation.inner_shape
(10,)
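On the `chunks` question: as far as I understand Dask, an array's `shape` is not stored separately but is derived from `chunks`, which holds one tuple of block sizes per dimension. If that is right, appending one single-block dimension per element of `inner_shape` should be enough. The sketch below is pure Python and only illustrates the bookkeeping. The helper names `chunks_with_inner_shape` and `shape_from_chunks` are hypothetical and are not part of Uproot or Dask, and the partition sizes are made up.

```python
def chunks_with_inner_shape(entries_per_partition, inner_shape):
    """Build a Dask-style ``chunks`` tuple (hypothetical helper).

    The first dimension is split into one block per partition; each inner
    dimension (e.g. from ``interpretation.inner_shape``) is a single block.
    """
    entry_chunks = tuple(entries_per_partition)
    inner_chunks = tuple((size,) for size in inner_shape)
    return (entry_chunks,) + inner_chunks


def shape_from_chunks(chunks):
    """Total shape implied by ``chunks``: the sum of block sizes per dimension."""
    return tuple(sum(dim) for dim in chunks)


# 100 entries split into two (made-up) partitions, with inner_shape == (10,)
chunks = chunks_with_inner_shape([40, 60], (10,))
print(chunks)                     # ((40, 60), (10,))
print(shape_from_chunks(chunks))  # (100, 10)
```

With an empty `inner_shape` (an ordinary 1D branch) the result reduces to the current behaviour, `((40, 60),)`, so the change would only affect branches with fixed-size inner dimensions.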