Bug: writing sparse arrays with variable-length attributes
Consider this:

import numpy as np
import tiledb

array_name = "test"
ctx = tiledb.Ctx()
dom = tiledb.Domain(
    tiledb.Dim(name="id", domain=(0, 10), dtype=np.int64),
    ctx=ctx,
)
attr = tiledb.Attr(name="val", var=True, dtype=np.int64, ctx=ctx)
schema = tiledb.ArraySchema(domain=dom, sparse=True, attrs=[attr], ctx=ctx)
tiledb.SparseArray.create(array_name, schema)

vals = np.array([
    np.array([1, 2, 9], dtype=np.int64),
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')

with tiledb.open(array_name, "w") as array:
    array[[1, 2]] = dict(val=vals)

>>> ValueError: value length (6) does not match coordinate length (2)
This only happens when the subarrays in vals all have the same length, so together they form a rectangular block. There's no issue with either of the following:
vals = np.array([
    np.array([1, 2], dtype=np.int64),
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')

vals = np.array([
    np.array([1, 2, 9, 3], dtype=np.int64),
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')
I think it's because NumPy coalesces an object array whose subarrays all have the same length into a single multi-dimensional array:
vals_hetero = np.array([
    np.array([1, 2], dtype=np.int64),
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')

vals_homo = np.array([
    np.array([1, 2, 9], dtype=np.int64),
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')

print(vals_hetero)
>>> [array([1, 2]) array([3, 4, 5])]

print(vals_homo)
>>> [[1 2 9]
 [3 4 5]]

print(vals_hetero.size, vals_homo.size)
>>> 2 6
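A quick shape check (just standard NumPy behaviour, shown here for clarity) makes the coalescing visible: the ragged case stays a 1-D object array, while the homogeneous case becomes a 2-D array, which is where the size of 6 comes from.

print(vals_hetero.ndim, vals_hetero.shape)
>>> 1 (2,)

print(vals_homo.ndim, vals_homo.shape)
>>> 2 (2, 3)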
The exception is raised because TileDB-Py relies on an attr_val.size check in libtiledb.pyx#L5241.
Is there a workaround or an alternative way of constructing the object?
Hi @lunaroverlord,
Apologies for the delayed reply. A workaround for now is to prevent the NumPy array from automatically coalescing into a multi-dimensional array by appending None (or an empty or differently-sized array) at the end:
vals = np.array(
    [np.array([1, 2, 9], dtype=np.int64), np.array([3, 4, 5], dtype=np.int64), None],
    dtype="O",
)
Then slice the last element out when writing to the TileDB array:
with tiledb.open(array_name, "w") as array:
    array[[1, 2]] = dict(val=vals[:-1])
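For what it's worth, another construction that should avoid the coalescing entirely (untested here, relying only on standard NumPy semantics, and assuming the same array_name and schema as above) is to pre-allocate a 1-D object array and fill it element by element, so NumPy never merges the equal-length subarrays:

# Pre-allocating a 1-D object array and assigning each variable-length
# subarray individually keeps the array at size 2 (no coalescing).
vals = np.empty(2, dtype="O")
vals[0] = np.array([1, 2, 9], dtype=np.int64)
vals[1] = np.array([3, 4, 5], dtype=np.int64)

with tiledb.open(array_name, "w") as array:
    array[[1, 2]] = dict(val=vals)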
We are going to see if we can add better support for this in the future so that we don't have to use this workaround.
Please let us know if you have any questions or comments.
Encountering this bug now in 2024. Do you have a sense of whether this will be fixed soon?
This has not been a high priority to look at since there's a workaround, as commented above. However, we can bump the priority given that a few users have now run into the problem.
Friendly bump +1. Also experiencing this bug @nguyenv
Re-opening, although I can't give a timeline for an alternative solution. AFAICT there's no way to handle this through NumPy (because of the "coalescing"), so we'll probably need to provide some other input mechanism.
Trying to write some multi-attribute data to TileDB for TensorFlow model training. The model input/output contains a combination of variable-size sequential data and fixed-size image data. Currently the only way that works is to store every modality in a separate TileDB array because of the coalescing issue, which makes creating a TensorflowTileDBDataset slow. Do you have any other suggestions?
I cannot force my data to be of object dtype, as it is not under my control.
Are you able to use this workaround?
@ihnorton I am not, as I do not have control over the dataset generation process, and my dataset is too large to pre-process since it also includes image data, which is homogeneous.