Bug: writing sparse arrays with variable-length attributes
Consider this:

import numpy as np
import tiledb

array_name = "test"
ctx = tiledb.Ctx()
dom = tiledb.Domain(
    tiledb.Dim(name="id", domain=(0, 10), dtype=np.int64),
    ctx=ctx,
)
attr = tiledb.Attr(name="val", var=True, dtype=np.int64, ctx=ctx)
schema = tiledb.ArraySchema(domain=dom, sparse=True, attrs=[attr], ctx=ctx)
tiledb.SparseArray.create(array_name, schema)

vals = np.array([
    np.array([1, 2, 9], dtype=np.int64),
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')

with tiledb.open(array_name, "w") as array:
    array[[1, 2]] = dict(val=vals)

>>> ValueError: value length (6) does not match coordinate length (2)
This only happens when the subarrays in vals all have the same length, so together they form a rectangular block. There's no issue with either of the following:
vals = np.array([
    np.array([1, 2], dtype=np.int64),
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')

vals = np.array([
    np.array([1, 2, 9, 3], dtype=np.int64),
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')
I think it's because NumPy coalesces an object array whose subarrays all have the same length into a single multi-dimensional array:
vals_hetero = np.array([
    np.array([1, 2], dtype=np.int64),
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')

vals_homo = np.array([
    np.array([1, 2, 9], dtype=np.int64),
    np.array([3, 4, 5], dtype=np.int64)
], dtype='O')

print(vals_hetero)
>>> [array([1, 2]) array([3, 4, 5])]

print(vals_homo)
>>> [[1 2 9]
 [3 4 5]]

print(vals_hetero.size, vals_homo.size)
>>> 2 6
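A quick shape check (just standard NumPy behaviour, shown here for clarity) makes the coalescing visible: the ragged case stays a 1-D object array, while the homogeneous case becomes a 2-D array, which is where the size of 6 comes from.

print(vals_hetero.ndim, vals_hetero.shape)
>>> 1 (2,)

print(vals_homo.ndim, vals_homo.shape)
>>> 2 (2, 3)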
The exception is raised because TileDB-Py relies on an attr_val.size check in libtiledb.pyx#L5241.
Is there a workaround or an alternative way of constructing the object?
Hi @lunaroverlord,
Apologies for the delayed reply. A workaround for now is to prevent the NumPy array from automatically coalescing into a multi-dimensional array by appending None (or an empty or differently-sized array) at the end:
vals = np.array(
    [np.array([1, 2, 9], dtype=np.int64), np.array([3, 4, 5], dtype=np.int64), None],
    dtype="O",
)
Then slice the last element out when writing to the TileDB array:
with tiledb.open(array_name, "w") as array:
    array[[1, 2]] = dict(val=vals[:-1])
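For what it's worth, another construction that should avoid the coalescing entirely (untested here, relying only on standard NumPy semantics, and assuming the same array_name and schema as above) is to pre-allocate a 1-D object array and fill it element by element, so NumPy never merges the equal-length subarrays:

# Pre-allocating a 1-D object array and assigning each variable-length
# subarray individually keeps the array at size 2 (no coalescing).
vals = np.empty(2, dtype="O")
vals[0] = np.array([1, 2, 9], dtype=np.int64)
vals[1] = np.array([3, 4, 5], dtype=np.int64)

with tiledb.open(array_name, "w") as array:
    array[[1, 2]] = dict(val=vals)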
We are going to see if we can add better support for this in the future so that we don't have to use this workaround.
Please let us know if you have any questions or comments.
Encountering this bug now in 2024. Do you have a sense of whether this will be fixed soon?
This has not been a high priority to look at since there's a workaround, as commented above. However, we can bump the priority given that a few users have now run into the problem.
Friendly bump +1. Also experiencing this bug @nguyenv
Re-opening, although I can't give a timeline for an alternative solution. AFAICT there's no way to handle this through NumPy (because of the "coalescing"), so we'll probably need to provide some other input mechanism.
Trying to write some multi-attribute data to TileDB for TensorFlow model training. The model input/output contains a combination of variable-size sequential data and fixed-size image data. Currently the only way that works is to store every modality in a separate TileDB array because of the coalescing issue, which makes creating a TensorflowTileDBDataset slow. Do you have any other suggestions?
I cannot force my data to be of object dtype, as it is not under my control.
Are you able to use this workaround?
@ihnorton I am not, as I do not have control over the dataset generation process, and my dataset is too large to pre-process since it also includes image data, which is homogeneous.