No compression of timestamps (t.tdb) after consolidation
I noticed there's no (default) compression on timestamps after consolidation, at least via the python api for sparse arrays. This wouldn't be an issue, but I also can't find a way to set a compression filter on these values (unlike user-defined attributes, dimensions, coordinates, and offsets). Am I missing a global default compression argument? I can't find any reference to one.
With large sparse arrays, this becomes a meaningful contributor to on-disk size.
For example:
```python
import tiledb
import numpy as np
from itertools import product

array_path = 'test_array'

dim1 = tiledb.Dim(name="d1", domain=(0, 100), dtype=np.uint64)
dim2 = tiledb.Dim(name="d2", domain=(0, 100), dtype=np.uint64)
domain = tiledb.Domain(dim1, dim2)

# define attributes
attributes = [tiledb.Attr(name='attr1', dtype=np.dtype('uint64'), fill=0)]

# generate a schema
schema = tiledb.ArraySchema(
    domain=domain, attrs=attributes, sparse=True, allows_duplicates=True,
    coords_filters=[tiledb.filter.ZstdFilter(9)])
tiledb.Array.create(array_path, schema)

d1, d2 = np.asarray(list(product(range(100), range(100)))).T

array = tiledb.open(array_path, 'w')
array[d1, d2] = {'attr1': np.full(10000, 1)}  # write 1
array[d1, d2] = {'attr1': np.full(10000, 2)}  # write 2
array.close()

tiledb.consolidate(array_path)
```
Compare the file sizes of a0.tdb and t.tdb in the consolidated fragment: they match, suggesting both are stored as uncompressed uint64 values.
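To illustrate how much is being left on the table, here is a small standalone sketch (no TileDB required) that simulates the per-cell timestamp column of a consolidated fragment: two writes of 10,000 cells each, so the column holds only two distinct uint64 values. The timestamp values themselves are made up for the example, and zlib is just a stand-in for whatever filter TileDB would apply; the point is that this kind of data compresses by orders of magnitude.

```python
import zlib
import numpy as np

# Simulated timestamp column for a consolidated fragment built from two
# writes of 10,000 cells each: every cell carries one of two values.
# The concrete timestamps below are arbitrary placeholders.
timestamps = np.concatenate([
    np.full(10000, 1700000000001, dtype=np.uint64),
    np.full(10000, 1700000000002, dtype=np.uint64),
])

raw = timestamps.tobytes()            # 20,000 cells x 8 bytes = 160,000 bytes
compressed = zlib.compress(raw, 9)    # stand-in for a Zstd-style filter

print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes")
```

Even a generic byte-level compressor shrinks this column by well over 100x, so storing it uncompressed is nearly all overhead.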
Alternatively, if it were possible to avoid creating a t.tdb file where it isn't needed (e.g. if I set all timestamps to 1 while writing, so they all match and the column carries only redundant information), that would also be a great workaround. I don't know whether that's more feasible than applying compression to the timestamp data.