Add `TILEDB_DATETIME_DAY` type support for Arrow
The Arrow C data interface supports date32[days], so this PR uses it as the Arrow conversion target for TILEDB_DATETIME_DAY.
The date32[days] type will be useful when Arrow is used, for example, to create a Pandas DataFrame.
All TileDB datetime attribute values use the same 64-bit representation as NumPy's np.datetime64, so we have to transform this 64-bit representation into the 32-bit representation that Arrow expects.
The contents of bufferinfo.data for a TILEDB_DATETIME_DAY attribute:
228 30 0 0 0 0 0 0 19 0 0 0 0 0 0 0 248 25 0 0 0 0 0 0 203 1 0 0 0 0 0 0
What we want instead, and what this PR produces, is the 32-bit representation:
228 30 0 0 19 0 0 0 248 25 0 0 203 1 0 0
Overflow should not be a practical concern: a signed 32-bit day count already spans roughly ±5.8 million years around the epoch, far beyond any realistic date.
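To make the byte layouts above concrete, here is a minimal Python sketch of the narrowing step. It is not the C++ code from this PR; `narrow_days_to_int32` is a hypothetical helper, and the sketch assumes little-endian buffers, as the byte dumps above suggest. It also adds an explicit range check that the PR itself may not perform.

```python
import struct

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def narrow_days_to_int32(raw: bytes) -> bytes:
    """Repack little-endian int64 day counts as little-endian int32.

    Hypothetical helper for illustration only.
    """
    if len(raw) % 8:
        raise ValueError("buffer length must be a multiple of 8 bytes")
    n = len(raw) // 8
    days = struct.unpack(f"<{n}q", raw)  # n signed 64-bit integers
    for d in days:
        if not INT32_MIN <= d <= INT32_MAX:
            raise OverflowError(f"day count {d} does not fit in 32 bits")
    return struct.pack(f"<{n}i", *days)  # n signed 32-bit integers


# The bufferinfo.data bytes quoted above (four int64 day counts).
raw64 = bytes([
    228, 30, 0, 0, 0, 0, 0, 0,
    19, 0, 0, 0, 0, 0, 0, 0,
    248, 25, 0, 0, 0, 0, 0, 0,
    203, 1, 0, 0, 0, 0, 0, 0,
])
raw32 = narrow_days_to_int32(raw64)
print(list(raw32))
# [228, 30, 0, 0, 19, 0, 0, 0, 248, 25, 0, 0, 203, 1, 0, 0]
```

The four values decode to 7908, 19, 6648 and 459 days since the epoch in both layouts; only the width of each element changes.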
The initial issue:
R
> library(tiledb)
> library(palmerpenguins)
> praw <- penguins_raw
> fromDataFrame(praw, "/tmp/penguinsraw")
python
>>> import tiledb
>>> a = tiledb.open("/tmp/penguinsraw/")
>>> a.df[:]
This used to raise the following error. With this PR it no longer does.
TileDBError Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 a.df[:]
File ~/work/git/TileDB-Py/tiledb/multirange_indexing.py:259, in _BaseIndexer.__getitem__(self, idx)
257 self.subarray = Subarray(self.array)
258 self._set_ranges(idx)
--> 259 return self if self.return_incomplete else self._run_query()
File ~/work/git/TileDB-Py/tiledb/multirange_indexing.py:401, in DataFrameIndexer._run_query(self)
399 elif self.use_arrow:
400 with timing("buffer_conversion_time"):
--> 401 table = self.pyquery._buffers_to_pa_table()
403 columns = []
404 pa_schema = table.schema
TileDBError: TileDB-Arrow: tiledb datatype not understood ('DATETIME_DAY', cell_val_num: 1)
The underlying NumPy data model has resolution increments for every power of ten. R very much does not: it has a native 'Date' (integer width) and POSIXct, aka Datetime (double), plus an add-on package for nanoseconds. So for the R package I mapped these to the two corresponding resolutions:
> uri <- tempfile()
> D <- data.frame(ind = 1:10, days = Sys.Date() + 0:9, seconds = Sys.time() + 0:9)
> fromDataFrame(D, uri, col_index=1)
> schema(uri)
tiledb_array_schema(
domain=tiledb_domain(c(
tiledb_dim(name="ind", domain=c(1L,10L), tile=10L, type="INT32", filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))
)),
attrs=c(
tiledb_attr(name="days", type="DATETIME_DAY", ncells=1, nullable=FALSE, filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1)))),
tiledb_attr(name="seconds", type="DATETIME_MS", ncells=1, nullable=FALSE, filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))
),
cell_order="COL_MAJOR", tile_order="COL_MAJOR", capacity=10000, sparse=TRUE, allows_dups=TRUE,
coords_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))),
offsets_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))),
validity_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("RLE"),"COMPRESSION_LEVEL",-1)))
)
> chk <- tiledb_array(uri, return_as="data.frame")[]
> chk
ind days seconds
1 1 2024-08-29 2024-08-29 09:35:01
2 2 2024-08-30 2024-08-29 09:35:02
3 3 2024-08-31 2024-08-29 09:35:03
4 4 2024-09-01 2024-08-29 09:35:04
5 5 2024-09-02 2024-08-29 09:35:05
6 6 2024-09-03 2024-08-29 09:35:06
7 7 2024-09-04 2024-08-29 09:35:07
8 8 2024-09-05 2024-08-29 09:35:08
9 9 2024-09-06 2024-08-29 09:35:09
10 10 2024-09-07 2024-08-29 09:35:10
>
I do not know pandas very well (or at all, really), so I am not sure why you need the bit-operation logic (but maybe it is just standard casting...). Can you not rely on the Arrow-level representation for DAY and DATETIME_MS? If you do and I missed it, my bad.
PS This becomes clearer when I read as arrow (well: nanoarrow, at user-level return converted to Arrow):
> chk <- tiledb_array(uri, return_as="arrow")[]
> chk
Table
10 rows x 3 columns
$ind <int32 not null>
$days <date32[day] not null>
$seconds <timestamp[ms] not null>
>
How do we make sure we're not leaking the rest of the buffer?
The only difference is the data shifting. The array_ variable is being freed in the same way as before: https://github.com/TileDB-Inc/TileDB-Py/blob/687d54959417c94a236c7d38b6b3297231087bfe/tiledb/py_arrow_io_impl.h#L571
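As a rough illustration of why shifting the data does not change buffer ownership, the following Python sketch compacts the values inside the same allocation. It assumes the conversion works in place, which is how I read "data shifting" here; `compact_in_place` is a hypothetical name, not a function in TileDB-Py.

```python
import struct


def compact_in_place(buf: bytearray) -> int:
    """Rewrite int64 day counts as int32 at the front of the same buffer.

    Returns the number of meaningful bytes afterwards. The allocation
    itself is untouched, so it is still released as a whole.
    """
    n = len(buf) // 8
    for i in range(n):
        # Read element i (8 bytes) before writing its 4-byte form; the
        # write offset never runs ahead of data that is still unread.
        (day,) = struct.unpack_from("<q", buf, i * 8)
        struct.pack_into("<i", buf, i * 4, day)
    return n * 4


buf = bytearray([
    228, 30, 0, 0, 0, 0, 0, 0,
    19, 0, 0, 0, 0, 0, 0, 0,
    248, 25, 0, 0, 0, 0, 0, 0,
    203, 1, 0, 0, 0, 0, 0, 0,
])
used = compact_in_place(buf)
print(used, len(buf))  # 16 32
```

Only the first `used` bytes carry data after the shift. The tail is stale but still belongs to the same buffer, so freeing the buffer once covers it.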