TileDB-Py

Add `TILEDB_DATETIME_DAY` type support for Arrow

Open kounelisagis opened this issue 1 year ago • 3 comments

The Arrow C data interface supports date32[days]. Let's use it as the Arrow conversion target for TILEDB_DATETIME_DAY.

The date32[days] type will be useful when Arrow is used, for example, to create a Pandas DataFrame.

Since all TileDB datetime attribute values use the same representation as NumPy's np.datetime64, we have to transform this 64-bit representation into the 32-bit representation that Arrow expects.

The contents of bufferinfo.data for a TILEDB_DATETIME_DAY attribute are currently:

228 30 0 0 0 0 0 0 19 0 0 0 0 0 0 0 248 25 0 0 0 0 0 0 203 1 0 0 0 0 0 0

What we want in the 32-bit representation, and what this PR produces, is:

228 30 0 0 19 0 0 0 248 25 0 0 203 1 0 0
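To make the two byte dumps concrete, here is a minimal sketch in plain Python (standard library only, not the code from this PR) that decodes the 64-bit buffer into day counts and re-encodes them as 32-bit values. It assumes little-endian layout, which is what the dumps suggest; the variable names are purely illustrative.

```python
import struct
from datetime import date, timedelta

# The 64-bit buffer quoted above: four little-endian int64 day counts.
raw64 = bytes([
    228, 30, 0, 0, 0, 0, 0, 0,
    19, 0, 0, 0, 0, 0, 0, 0,
    248, 25, 0, 0, 0, 0, 0, 0,
    203, 1, 0, 0, 0, 0, 0, 0,
])

count = len(raw64) // 8

# Decode as signed 64-bit integers (days since 1970-01-01).
days = struct.unpack(f"<{count}q", raw64)

# Re-encode the same values as signed 32-bit integers, the width date32 uses.
raw32 = struct.pack(f"<{count}i", *days)

# The calendar dates are identical regardless of the storage width.
epoch = date(1970, 1, 1)
dates = [epoch + timedelta(days=d) for d in days]

print(days)         # (7908, 19, 6648, 459)
print(list(raw32))  # [228, 30, 0, 0, 19, 0, 0, 0, 248, 25, 0, 0, 203, 1, 0, 0]
```

Only the storage width changes; the day counts and the dates they represent stay the same.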

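Overflow seems impossible in practice: a signed 32-bit day count already covers millions of years on either side of the epoch, far more than any realistic date needs, even though a 64-bit buffer can technically hold larger values.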
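If a defensive check were wanted anyway, it could look roughly like the sketch below. This is an assumption about how one might guard the narrowing, not something this PR does; `pack_days_as_date32` is a hypothetical helper name. Note that NumPy stores NaT as the minimum int64 value, so a NaT would also be rejected by this check and would need separate handling.

```python
import struct

INT32_MIN = -(2**31)
INT32_MAX = 2**31 - 1


def pack_days_as_date32(days):
    """Hypothetical helper: pack day counts as little-endian int32 values.

    Raises OverflowError if any value does not fit in 32 bits.
    """
    days = list(days)
    for d in days:
        if not INT32_MIN <= d <= INT32_MAX:
            raise OverflowError(f"day count {d} does not fit in date32")
    return struct.pack(f"<{len(days)}i", *days)


# Realistic day counts pass through unchanged.
print(list(pack_days_as_date32([7908, 19])))
```

2**31 days is roughly 5.8 million years, so a check like this would only ever fire on corrupt data or sentinel values.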


The initial issue:

R
> library(tiledb)
> library(palmerpenguins)
> praw <- penguins_raw
> fromDataFrame(praw, "/tmp/penguinsraw")
python
>>> import tiledb
>>> a = tiledb.open("/tmp/penguinsraw/")
>>> a.df[:]

This used to fail with the following error. With this PR it no longer does.

TileDBError                               Traceback (most recent call last)
Input In [3], in <cell line: 1>()
----> 1 a.df[:]

File ~/work/git/TileDB-Py/tiledb/multirange_indexing.py:259, in _BaseIndexer.__getitem__(self, idx)
    257     self.subarray = Subarray(self.array)
    258     self._set_ranges(idx)
--> 259 return self if self.return_incomplete else self._run_query()

File ~/work/git/TileDB-Py/tiledb/multirange_indexing.py:401, in DataFrameIndexer._run_query(self)
    399 elif self.use_arrow:
    400     with timing("buffer_conversion_time"):
--> 401         table = self.pyquery._buffers_to_pa_table()
    403     columns = []
    404     pa_schema = table.schema

TileDBError: TileDB-Arrow: tiledb datatype not understood ('DATETIME_DAY', cell_val_num: 1)

kounelisagis avatar Jul 09 '24 11:07 kounelisagis

The underlying NumPy data model has resolution increments for every power of ten. R very much does not: it has a native 'Date' (integer width) and POSIXct aka Datetime (double), plus an add-on package for nanoseconds. So for the R package I mapped that at the two different corresponding resolutions:

> uri <- tempfile()
> D <- data.frame(ind = 1:10, days = Sys.Date() + 0:9, seconds = Sys.time() + 0:9)
> fromDataFrame(D, uri, col_index=1)
> schema(uri)
tiledb_array_schema(
    domain=tiledb_domain(c(
        tiledb_dim(name="ind", domain=c(1L,10L), tile=10L, type="INT32", filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))
    )),
    attrs=c(
        tiledb_attr(name="days", type="DATETIME_DAY", ncells=1, nullable=FALSE, filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1)))),
        tiledb_attr(name="seconds", type="DATETIME_MS", ncells=1, nullable=FALSE, filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))))
    ),
    cell_order="COL_MAJOR", tile_order="COL_MAJOR", capacity=10000, sparse=TRUE, allows_dups=TRUE,
    coords_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))),
    offsets_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("ZSTD"),"COMPRESSION_LEVEL",-1))),
    validity_filter_list=tiledb_filter_list(c(tiledb_filter_set_option(tiledb_filter("RLE"),"COMPRESSION_LEVEL",-1)))
)
> chk <- tiledb_array(uri, return_as="data.frame")[]
> chk
   ind       days             seconds
1    1 2024-08-29 2024-08-29 09:35:01
2    2 2024-08-30 2024-08-29 09:35:02
3    3 2024-08-31 2024-08-29 09:35:03
4    4 2024-09-01 2024-08-29 09:35:04
5    5 2024-09-02 2024-08-29 09:35:05
6    6 2024-09-03 2024-08-29 09:35:06
7    7 2024-09-04 2024-08-29 09:35:07
8    8 2024-09-05 2024-08-29 09:35:08
9    9 2024-09-06 2024-08-29 09:35:09
10  10 2024-09-07 2024-08-29 09:35:10
> 

I do not know pandas very well (or, at all, really) so I am not sure why you need the bit-operation logic (but maybe it is just standard casting...). Can you not resort to the Arrow-level representation for DAY and DATETIME_MS? If you do and I missed it, my bad.

eddelbuettel avatar Aug 29 '24 14:08 eddelbuettel

PS This becomes clearer when I read as arrow (well: nanoarrow, at user-level return converted to Arrow):

> chk <- tiledb_array(uri, return_as="arrow")[]
> chk
Table
10 rows x 3 columns
$ind <int32 not null>
$days <date32[day] not null>
$seconds <timestamp[ms] not null>
> 

eddelbuettel avatar Aug 29 '24 14:08 eddelbuettel

How do we make sure we're not leaking the rest of the buffer?

The only difference is the data shifting. The array_ variable is being freed in the same way as before: https://github.com/TileDB-Inc/TileDB-Py/blob/687d54959417c94a236c7d38b6b3297231087bfe/tiledb/py_arrow_io_impl.h#L571

kounelisagis avatar Oct 09 '24 12:10 kounelisagis
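The "data shifting" in the last reply can be pictured with a small sketch. It assumes the narrowing is done in place inside the existing allocation, with little-endian values; that is a reading of the reply, not a copy of the C++ in py_arrow_io_impl.h, and `shift_days_in_place` is a hypothetical name. A bytearray stands in for the buffer.

```python
def shift_days_in_place(buf: bytearray) -> int:
    """Compact little-endian int64 values into int32 values inside `buf`.

    Returns the number of leading bytes that hold the 32-bit values.
    The buffer itself is not resized.
    """
    count = len(buf) // 8
    for i in range(count):
        # Copy the low 4 bytes of value i to its new 4-byte slot.
        # The destination never runs ahead of the source, so no value
        # is overwritten before it has been read.
        buf[4 * i:4 * i + 4] = buf[8 * i:8 * i + 4]
    return count * 4


buf = bytearray([
    228, 30, 0, 0, 0, 0, 0, 0,
    19, 0, 0, 0, 0, 0, 0, 0,
    248, 25, 0, 0, 0, 0, 0, 0,
    203, 1, 0, 0, 0, 0, 0, 0,
])
used = shift_days_in_place(buf)

print(len(buf))         # 32: the allocation is unchanged
print(list(buf[:used])) # the 16 bytes that now hold the date32 values
```

Under that assumption the allocation keeps its original size and is released as a single unit, so the unused tail is freed together with the rest rather than leaked.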