`from_pandas` should be more flexible than requiring a full row on ingestion

Open kylemann16 opened this issue 11 months ago • 0 comments

I am trying to convert my project from using sparse arrays to dense arrays, and I ran into a lot of problems while trying to use the same methods I had been using on sparse arrays, specifically from_pandas.

Is it correct that TileDB requires an entire row of data to be consumed at the same time in order to use from_pandas?

The data I work with is represented in a dataframe with a MultiIndex and varies widely in size (state-sized LiDAR point cloud data), so there is a high likelihood that a single row of data is too large to consume at once while also running all the pre-TileDB processing I need to run over it.

To me, it should be possible to call from_pandas on a dataframe that matches your TileDB array and have it inserted into the array based on the indices it finds there. When I followed from_pandas through its flow, I noticed that much of the logic required for this is already available, but is skipped over or not used in favor of a row index slice.
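
To make the requested behavior concrete, here is a minimal sketch of index-driven placement. It is not TileDB-Py code and not the implementation in the linked commit: a plain NumPy array stands in for the dense array, and the function name `write_block_by_index`, its signature, and the sample column `z` are hypothetical. It assumes a two-level integer MultiIndex whose entries cover a complete rectangular block, which may be narrower than a full row.

```python
import numpy as np
import pandas as pd


def write_block_by_index(dense, df, column):
    """Write df[column] into `dense` at the cells named by df's 2-level MultiIndex.

    Hypothetical helper: `dense` is a NumPy array standing in for a dense array.
    """
    # Sort lexicographically so values are ordered row by row, then column by column.
    df = df.sort_index()
    rows = df.index.get_level_values(0).to_numpy()
    cols = df.index.get_level_values(1).to_numpy()
    r0, r1 = rows.min(), rows.max()
    c0, c1 = cols.min(), cols.max()
    shape = (r1 - r0 + 1, c1 - c0 + 1)
    # A dense sub-array write needs every cell of the bounding box exactly once.
    if df.index.has_duplicates or len(df) != shape[0] * shape[1]:
        raise ValueError("index does not cover a full rectangular block")
    # Only the bounding box is touched; the rest of the target is left as-is.
    dense[r0:r1 + 1, c0:c1 + 1] = df[column].to_numpy().reshape(shape)
    return (r0, r1), (c0, c1)


# Target with 4 rows and 6 columns; the chunk covers rows 1-2, columns 2-4 only.
dense = np.zeros((4, 6), dtype=np.float64)
index = pd.MultiIndex.from_product([[1, 2], [2, 3, 4]], names=["x", "y"])
chunk = pd.DataFrame({"z": np.arange(1.0, 7.0)}, index=index)
bounds = write_block_by_index(dense, chunk, "z")
```

The point of the sketch is that the write region is derived from the dataframe's own index rather than from a row offset, so a chunk that spans only part of a row can still be placed correctly.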

I have created a branch where I've written a preliminary implementation of the feature (and a test) with no interruption to current usage, and I can make a PR if you're interested in it: https://github.com/kylemann16/TileDB-Py/commit/60defc0b92057323855aa7e479e9a64e65c9e0a2

It's a pretty rudimentary implementation, and I'm certain I don't know all the implications it would have, but it passes tests and works when I use it for my project.

If I'm missing something and this is redundant, or if it's not in line with how you'd like TileDB-Py to work, I'd love to get some feedback or discussion going. As it currently stands, from_pandas is only useful to me for sparse arrays.

Thank you!

Feb 11 '25 22:02 kylemann16