Unable to load a large Delta table

Open mmuru opened this issue 4 years ago • 3 comments

I am able to load smaller Delta tables but unable to load a larger Delta table (4.2 billion rows across 1K Parquet files). I tried bumping up batch_size, but it did not help. Currently in ROAPI, I do not see an option to pre-filter at the Arrow dataset level, nor an option to distribute the dataset across multiple nodes (distributed dataset). Any suggestions on how to achieve this in ROAPI? Are these known issues with ROAPI? Please let me know how to fix it. Thanks.

mmuru avatar Aug 04 '21 13:08 mmuru

We currently only support datasets that fit into memory on a single machine. It's very possible that your data size exceeded the memory limit. Applying a partition filter when loading datasets should be pretty easy to add. So if you only need to serve a subset of the Delta table and that subset fits into machine memory, then this would work for you.

Distributed sharding of large datasets is a different beast. It is something I am planning to add to ROAPI in the future, but it likely won't happen in the short term. I am happy to collaborate on such a feature sooner if there is enough community interest :)

houqp avatar Aug 05 '21 04:08 houqp
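
For readers following along, the partition-filter idea mentioned above can be sketched in a few lines of Python. This is only an illustration of pruning files by partition value before anything is loaded into memory, not ROAPI's or delta-rs's implementation. The helper names (parse_partitions, prune_files) and the Hive-style year=/month= layout are assumptions made for the example; a real Delta table records partition values in its transaction log rather than relying on paths alone.

```python
from typing import Dict, Iterable, List


def parse_partitions(path: str) -> Dict[str, str]:
    """Extract Hive-style key=value segments from a relative file path."""
    parts: Dict[str, str] = {}
    for segment in path.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts


def prune_files(paths: Iterable[str], wanted: Dict[str, str]) -> List[str]:
    """Keep only files whose partition values match every entry in `wanted`."""
    return [
        p
        for p in paths
        if all(parse_partitions(p).get(k) == v for k, v in wanted.items())
    ]


# Hypothetical file listing for a table partitioned by year and month.
files = [
    "year=2020/month=12/part-0000.parquet",
    "year=2021/month=07/part-0001.parquet",
    "year=2021/month=08/part-0002.parquet",
]

# Only the files in the requested partition would then be read into memory.
selected = prune_files(files, {"year": "2021", "month": "08"})
print(selected)
```

If the selected subset fits in memory, only those files need to be read, which is the situation the reply above describes.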

@houqp: Thanks for the clarification. In the real world, handling large datasets on a single machine is impossible and requires distributed sharding across multiple nodes. I believe ROAPI has almost every feature except this one. It would be very valuable to support this in ROAPI as early as possible. I am happy to discuss and collaborate on this.

In the meantime, I will explore your suggestions. FYI, it throws an exception if the Delta table has partitions. I need to dig into this issue.

mmuru avatar Aug 05 '21 15:08 mmuru

I totally agree with you @mmuru. ROAPI was initially created to serve small to midsize datasets. I wanted to validate the end-user experience first before starting to work on distributed serving. I have been busy working on a new DataFusion release lately, but I plan to get back to ROAPI after that.

> FYI, it throws exception if delta table has partitions. I need to dig this issue.

This might be a bug in the delta-rs crates that I also happen to maintain. Do you mind sharing the error with me in a separate issue? I can help look into that as well.

houqp avatar Aug 06 '21 06:08 houqp

This is now supported by setting use_memory_table to false.

houqp avatar Nov 22 '23 06:11 houqp
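
As a rough sketch of what that setting might look like in a ROAPI YAML config file: the table name and URI below are placeholders, and the exact placement of use_memory_table under the table's option block is an assumption that should be checked against the ROAPI documentation for the version in use.

```yaml
tables:
  - name: "large_delta_table"        # placeholder table name
    uri: "./path/to/delta_table"     # placeholder location of the Delta table
    option:
      format: "delta"
      use_memory_table: false        # do not load the whole table into memory at startup
```

With the option disabled, the expectation is that data is read at query time instead of being loaded up front, which is what makes tables larger than memory servable.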