feat: Consideration for Batch Data Retrieval Support?

Open stereoF opened this issue 1 year ago • 1 comments

Is your feature request related to a problem?

I would like to propose a feature request for your consideration: is there any plan to support data retrieval in batches?

We currently face the following scenario:

1. We are ETLing data from Trino to ClickHouse. This ETL process may involve a series of data manipulations, with the resultant data being stored in ClickHouse.
2. We read data from ClickHouse for machine learning training purposes. If the dataset is large, we might need to read the data in batches for training and updating the model.

In both of these processes, attempting to read all the data at once could encounter limitations due to the memory capacity of a single machine. However, retrieving data in batches could avoid excessive memory consumption.

Is there a plan to support batch data retrieval, or perhaps there is a better solution already available?

Describe the solution you'd like

I would like to suggest adding support for data retrieval in batches, or alternatively, providing better solutions, such as dedicated ETL components.

What version of ibis are you running?

'7.1.0'

What backend(s) are you using, if any?

trino, clickhouse

Code of Conduct

  • [X] I agree to follow this project's Code of Conduct

stereoF, Jan 26 '24 09:01

hi @stereoF, thanks for opening! table.to_pandas_batches() and table.to_pyarrow_batches() are already supported. Would that be sufficient for your use case?
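
For reference, a minimal sketch of how batch-wise reading might look. It assumes `table` is an Ibis table expression from a connected backend (for example ClickHouse); the helper names `consume_in_batches` and `stream_table` and the `handle` callback are hypothetical and only serve the illustration. Only `to_pandas_batches()` and its `chunk_size` keyword come from Ibis itself.

```python
from typing import Any, Callable, Iterable


def consume_in_batches(batches: Iterable[Any], handle: Callable[[Any], None]) -> int:
    """Pass each batch to `handle` one at a time and return the total row count.

    Only one batch is referenced at a time, so memory use is bounded by the
    batch size rather than by the size of the full result.
    """
    total_rows = 0
    for batch in batches:
        handle(batch)  # e.g. write to another store, or run one training step
        total_rows += len(batch)
    return total_rows


def stream_table(table: Any, handle: Callable[[Any], None], chunk_size: int = 100_000) -> int:
    """Stream an Ibis table expression in chunks (hypothetical helper).

    Assumes `table` exposes `to_pandas_batches(chunk_size=...)`, as Ibis table
    expressions do; each yielded item is then a pandas DataFrame.
    """
    return consume_in_batches(table.to_pandas_batches(chunk_size=chunk_size), handle)
```

Swapping in `table.to_pyarrow_batches(chunk_size=chunk_size)` should work the same way if Arrow record batches are preferred over DataFrames, since both kinds of batch support `len()`.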

we're also thinking about efficient handoff to ML training from Ibis in the IbisML project (https://github.com/ibis-project/ibisml)

lostmygithubaccount, Jan 26 '24 13:01

Closing this as resolved - we already have to_pyarrow_batches() and to_pandas_batches(). If there's need for other methods, please open a specific request in the future.

jcrist, Aug 14 '24 18:08