Proposal for Implementing Streaming Query Support in chDB

Open • wudidapaopao opened this issue 7 months ago • 1 comment

Currently, chDB executes queries by fetching the entire result set at once through the query_conn interface. For large datasets, this can lead to high memory usage and high latency. To address this, we propose adding streaming query capabilities to chDB.

The existing LocalServer in chDB initializes the execution engine via Connection::sendQuery and retrieves all results in one go using receiveResult, storing them in WriteBufferFromVector.

Proposed Changes

  1. chDB Interface Modifications
     - New send_query Interface: Introduce a send_query method to initialize a streaming query. This method returns a stream_local_result object with a fetch method.
     - fetch Method in stream_local_result: Each call to fetch returns a single row (or a chunk) in the specified format (e.g., JSON, Arrow), enabling incremental data consumption.
  2. LocalServer (ClientBase) Adjustments
     - Deferred Result Retrieval: During the first initialization, only call Connection::sendQuery to set up the execution engine without fetching results immediately.
     - On-Demand receiveResult Calls: When fetch is invoked, trigger receiveResult to retrieve a chunk of data. Once the chunk is exhausted, call receiveResult again for the next chunk.
     - Handling Blocking: If receiveResult is not called for an extended period, the execution engine may block.
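
To make the intended call pattern concrete, here is a minimal pure-Python sketch of the interface described above. It does not use chDB itself: `StreamLocalResult`, `fake_engine`, and the `send_query` signature shown here are hypothetical stand-ins, and a Python generator plays the role of the execution engine plus `receiveResult`. It only illustrates deferred setup and on-demand chunk retrieval, not how the real implementation would look.

```python
import json
from typing import Iterator, List, Optional


class StreamLocalResult:
    """Hypothetical handle returned by send_query(); one chunk per fetch()."""

    def __init__(self, chunks: Iterator[List[dict]]):
        # Lazy chunk source; stands in for repeated receiveResult calls.
        self._chunks = chunks

    def fetch(self) -> Optional[str]:
        """Return the next chunk as JSON Lines text, or None when exhausted."""
        chunk = next(self._chunks, None)
        if chunk is None:
            return None
        return "\n".join(json.dumps(row) for row in chunk)


def fake_engine(total_rows: int, chunk_size: int, log: List[str]) -> Iterator[List[dict]]:
    """Stand-in for the execution engine: produces a chunk only when asked."""
    for start in range(0, total_rows, chunk_size):
        stop = min(start + chunk_size, total_rows)
        log.append(f"produced rows {start}..{stop - 1}")
        yield [{"n": i} for i in range(start, stop)]


def send_query(total_rows: int, chunk_size: int, log: List[str]) -> StreamLocalResult:
    """Hypothetical send_query: sets up the query but retrieves nothing yet."""
    return StreamLocalResult(fake_engine(total_rows, chunk_size, log))


if __name__ == "__main__":
    log: List[str] = []
    result = send_query(total_rows=5, chunk_size=2, log=log)
    print("after send_query:", log)  # no chunk has been produced at this point
    while (chunk := result.fetch()) is not None:
        print(chunk)
    print("after draining:", log)
```

In this sketch the generator simply pauses between `fetch` calls, which is loosely analogous to the engine waiting (and possibly blocking) until `receiveResult` is called again. Only one chunk is held in memory at a time, rather than the whole result set.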

The proposal can also address https://github.com/chdb-io/chdb/issues/265

wudidapaopao, Apr 15 '25 14:04

BTW, https://github.com/timeplus-io/proton is a streaming SQL engine (like Apache Flink) built on the ClickHouse codebase. New results can be pushed to the client via HTTP/TCP.

jovezhong, Apr 16 '25 01:04