duckdb-rs icon indicating copy to clipboard operation
duckdb-rs copied to clipboard

Difficulty using the stream_arrow API

Open ianthetechie opened this issue 1 year ago • 3 comments

The stream_arrow API on Statement is rather clunky to use, since it requires a SchemaRef argument. I'm not quite sure why query_arrow doesn't require this, but stream_arrow does.

Additionally, it's not documented what the differences between the APIs are. I would presume that the stream_arrow API is going to be more memory efficient (as best I can tell at a glance, query_arrow loads everything into memory and doesn't release till it's done), but this isn't explicitly documented anywhere.

ianthetechie avatar Dec 30 '24 04:12 ianthetechie

What I'm currently doing to hack around manually specifying a schema is this:

fn fetch_schema_for_query(db: &Connection, sql: &str) -> duckdb::Result<SchemaRef> {
    // Append "LIMIT 0" so DuckDB returns the same columns but no rows.
    let schema_sql = format!("{} LIMIT 0", sql);

    let mut statement = db.prepare(&schema_sql)?;
    let arrow_result = statement.query_arrow([])?;

    Ok(arrow_result.get_schema())
}

It's extremely hackish in my opinion, but it does work. I feel like I'm either missing something obvious in the API or else it's just not very ergonomic. Any ideas?

I can also confirm (for any googlers landing here) that the stream_arrow API uses significantly less RAM while (in my use case) having roughly equivalent throughput, so it's worth considering if your dataset might not fit in RAM / you'd like to do other things at the same time.

ianthetechie avatar Dec 30 '24 06:12 ianthetechie

It seems that statement.execute([]) has the same effect as the above and is probably less fragile. Though both approaches do seem to end up using some amount of RAM that's proportional with the result set size. Any insights as to why that is?

ianthetechie avatar Dec 31 '24 01:12 ianthetechie

The stream_arrow API on Statement is rather clunky to use, since it requires a SchemaRef argument. I'm not quite sure why query_arrow doesn't require this, but stream_arrow does.

Additionally, it's not documented what the differences between the APIs are. I would presume that the stream_arrow API is going to be more memory efficient (as best I can tell at a glance, query_arrow loads everything into memory and doesn't release till it's done), but this isn't explicitly documented anywhere.

You might want to look over the PR that adds the API.

nicosuave avatar Jan 11 '25 17:01 nicosuave