Difficulty using the stream_arrow API
The stream_arrow API on Statement is rather clunky to use, since it requires a SchemaRef argument. I'm not quite sure why query_arrow doesn't require this, but stream_arrow does.
Additionally, it's not documented what the differences between the APIs are. I would presume that the stream_arrow API is going to be more memory efficient (as best I can tell at a glance, query_arrow loads everything into memory and doesn't release till it's done), but this isn't explicitly documented anywhere.
What I'm currently doing to hack around manually specifying a schema is this:
fn fetch_schema_for_query(db: &Connection, sql: &str) -> duckdb::Result<SchemaRef> {
// Append "LIMIT 0" so DuckDB returns the same columns but no rows.
let schema_sql = format!("{} LIMIT 0", sql);
let mut statement = db.prepare(&schema_sql)?;
let arrow_result = statement.query_arrow([])?;
Ok(arrow_result.get_schema())
}
It's extremely hackish in my opinion, but it does work. I feel like I'm either missing something obvious in the API or else it's just not very ergonomic. Any ideas?
I can also confirm (for any googlers landing here) that the stream_arrow API uses significantly less RAM while (in my use case) having roughly equivalent throughput, so it's worth considering if your dataset might not fit in RAM / you'd like to do other things at the same time.
It seems that statement.execute([]) has the same effect as the above and is probably less fragile. Though both approaches do seem to end up using some amount of RAM that's proportional with the result set size. Any insights as to why that is?
The
stream_arrowAPI onStatementis rather clunky to use, since it requires aSchemaRefargument. I'm not quite sure whyquery_arrowdoesn't require this, butstream_arrowdoes.Additionally, it's not documented what the differences between the APIs are. I would presume that the
stream_arrowAPI is going to be more memory efficient (as best I can tell at a glance,query_arrowloads everything into memory and doesn't release till it's done), but this isn't explicitly documented anywhere.
You might want to look over the PR that adds the API.