Make a convenience function to register a single `RecordBatch` as a table from SessionContext
Is your feature request related to a problem or challenge? Please describe what you are trying to do.
For testing (and for those new to DataFusion), it would be very convenient to run a query against a single RecordBatch as a table. This can be done with MemTable, but doing so adds boilerplate and cognitive overhead to using DataFusion.
For example, MemTable allows defining multiple "streams" of record batches in multiple partitions (this is what the Vec<Vec<_>> is all about).
However, most of the uses of MemTable in DataFusion (see query) look like this:
let table1 = MemTable::try_new(schema1, vec![vec![batch1]])?;
ctx.register_table("aa", Arc::new(table1))?;
https://docs.rs/datafusion/11.0.0/datafusion/execution/context/struct.SessionContext.html#method.register_table
Describe the solution you'd like
What I would like is a function on SessionContext that does that work:
impl SessionContext {
    ...
    /// Registers the RecordBatch as the specified table name
    fn register_batch(&self, table_name: &str, batch: RecordBatch) -> Result<Option<Arc<dyn TableProvider>>> {
        // make a MemTable with a single partition holding `batch`, register it under
        // `table_name`, and return whatever register_table returns
        todo!()
    }
}
And then replace the uses of MemTable in the DataFusion tests with this new function.
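To make the shape of the proposal concrete, here is a minimal, std-only sketch of the wrapping pattern. It does not use DataFusion: `Batch`, `InMemoryTable`, and `Context` are hypothetical stand-ins for RecordBatch, MemTable, and SessionContext, and their behavior (for example, replacing an existing table on re-registration) is an assumption of this model rather than a statement about the real API. The point is only that the outer Vec is the list of partitions, the inner Vec is the batches within one partition, and the convenience function fills in the single-partition, single-batch case.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Stand-in for a record batch (NOT the Arrow type): column names plus rows.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub columns: Vec<String>,
    pub rows: Vec<Vec<i64>>,
}

/// Stand-in for an in-memory table.
/// Outer Vec = partitions, inner Vec = batches within one partition.
#[derive(Debug)]
pub struct InMemoryTable {
    pub columns: Vec<String>,
    pub partitions: Vec<Vec<Batch>>,
}

impl InMemoryTable {
    /// Builds a table, rejecting any batch whose columns differ from the table's.
    pub fn try_new(columns: Vec<String>, partitions: Vec<Vec<Batch>>) -> Result<Self, String> {
        for batch in partitions.iter().flatten() {
            if batch.columns != columns {
                return Err(format!("schema mismatch: {:?} vs {:?}", batch.columns, columns));
            }
        }
        Ok(Self { columns, partitions })
    }

    /// Total number of rows across all partitions and batches.
    pub fn num_rows(&self) -> usize {
        self.partitions.iter().flatten().map(|b| b.rows.len()).sum()
    }
}

/// Stand-in for a session context: a name -> table registry.
#[derive(Default)]
pub struct Context {
    tables: HashMap<String, Arc<InMemoryTable>>,
}

impl Context {
    /// Registers a table; in this model, returns the table previously
    /// registered under the same name, if any.
    pub fn register_table(&mut self, name: &str, table: Arc<InMemoryTable>) -> Option<Arc<InMemoryTable>> {
        self.tables.insert(name.to_string(), table)
    }

    /// The proposed convenience: wrap one batch as a one-partition table
    /// (`vec![vec![batch]]`) and register it, so callers skip the boilerplate.
    pub fn register_batch(&mut self, name: &str, batch: Batch) -> Result<Option<Arc<InMemoryTable>>, String> {
        let table = InMemoryTable::try_new(batch.columns.clone(), vec![vec![batch]])?;
        Ok(self.register_table(name, Arc::new(table)))
    }

    /// Looks up a registered table by name.
    pub fn table(&self, name: &str) -> Option<Arc<InMemoryTable>> {
        self.tables.get(name).cloned()
    }
}
```

With this model, a test reduces to building a `Batch` and calling `ctx.register_batch("aa", batch)`; the nested `vec![vec![...]]` and the `Arc::new` wrapping stay inside the helper, which is the same simplification the real function would give to the DataFusion tests.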
Describe alternatives you've considered
Additional context
I think this is a good first issue because the code already exists and the changes would be largely mechanical. Updating the codebase, examples, and documentation would be a good exercise for someone who wants to get more experience with DataFusion.
Nice. tpch.rs for example does something like this.
https://github.com/apache/arrow-datafusion/blob/master/benchmarks/src/bin/tpch.rs#L1044
Maybe also add a corresponding read_batch() function for the same use case, for when the caller doesn't need a registered table and just wants a new DataFrame from the RecordBatch?
I agree a read_batch would also be helpful.