
Make a convenience function to register a single `RecordBatch` as a table from SessionContext

Open alamb opened this issue 3 years ago • 2 comments

Is your feature request related to a problem or challenge? Please describe what you are trying to do.

For testing (and for those new to DataFusion), it would be very convenient to run a query against a single `RecordBatch` registered as a table. This can be done with `MemTable`, but that adds boilerplate and cognitive overhead to using DataFusion.

For example, `MemTable` allows defining multiple "streams" of record batches in multiple partitions (this is what the `Vec<Vec<_>>` is all about).

However, most of the uses of MemTable in DataFusion (see query) look like this:

    let table1 = MemTable::try_new(schema1, vec![vec![batch1]])?;
    ctx.register_table("aa", Arc::new(table1))?;

https://docs.rs/datafusion/11.0.0/datafusion/execution/context/struct.SessionContext.html#method.register_table

Describe the solution you'd like

What I would like is a function on `SessionContext` that does that work:

    impl SessionContext {
        // ...

        /// Registers the RecordBatch as the specified table name
        fn register_batch(&self, table_name: &str, batch: RecordBatch) -> Result<Option<Arc<dyn TableProvider>>> {
            // make a MemTable here, register it, and return the result of the registration
            todo!()
        }
    }
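To make the intended behavior concrete, here is a minimal sketch in plain Rust that models the proposal without depending on DataFusion. `Batch`, `Table`, and `Context` are hypothetical stand-ins for `RecordBatch`, `MemTable`, and `SessionContext`, not the real types, and the `&mut self` receivers are a simplification of this model. It only illustrates the wrapping step (`vec![vec![batch]]`) and the "return the previously registered table, if any" shape of the return value.

```rust
use std::collections::HashMap;
use std::sync::Arc;

/// Hypothetical stand-in for arrow's `RecordBatch`: a single column of values.
#[derive(Debug, Clone, PartialEq)]
pub struct Batch {
    pub values: Vec<i64>,
}

/// Hypothetical stand-in for `MemTable`.
/// Outer Vec = partitions, inner Vec = the batches streamed from one partition.
#[derive(Debug)]
pub struct Table {
    pub partitions: Vec<Vec<Batch>>,
}

/// Hypothetical stand-in for the table registry held by `SessionContext`.
#[derive(Default)]
pub struct Context {
    tables: HashMap<String, Arc<Table>>,
}

impl Context {
    /// Models `register_table`: stores the table under `name` and returns
    /// whatever was previously registered under that name, if anything.
    pub fn register_table(&mut self, name: &str, table: Arc<Table>) -> Option<Arc<Table>> {
        self.tables.insert(name.to_string(), table)
    }

    /// Models the proposed convenience function: a single batch becomes
    /// one partition containing one batch, then goes through `register_table`.
    pub fn register_batch(&mut self, name: &str, batch: Batch) -> Option<Arc<Table>> {
        let table = Table {
            partitions: vec![vec![batch]],
        };
        self.register_table(name, Arc::new(table))
    }

    /// Looks up a registered table by name.
    pub fn table(&self, name: &str) -> Option<Arc<Table>> {
        self.tables.get(name).cloned()
    }
}
```

In this model, calling `register_batch("aa", batch)` on a fresh `Context` returns `None` and leaves a table with exactly one partition holding one batch; registering a second batch under the same name returns the first table. A real implementation would presumably build a `MemTable` from the batch's schema and delegate to the existing `register_table`, as in the two-line pattern quoted earlier.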

And then replace the uses of `MemTable` in the DataFusion tests with this new function.

Describe alternatives you've considered

A clear and concise description of any alternative solutions or features you've considered.

Additional context

I think this is a good first issue, as the code already exists and the changes would be largely mechanical. Updating the codebase, examples, and documentation would be a good exercise for someone looking to get more experience with DataFusion.

alamb • Sep 09 '22 16:09

Nice. `tpch.rs`, for example, does something like this:

https://github.com/apache/arrow-datafusion/blob/master/benchmarks/src/bin/tpch.rs#L1044

Maybe also add a corresponding `read_batch()` function for the same use case when no table is needed, just a new DataFrame from the `RecordBatch`?

kmitchener • Sep 09 '22 17:09

I agree, a `read_batch` would also be helpful.

alamb • Sep 09 '22 17:09