[EPIC] ListingTable object store usage improvements

Open alamb opened this issue 4 months ago • 8 comments

This is a list of improvements we are working on in ListingTable in DataFusion

Background

DataFusion has a ListingTable that effectively reads tables stored in one or more files in a "hive partitioned" directory structure.

So, for example, given files like this:

/path/to/my_table/file1.parquet
/path/to/my_table/file2.parquet
/path/to/my_table/file3.parquet

You can create a table with a command like:

CREATE EXTERNAL TABLE my_table
STORED AS PARQUET
LOCATION '/path/to/my_table';

The ListingTable will then handle figuring out the schema and running queries against those files as though they were a single table.
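
To make the "hive partitioned" layout mentioned above a bit more concrete, here is a rough Python sketch of how partition values can be read out of `key=value` directory names under a table root. This only illustrates the directory convention and is not DataFusion's implementation. The `year=`/`month=` paths and the helper name `parse_hive_partitions` are hypothetical and made up for this example.

```python
from pathlib import PurePosixPath


def parse_hive_partitions(table_root: str, file_path: str) -> dict[str, str]:
    """Collect partition values encoded as key=value directories
    between table_root and the file name (hypothetical helper)."""
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(table_root))
    partitions = {}
    # Every component except the last one (the file name) may be a partition dir.
    for segment in relative.parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


# Hypothetical layout: two partitioned files and one unpartitioned file.
files = [
    "/path/to/my_table/year=2024/month=01/file1.parquet",
    "/path/to/my_table/year=2024/month=02/file2.parquet",
    "/path/to/my_table/file3.parquet",
]
for f in files:
    print(f, parse_hive_partitions("/path/to/my_table", f))
```

Files directly under the table root (like the three in the example above) simply have no partition values, so the helper returns an empty dict for them.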

Team

  • @BlakeOrth
  • @alamb (maintainer)

Bugs

  • [x] https://github.com/apache/datafusion/issues/17212
  • [x] https://github.com/apache/datafusion/issues/17049

Enhancements

  • [x] https://github.com/apache/datafusion/issues/17000
  • [x] https://github.com/apache/datafusion/issues/16302
  • [x] https://github.com/apache/datafusion/issues/17207
  • [x] https://github.com/apache/datafusion/issues/18119
  • [x] https://github.com/apache/datafusion/issues/18118
  • [x] https://github.com/apache/datafusion/issues/17211
  • [ ] https://github.com/apache/datafusion/issues/19217
  • [x] https://github.com/apache/datafusion/issues/19074
  • [ ] https://github.com/apache/datafusion/issues/18827
  • [ ] https://github.com/apache/datafusion/issues/19056
  • [ ] https://github.com/apache/datafusion/issues/9654
  • [x] https://github.com/apache/datafusion/issues/18952
  • [x] https://github.com/apache/datafusion/issues/18953
  • [ ] https://github.com/apache/datafusion/issues/19052
  • [ ] https://github.com/apache/datafusion/issues/19055
  • [ ] https://github.com/apache/datafusion/issues/18138

alamb, Aug 16 '25 13:08

I added myself and @BlakeOrth as people working on this epic

alamb, Sep 19 '25 16:09

A brief update here is that I think the next steps for this project are:

  1. Polish up https://github.com/apache/datafusion/issues/17207 so we can test / observe the improvements
  2. Complete https://github.com/apache/datafusion/issues/17211

@BlakeOrth has prototype PRs for both, and they need some help polishing / testing

alamb, Oct 02 '25 09:10

@alamb I haven't pushed and opened a PR for #17211 yet, but I would be happy to do so if we want to start getting some feedback on the implementation. I actually think that code is more or less ready to go at this point, but I don't want to overload anyone on reviews when we still need #17207 to test/validate those changes.

I will comment in #17207 and cc you there so we can choose how to continue moving forward with that effort.

This all comes with a subtle caveat that I am currently focusing on some other efforts that unfortunately skyrocketed to the top of my priorities, so while I am more than happy to continue contributing to this effort I will be unlikely to do so for about the next week or so.

BlakeOrth, Oct 02 '25 16:10

Thanks @BlakeOrth -- what I am secretly (well, not so secretly anymore) hoping is that we have this ready for DataFusion 51, which won't be released until Nov, so we have some time

  • #17558

alamb, Oct 03 '25 17:10

If ListingTable is undergoing some work, please could this bug be looked at: https://github.com/apache/datafusion/issues/15964? ListingTable assumes objects come from a singular object store implementation.

m09526, Nov 05 '25 14:11

> If ListingTable is undergoing some work, please could this bug be looked at: #15964? ListingTable assumes objects come from a singular object store implementation.

Thanks @m09526 -- I would be happy to help review a PR for such an improvement

alamb, Nov 05 '25 22:11

@alamb Can you add this ticket to the enhancements checklist for us?

  • https://github.com/apache/datafusion/issues/18827

BlakeOrth, Nov 19 '25 19:11

@alamb Can you add this ticket to the enhancements checklist for us?

  • https://github.com/apache/datafusion/issues/19273

BlakeOrth, Dec 11 '25 00:12