[EPIC] ListingTable object store usage improvements
This is a list of improvements we are working on in ListingTable in DataFusion
Background
DataFusion has a ListingTable that handles reading tables stored in one or more files in a "hive partitioned" directory structure.
So for example, given files like this:
/path/to/my_table/file1.parquet
/path/to/my_table/file2.parquet
/path/to/my_table/file3.parquet
You can create a table with a command like
CREATE EXTERNAL TABLE my_table
STORED AS PARQUET
LOCATION '/path/to/my_table'
The ListingTable then handles inferring the schema and running queries against those files as though they were a single table.
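The example files above sit directly under the table directory, so they carry no partition values. As a rough illustration of what "hive partitioned" means, the sketch below pulls `key=value` directory segments out of a file path. This is not DataFusion's implementation: the `parse_hive_partitions` helper and the `year=`/`month=` directories are hypothetical and exist only for this example.

```python
from pathlib import PurePosixPath


def parse_hive_partitions(table_root: str, file_path: str) -> dict[str, str]:
    """Return the key=value directory segments between table_root and the file.

    Hypothetical helper for illustration only; not part of DataFusion.
    """
    relative = PurePosixPath(file_path).relative_to(PurePosixPath(table_root))
    partitions = {}
    # Every component except the last one (the file name) is a directory
    # that may encode a partition column as key=value.
    for segment in relative.parts[:-1]:
        key, sep, value = segment.partition("=")
        if sep and key:
            partitions[key] = value
    return partitions


# Assumed layout: the year=/month= directories are made up for this sketch.
files = [
    "/path/to/my_table/year=2024/month=01/file1.parquet",
    "/path/to/my_table/year=2024/month=02/file2.parquet",
    "/path/to/my_table/file3.parquet",
]
for f in files:
    print(f, parse_hive_partitions("/path/to/my_table", f))
```

In a layout like this, a filter on `year` or `month` can be answered from the directory names alone, which is part of why the number and shape of object store list requests matters for this kind of table.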
Team
- @BlakeOrth
- @alamb (maintainer)
Bugs
- [x] https://github.com/apache/datafusion/issues/17212
- [x] https://github.com/apache/datafusion/issues/17049
Enhancements
- [x] https://github.com/apache/datafusion/issues/17000
- [x] https://github.com/apache/datafusion/issues/16302
- [x] https://github.com/apache/datafusion/issues/17207
- [x] https://github.com/apache/datafusion/issues/18119
- [x] https://github.com/apache/datafusion/issues/18118
- [x] https://github.com/apache/datafusion/issues/17211
- [ ] https://github.com/apache/datafusion/issues/19217
- [x] https://github.com/apache/datafusion/issues/19074
- [ ] https://github.com/apache/datafusion/issues/18827
- [ ] https://github.com/apache/datafusion/issues/19056
- [ ] https://github.com/apache/datafusion/issues/9654
- [x] https://github.com/apache/datafusion/issues/18952
- [x] https://github.com/apache/datafusion/issues/18953
- [ ] https://github.com/apache/datafusion/issues/19052
- [ ] https://github.com/apache/datafusion/issues/19055
- [ ] https://github.com/apache/datafusion/issues/18138
I added myself and @BlakeOrth as people working on this epic
A brief update here is that I think the next steps for this project are:
- Polish up https://github.com/apache/datafusion/issues/17207 so we can test / observe the improvements
- Complete https://github.com/apache/datafusion/issues/17211
@BlakeOrth has prototype PRs for both, and they need some help polishing / testing
@alamb I haven't pushed and opened a PR for #17211 yet, but I would be happy to do so if we want to start getting some feedback on the implementation. I actually think that code is more or less ready to go at this point, but I don't want to overload anyone on reviews when we still need #17207 to test/validate those changes.
I will comment in #17207 and cc you there so we can choose how to continue moving forward with that effort.
This all comes with a caveat: I am currently focusing on some other efforts that unfortunately skyrocketed to the top of my priorities, so while I am more than happy to continue contributing to this effort, I am unlikely to do so for the next week or so.
Thanks @BlakeOrth -- what I am secretly (well, not so secretly anymore) hoping is that we have this ready for DataFusion 51, which won't be until Nov, so we have some time
- #17558
If ListingTable is undergoing some work, please could this bug be looked at: https://github.com/apache/datafusion/issues/15964? ListingTable assumes objects come from a singular object store implementation.
Thanks @m09526 -- I would be happy to help review a PR for such an improvement
@alamb Can you add this ticket to the enhancements checklist for us?
- https://github.com/apache/datafusion/issues/18827
@alamb Can you add this ticket to the enhancements checklist for us?
- https://github.com/apache/datafusion/issues/19273