feat(api): add `FileTable`

Open lidavidm opened this issue 3 years ago • 4 comments

This is intended to model a "table" that is actually a collection of files (local or remote), a pattern that is more common in things that look like "query engines" (e.g. Substrait, Acero, Pandas).

Substrait integration is the specific purpose, but the definition here is much simpler than Substrait's. In particular, globs are not modeled, and all files are assumed to be of the same type.
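
To make the simplified model concrete, here is a minimal sketch of the data it would carry: an explicit list of files, one shared format, and no glob support. Every name in it (`FileTableSketch`, `GLOB_CHARS`, the field names) is hypothetical and chosen for illustration; it is not the ibis API or the code in this PR.

```python
from dataclasses import dataclass
from typing import Mapping, Tuple

# Hypothetical names throughout: this is not the ibis API or the code in this PR.
GLOB_CHARS = frozenset("*?[")


@dataclass(frozen=True)
class FileTableSketch:
    """A table backed by an explicit list of files that all share one format."""

    name: str
    schema: Mapping[str, str]  # column name -> type name
    paths: Tuple[str, ...]     # local paths or remote URIs, listed explicitly
    file_format: str           # one format for every file, e.g. "parquet"

    def __post_init__(self):
        if not self.paths:
            raise ValueError("a file table needs at least one file")
        for path in self.paths:
            # Globs are not modeled, so reject anything that looks like one.
            if GLOB_CHARS & set(path):
                raise ValueError(f"globs are not modeled: {path!r}")


table = FileTableSketch(
    name="events",
    schema={"id": "int64", "ts": "timestamp"},
    paths=("data/part-0.parquet", "s3://bucket/part-1.parquet"),
    file_format="parquet",
)
```

Keeping a single `file_format` field on the table, rather than one per file, mirrors the assumption that all files are of the same type.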

lidavidm • Jul 29 '22 17:07

Would it be useful to implement this in a backend (e.g. DuckDB)? Effectively it would differ by not defining a view, instead inlining the table definition into the final query (so not really a difference to the user).

lidavidm • Jul 29 '22 18:07

Test Results

6 files   6 suites   3m 14s :stopwatch:
3 121 tests   3 047 :heavy_check_mark:   74 :zzz:   0 :x:
18 726 runs   18 282 :heavy_check_mark:   444 :zzz:   0 :x:

Results for commit c37beb5f.

:recycle: This comment has been updated with latest results.

github-actions[bot] • Jul 29 '22 18:07

Codecov Report

Merging #4293 (c37beb5) into master (3fe3fd8) will increase coverage by 10.94%. The diff coverage is 70.37%.

@@             Coverage Diff             @@
##           master    #4293       +/-   ##
===========================================
+ Coverage   81.59%   92.54%   +10.94%     
===========================================
  Files         180      180               
  Lines       20352    20433       +81     
  Branches     2905     2927       +22     
===========================================
+ Hits        16606    18909     +2303     
+ Misses       3345     1149     -2196     
+ Partials      401      375       -26     

| Impacted Files | Coverage Δ |
|---|---|
| `ibis/backends/pandas/execution/generic.py` | 89.12% <39.28%> (-2.24%) :arrow_down: |
| `ibis/backends/pandas/__init__.py` | 80.50% <66.66%> (-2.02%) :arrow_down: |
| `ibis/expr/operations/relations.py` | 97.43% <91.66%> (+6.17%) :arrow_up: |
| `ibis/expr/format.py` | 92.97% <100.00%> (+3.39%) :arrow_up: |
| `ibis/expr/rules.py` | 90.42% <100.00%> (+14.51%) :arrow_up: |
| `ibis/backends/base/sql/alchemy/registry.py` | 94.05% <0.00%> (+0.69%) :arrow_up: |
| `ibis/expr/operations/generic.py` | 95.12% <0.00%> (+0.81%) :arrow_up: |
| `ibis/backends/base/__init__.py` | 83.41% <0.00%> (+1.00%) :arrow_up: |
| `ibis/expr/types/strings.py` | 92.71% <0.00%> (+1.98%) :arrow_up: |
| `ibis/expr/types/numeric.py` | 99.19% <0.00%> (+2.40%) :arrow_up: |
| ... and 55 more | |

codecov[bot] • Jul 29 '22 18:07

> Would it be useful to implement this in a backend (e.g. DuckDB)? Effectively it would differ by not defining a view, instead inlining the table definition into the final query (so not really a difference to the user).

We should definitely try to integrate it with at least one of the file-based backends to see how well it would fit. DuckDB is a good candidate, but DataFusion and pandas should support these operations as well.

kszucs • Jul 29 '22 21:07

I don't have time to push this forward right now; I'll re-open once I do.

lidavidm • Oct 07 '22 17:10