Tables.jl icon indicating copy to clipboard operation
Tables.jl copied to clipboard

Use columntable in datavaluerows for col sources

Open davidanthoff opened this issue 2 years ago • 6 comments

@quinnj, the idea here would be that https://github.com/JuliaData/DataFrames.jl/blob/b1fdb7621fa38c381dad928317ec1de562d09e71/src/other/tables.jl#L93 could be rewritten as

IteratorInterfaceExtensions.getiterator(df::AbstractDataFrame) = Tables.datavaluerows(df)

But I'm not entirely sure this is right :) I think this PR only makes sense if the idea is that for any table that is primarily column based, one should first create a columntable before one calls the datavaluerows function. Is that so? I don't really understand the underlying mechanics there.

The benefit of this PR would be that now the instructions for any Tables.jl source that wants to also enable TableTraits.jl integration would be a bit simpler, i.e. the instructions would be to just always add

IteratorInterfaceExtensions.getiterator(x::MyType) = Tables.datavaluerows(x)
IteratorInterfaceExtensions.isiterable(::MyType) = true
TableTraits.isiterabletable(::MyType) = true

and that would work regardless of whether the source internally stores things as columns or rows.

davidanthoff avatar Apr 06 '22 00:04 davidanthoff

Actually, just out of curiosity: why is the call to Tables.columntable needed in https://github.com/JuliaData/DataFrames.jl/blob/b1fdb7621fa38c381dad928317ec1de562d09e71/src/other/tables.jl#L94 in the currently shipping version?

davidanthoff avatar Apr 06 '22 00:04 davidanthoff

Codecov Report

Merging #279 (3ddc401) into main (138c5be) will not change coverage. The diff coverage is 100.00%.

@@           Coverage Diff           @@
##             main     #279   +/-   ##
=======================================
  Coverage   94.88%   94.88%           
=======================================
  Files           7        7           
  Lines         665      665           
=======================================
  Hits          631      631           
  Misses         34       34           
Impacted Files Coverage Δ
src/tofromdatavalues.jl 100.00% <100.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more Δ = absolute <relative> (impact), ø = not affected, ? = missing data Powered by Codecov. Last update 138c5be...3ddc401. Read the comment docs.

codecov[bot] avatar Apr 06 '22 00:04 codecov[bot]

why is the call to Tables.columntable needed

I have not implemented it, but it turns type-unstable to type-stable container.

bkamins avatar Apr 06 '22 08:04 bkamins

Actually, just out of curiosity: why is the call to Tables.columntable needed in https://github.com/JuliaData/DataFrames.jl/blob/b1fdb7621fa38c381dad928317ec1de562d09e71/src/other/tables.jl#L94 in the currently shipping version?

This is necessary for performance in Query.jl, where QueryOperators expect type-stable rows. At one point, we had the case where we were calling NamedTuple(::DataFrameRow) and it was too expensive and hurt performance in Query framework. We then had the discussion that Query.jl is really only geared towards narrow tables that can return something fully type-stable from getiterator (i.e. concrete eltype), so hence the change to call Tables.columntable.

quinnj avatar Apr 09 '22 17:04 quinnj

@quinnj one more question: in the case where Tables.columnaccess(x)==false, this PR now would essentially amount to

    r = Tables.rows(x)
    s = Tables.schema(r)
    s === nothing && error("Schemaless sources cannot be passed to datavaluerows.")
    return DataValueRowIterator{datavaluenamedtuple(s), typeof(s), typeof(r)}(r)

but rereading your comment above, that would probably be type-instable, right?

The scenario that I have in the back of my mind is row wise reading from CSV.jl in combination with Query.jl. It would be nice if something like

CSV.Rows(filename) |> @filter(...) |> @select(...) |> DataFrame

would just work out of the box. My thinking had been that if we merge this PR here and then add:

IteratorInterfaceExtensions.getiterator(x::CSV.Rows) = Tables.datavaluerows(x)
IteratorInterfaceExtensions.isiterable(::CSV.Rows) = true
TableTraits.isiterabletable(::CSV.Rows) = true

it would all work, but now I'm thinking that we might end up with a type instable situation, right?

Is there some equivalent of Tables.columntable that we could use in the row iterating case that doesn't materialize all the columns, but just provides an automatic type stable named tuple iterator around a row wise source like CSV.Rows?

davidanthoff avatar Apr 10 '22 19:04 davidanthoff

Is there some equivalent of Tables.columntable that we could use in the row iterating case that doesn't materialize all the columns, but just provides an automatic type stable named tuple iterator around a row wise source like CSV.Rows?

I think for the rows case, we just want to wrap them as-is with IteratorInterfaceExtensions.getiterator(x::CSV.Rows) = Tables.datavaluerows(x) as you specified above; the point is that DataValueRowIterator is type-stable and turns any Tables.jl-compatible "row" object into a fully-typed NamedTuple w/ DataValue for missing. So I think this is PR is fine as-is. We could probably do some performance testing to confirm, but I believe that would be essentially the same perf testing I did originally when adding this functionality to Tables.jl to cover the Query.jl cases for convenience. Really the main value of this PR is allow column-based sources to more automatically be type-stable.

quinnj avatar Apr 14 '22 04:04 quinnj