Tables.jl icon indicating copy to clipboard operation
Tables.jl copied to clipboard

add getrows

Open CarloLucibello opened this issue 2 years ago • 8 comments

as per comment https://github.com/JuliaData/Tables.jl/pull/278#issuecomment-1132706343

@quinnj @bkamins @nalimilan @ablaom @ToucheSir

CarloLucibello avatar Jun 28 '22 06:06 CarloLucibello

This PR should probably provide appropriate implementations for rowtable, columntable, dictrowtable, dictcolumntable, table as these are defined in Tables.jl (an additional benefit of doing this will be that package developers will see how to properly implement getrows for their own types).

Also probably it would be good to bump version number so that it can be released after merging.

bkamins avatar Jun 28 '22 07:06 bkamins

Codecov Report

Merging #284 (76f1375) into main (60c080c) will increase coverage by 0.06%. The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main     #284      +/-   ##
==========================================
+ Coverage   94.88%   94.94%   +0.06%     
==========================================
  Files           7        7              
  Lines         665      673       +8     
==========================================
+ Hits          631      639       +8     
  Misses         34       34              
Impacted Files Coverage Δ
src/Tables.jl 88.00% <100.00%> (ø)
src/namedtuples.jl 100.00% <100.00%> (ø)
src/matrix.jl 100.00% <0.00%> (ø)
src/fallbacks.jl 98.60% <0.00%> (+<0.01%) :arrow_up:

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

codecov[bot] avatar Jun 28 '22 07:06 codecov[bot]

This PR should probably provide appropriate implementations for rowtable, columntable, dictrowtable, dictcolumntable, table as these are defined in Tables.jl (an additional benefit of doing this will be that package developers will see how to properly implement getrows for their own types).

I'm totally unfamiliar with this package so I was hoping to just get the stub in to get things started. I can have a look around and see if I can manage to write those implementations as well.

CarloLucibello avatar Jun 28 '22 07:06 CarloLucibello

I can have a look around and see if I can manage to write those implementations as well.

These should not be that hard. Of course we can do it later.

@quinnj, @nalimilan - do you think we can have some default fallback implementation of this method?

bkamins avatar Jun 28 '22 07:06 bkamins

Done! Can you unlock CI?

CarloLucibello avatar Jun 30 '22 06:06 CarloLucibello

good to go?

CarloLucibello avatar Jun 30 '22 15:06 CarloLucibello

@quinnj, @nalimilan - do you think we can have some default fallback implementation of this method?

Indeed, why not have getrows return a RowTable for row-oriented tables and a ColumnTable for column-oriented tables? For row-oriented tables, a simple solution would be to call Tables.rows, ensure the result is a vector (and if not call collect on it), and then call getindex or view on it. For column-oriented tables, call Tables.columns, and for each column ensure it is a vector (calling collect on it if necessary) and call getindex or view on it. In both cases, instead of calling collect it would be better in term of memory use to allocate a vector of the final size and fill it on the fly, but it's more tricky to do so maybe not worth it for a first fallback implementation.

EDIT: actually what I described is just calling getrows(Tables.rowtable(tbl), ...)) when Table.isrowtable(tbl) and getrows(Tables.columntable(tbl), ...)) otherwise. So that's quite trivial to implement.

nalimilan avatar Jul 01 '22 20:07 nalimilan

@quinnj 🙏🏾

ablaom avatar Aug 02 '22 21:08 ablaom

I have left one comment about allowed inds + Tables.jl version needs to be bumped.

bkamins avatar Aug 28 '22 14:08 bkamins

@quinnj What about throwing an error as discussed above?

nalimilan avatar Aug 30 '22 07:08 nalimilan

I agree with @nalimilan that it is better to throw an error - at least for the reference implementations (so users, by looking at the code can see what is expected).

As I have commented above - I think the simplest thing to check is if the returned object is RowTable or ColumnTable respectively.

bkamins avatar Aug 30 '22 08:08 bkamins

Ok, but here's an issue I've found digging into this further (and introducing the error checks mentioned); the current implementation of getrows for ColumnTable doesn't follow the getrows docs, since it returns a ColumnTable instead of return an indexable object of rows..

The point of this is we always get either a single row or collection of rows, right? Even if the input is column-oriented? Or should we update the docs and say if inds is a collection, you get a subset of original input, row or column oriented.

quinnj avatar Sep 05 '22 22:09 quinnj

And indeed, @nalimilan pointed this exact issue out here.

One idea is that we could provide a generic ColumnsRows struct that would basically be a wrapper for any result of Tables.columns that provided an "indexable collection of rows" interface to the underlying columns (essentially iterating ColumnsRow that we already use in the generic Tables.rows fallback for column oriented tables).

I might take a stab at this approach and see what it looks like.

quinnj avatar Sep 05 '22 22:09 quinnj

Or should we update the docs and say if inds is a collection, you get a subset of original input, row or column oriented.

This is what I think we should do. For example for DataFrames.jl I imagine that we will have the following implementation:

Tables.getrows(df::AbstractDataFrame, inds; view=nothing) = view === true ? view(df, inds, :) : df[inds, :]

as it seems most natural (and in general - it will be more efficient for a given table type to decide what to do; I assume that people will expect Tables.getrows to be efficient).


I would set the issue of "indexable collection of rows" and "indexable collection of columns" (for completeness as separate case) since it is orthogonal to the Tables.getrows design. I think we should provide such an API (again, in DataFrames.jl this API is provided by the eachrow and eachcol functions, but we could choose some other names here, as in Base Julia eachrow and eachcol were added later and they only support iteration interface but not indexing interface).

bkamins avatar Sep 06 '22 06:09 bkamins

So given that we want this to be more of a general "subset" function, I'd like to change the name from getrows. 1) I think Tables.getrows is too similar to Tables.rows and it seems hard to explain the difference since they seem so similar and 2) for column-oriented tables, you're not really getting "rows" back, but you're just getting a subset of the original table.

I'm going to propose we call this Tables.subset to be more clear.

quinnj avatar Sep 06 '22 16:09 quinnj

Tables.subset is very good for me as it is consistent in terminology with DataFrames.jl :). @ablaom @ToucheSir @CarloLucibello - are you OK with this?

Alternatively we could have 2 functions:

  • Table.getrow to get a single row (and it would accept a single index)
  • Table.subset to get a subset of a table and return a table (and it would expect a collection of indices)

bkamins avatar Sep 06 '22 16:09 bkamins

Ok, cleaned up implementation here: https://github.com/JuliaData/Tables.jl/pull/292

quinnj avatar Sep 06 '22 16:09 quinnj

Tables.subset is a reasonable choice, fine for me

CarloLucibello avatar Sep 06 '22 17:09 CarloLucibello

Mmm. I'm not super happy with removing "rows" from the the name. If I'm new to tables in julia, "subset" could mean anything. I similarly never liked name like "select" . Select what? I want to know which axis I'm slicing on.

How about rowsubset ?

ablaom avatar Sep 06 '22 21:09 ablaom

But that's kind of the point of our discussion/decision above: this function isn't necessarily returning "rows" if the original table input is column-oriented; rather, it returns a "subset" of the original table, preserving whether the original table was column or row-oriented. Thus, Tables.subset (note the Tables. prefix!) seems clearest to what the function is actually doing.

quinnj avatar Sep 06 '22 22:09 quinnj

Yes, I get that it is not returning "rows" in the sense of not returning a row iterator. But I'm not sure general users equate "rows" with "row iterator", maybe just Tables.jl developers do that 😉 .

I also appreciate these name decisions are very difficult to get right and to get consensus on. Just putting in my two cents.

And I agree that aligning with DataFrames - which most users will know well - makes sense.

ablaom avatar Sep 06 '22 23:09 ablaom

One could always go maximally explicit with something like Tables.subset_along_row_dim, but I imagine folks would not be terribly enthused about that name :P

ToucheSir avatar Sep 07 '22 14:09 ToucheSir