Tables.jl
Tables.jl copied to clipboard
add getrows
as per comment https://github.com/JuliaData/Tables.jl/pull/278#issuecomment-1132706343
@quinnj @bkamins @nalimilan @ablaom @ToucheSir
This PR should probably provide appropriate implementations for rowtable
, columntable
, dictrowtable
, dictcolumntable
, table
as these are defined in Tables.jl (an additional benefit of doing this will be that package developers will see how to properly implement getrows
for their own types).
Also probably it would be good to bump version number so that it can be released after merging.
Codecov Report
Merging #284 (76f1375) into main (60c080c) will increase coverage by
0.06%
. The diff coverage is100.00%
.
@@ Coverage Diff @@
## main #284 +/- ##
==========================================
+ Coverage 94.88% 94.94% +0.06%
==========================================
Files 7 7
Lines 665 673 +8
==========================================
+ Hits 631 639 +8
Misses 34 34
Impacted Files | Coverage Δ | |
---|---|---|
src/Tables.jl | 88.00% <100.00%> (ø) |
|
src/namedtuples.jl | 100.00% <100.00%> (ø) |
|
src/matrix.jl | 100.00% <0.00%> (ø) |
|
src/fallbacks.jl | 98.60% <0.00%> (+<0.01%) |
:arrow_up: |
Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.
This PR should probably provide appropriate implementations for
rowtable
,columntable
,dictrowtable
,dictcolumntable
,table
as these are defined in Tables.jl (an additional benefit of doing this will be that package developers will see how to properly implementgetrows
for their own types).
I'm totally unfamiliar with this package so I was hoping to just get the stub in to get things started. I can have a look around and see if I can manage to write those implementations as well.
I can have a look around and see if I can manage to write those implementations as well.
These should not be that hard. Of course we can do it later.
@quinnj, @nalimilan - do you think we can have some default fallback implementation of this method?
Done! Can you unlock CI?
good to go?
@quinnj, @nalimilan - do you think we can have some default fallback implementation of this method?
Indeed, why not have getrows
return a RowTable
for row-oriented tables and a ColumnTable
for column-oriented tables? For row-oriented tables, a simple solution would be to call Tables.rows
, ensure the result is a vector (and if not call collect
on it), and then call getindex
or view
on it. For column-oriented tables, call Tables.columns
, and for each column ensure it is a vector (calling collect
on it if necessary) and call getindex
or view
on it. In both cases, instead of calling collect
it would be better in term of memory use to allocate a vector of the final size and fill it on the fly, but it's more tricky to do so maybe not worth it for a first fallback implementation.
EDIT: actually what I described is just calling getrows(Tables.rowtable(tbl), ...))
when Table.isrowtable(tbl)
and getrows(Tables.columntable(tbl), ...))
otherwise. So that's quite trivial to implement.
@quinnj 🙏🏾
I have left one comment about allowed inds
+ Tables.jl version needs to be bumped.
@quinnj What about throwing an error as discussed above?
I agree with @nalimilan that it is better to throw an error - at least for the reference implementations (so users, by looking at the code can see what is expected).
As I have commented above - I think the simplest thing to check is if the returned object is RowTable
or ColumnTable
respectively.
Ok, but here's an issue I've found digging into this further (and introducing the error checks mentioned); the current implementation of getrows
for ColumnTable
doesn't follow the getrows
docs, since it returns a ColumnTable
instead of return an indexable object of rows.
.
The point of this is we always get either a single row or collection of rows, right? Even if the input is column-oriented? Or should we update the docs and say if inds
is a collection, you get a subset of original input, row or column oriented.
And indeed, @nalimilan pointed this exact issue out here.
One idea is that we could provide a generic ColumnsRows
struct that would basically be a wrapper for any result of Tables.columns
that provided an "indexable collection of rows" interface to the underlying columns (essentially iterating ColumnsRow
that we already use in the generic Tables.rows
fallback for column oriented tables).
I might take a stab at this approach and see what it looks like.
Or should we update the docs and say if
inds
is a collection, you get a subset of original input, row or column oriented.
This is what I think we should do. For example for DataFrames.jl I imagine that we will have the following implementation:
Tables.getrows(df::AbstractDataFrame, inds; view=nothing) = view === true ? view(df, inds, :) : df[inds, :]
as it seems most natural (and in general - it will be more efficient for a given table type to decide what to do; I assume that people will expect Tables.getrows
to be efficient).
I would set the issue of "indexable collection of rows" and "indexable collection of columns" (for completeness as separate case) since it is orthogonal to the Tables.getrows
design. I think we should provide such an API (again, in DataFrames.jl this API is provided by the eachrow
and eachcol
functions, but we could choose some other names here, as in Base Julia eachrow
and eachcol
were added later and they only support iteration interface but not indexing interface).
So given that we want this to be more of a general "subset" function, I'd like to change the name from getrows
. 1) I think Tables.getrows
is too similar to Tables.rows
and it seems hard to explain the difference since they seem so similar and 2) for column-oriented tables, you're not really getting "rows" back, but you're just getting a subset of the original table.
I'm going to propose we call this Tables.subset
to be more clear.
Tables.subset
is very good for me as it is consistent in terminology with DataFrames.jl :).
@ablaom @ToucheSir @CarloLucibello - are you OK with this?
Alternatively we could have 2 functions:
-
Table.getrow
to get a single row (and it would accept a single index) -
Table.subset
to get a subset of a table and return a table (and it would expect a collection of indices)
Ok, cleaned up implementation here: https://github.com/JuliaData/Tables.jl/pull/292
Tables.subset
is a reasonable choice, fine for me
Mmm. I'm not super happy with removing "rows" from the the name. If I'm new to tables in julia, "subset" could mean anything. I similarly never liked name like "select" . Select what? I want to know which axis I'm slicing on.
How about rowsubset
?
But that's kind of the point of our discussion/decision above: this function isn't necessarily returning "rows" if the original table input is column-oriented; rather, it returns a "subset" of the original table, preserving whether the original table was column or row-oriented. Thus, Tables.subset
(note the Tables.
prefix!) seems clearest to what the function is actually doing.
Yes, I get that it is not returning "rows" in the sense of not returning a row iterator. But I'm not sure general users equate "rows" with "row iterator", maybe just Tables.jl developers do that 😉 .
I also appreciate these name decisions are very difficult to get right and to get consensus on. Just putting in my two cents.
And I agree that aligning with DataFrames - which most users will know well - makes sense.
One could always go maximally explicit with something like Tables.subset_along_row_dim
, but I imagine folks would not be terribly enthused about that name :P