TypedTables.jl
Add methods to get a subset of columns
Maybe it is my unfamiliarity with working with NamedTuples but I found it awkward to get a subset of columns when I was porting my DataFrames code to use TypedTables. I came up with the following functions:
https://github.com/GlenHertz/TypedTables.jl/commit/388cadf8d7d92f7ee18d1dc53055cb3b76befec7
I'm not sure if indexing a column by Int makes sense but I do like indexing by Symbols.
Let me know if you think these are useful or if there is an alternate way to do it. I also created ncolumns to get the number of columns.
It’s not just you - we need some version of this functionality.
I was vaguely thinking getproperties would be generic. Some kind of select like JuliaDB would also be nice.
@piever and I have been discussing some things over here. I also put up some sample code of a select operation that would apply to any Tables.jl implementor. At least for simple operations, I think it'd be great to have a core set of implementations gathered somewhere together.
Cool! I actually think most of these are just filling out the obvious functions (which don't exist yet) of what you are calling the "PropertyAccessible" interface.
What I'm getting at is having a variety of operations which work on things (like NamedTuples) that have properties would seem sufficient to me. If an object supports getting one property, why not have an operation (I'm calling it "getproperties" in my head) that returns multiple properties? Hopefully it returns a similar container type, but it could just default to NamedTuple, which is still "PropertyAccessible".
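To make the idea concrete, here is a minimal sketch using only Base. The name `getproperties` is just the working name from this comment, not an existing function, and the `Person` struct is made up for illustration. The generic method falls back to a NamedTuple, as suggested above.

```julia
# Hypothetical helper: `getproperties` is the working name used in this thread,
# not a function that exists in Base.
getproperties(obj, names::Tuple{Vararg{Symbol}}) =
    NamedTuple{names}(map(name -> getproperty(obj, name), names))

# NamedTuples can use the Base constructor that selects fields by name.
getproperties(nt::NamedTuple, names::Tuple{Vararg{Symbol}}) = NamedTuple{names}(nt)

# Made-up struct, only here to show the generic fallback.
struct Person
    name::String
    age::Int
    city::String
end

row = (name = "Alice", age = 25, city = "Perth")
getproperties(row, (:age, :name))                          # (age = 25, name = "Alice")
getproperties(Person("Bob", 42, "Oslo"), (:name, :city))   # (name = "Bob", city = "Oslo")
```

The result follows the order of the requested names, so the same call also reorders properties.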
Such an operation would work on individual rows and on Tables.columns(table), as well as tables that directly support getproperty, or collections of rows (Tables.rows(table)) via map or broadcast (I have some operations like this here already). In MinimumViableTables.jl, I was experimenting with a similar operation but using the relational-algebra term project, here.
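As a rough illustration of that point, the sketch below applies one hypothetical `getproperties` helper (an assumed name, not an existing API) to a column-oriented NamedTuple of vectors and, via map, to a vector of row NamedTuples. Plain Base containers stand in for `Tables.columns(table)` and `Tables.rows(table)` here.

```julia
# Hypothetical helper, assumed for illustration; not part of Base or Tables.jl.
getproperties(obj, names::Tuple{Vararg{Symbol}}) =
    NamedTuple{names}(map(name -> getproperty(obj, name), names))

# Column-oriented stand-in: a NamedTuple of vectors.
cols = (name = ["Alice", "Bob"], age = [25, 42], city = ["Perth", "Oslo"])

# Row-oriented stand-in: a vector of NamedTuples.
rows = [(name = "Alice", age = 25, city = "Perth"),
        (name = "Bob", age = 42, city = "Oslo")]

subcols = getproperties(cols, (:name, :age))                # one call on the columns
subrows = map(r -> getproperties(r, (:name, :age)), rows)   # one call per row
```

The column version does not copy any data: `subcols.age` is the same vector as `cols.age`.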
(I will note that I strongly believe it simplifies things for end users greatly by just stating a table must support iteration of PropertyAccessible rows AND be PropertyAccessible for iterable columns - I'm trying to imagine a "DataStream" or "Query-able" object where this interface would be impossible but perhaps I'm not imaginative enough and I'd like to hear more).
select is also a good name, by the way, for tables at least (slightly less so for generic named tuples, structs, etc). I am also wondering if/where we can fit in the functionality of the JuliaDB work on select, which supports some kinds of transformations on the columns (like creating a new column via a pair like here).
To be clear - my suggested path here is to create another package called Properties.jl and pad out exactly what the "PropertyAccessible" interface really means (with an eye to the fact that it is useful for table manipulation). That package would implement this interface for NamedTuples but let other packages easily extend the methods for other custom types.
One related thing: I believe the PropertyAccessible interface should also be useful for matlab style struct arrays (see here for a julia 1.0 compatible implementation). There seems to be a good opportunity for unification here.
Yes! That looks great.
I kind-of feel a nice properties interface + a few container types and we've solved the struct-of-arrays vs arrays-of-struct problem pretty neatly, as well as giving users access to some basic "table"-like structures which they can manipulate with typical Julia code.
I really like the PropertyAccessible interface but the select concept seems a bit like a database API put onto Julia. For database users that will probably be natural but I'm concerned about whether it will be a good fit with base Julia. I find TypedTables (and Tables.jl) much more composable than DataFrames as the abstractions almost disappear (while DataFrames seems more like a black box with odd names for the API but probably familiar to R users). The difficulty with composability is that sometimes the operation is not provided in the package since Base already provides it, and it can be frustrating to find out how to do something without good docs -- but once you make the discovery the pattern is useful across a lot of Base. It is just an open question, but if there is already something like select in Base, or it can be generalized beyond Tables/databases, that would be my preferred solution. getindex allows the user to use multiple indices so maybe getproperty should too?? With the broadcasting change coming rather recently to Julia I'm not sure if it makes sense to use broadcasting to get a subset of multiple columns or not (e.g. getproperty.(obj, collection_of_propertynames))?
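For reference, here is a small Base-only sketch of what the broadcasting spelling gives on a NamedTuple. The variable names are illustrative. As far as I know, broadcasting directly over a NamedTuple is reserved in Base and throws an error, so the object is wrapped in `Ref` to be treated as a scalar.

```julia
row = (a = 1, b = 2, c = 3)

# Wrap the object in `Ref` so only the tuple of names is broadcast over.
vals = getproperty.(Ref(row), (:a, :c))   # (1, 3): a plain Tuple, the names are lost

# Getting a named result back takes an extra step.
sub = NamedTuple{(:a, :c)}(vals)          # (a = 1, c = 3)
```

So broadcasting does fetch several properties in one expression, but the result is an unnamed tuple, which is one argument for a dedicated operation that keeps the names.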
There's an epic syntax hack here waiting to be exploited. @c42f please dissuade me. :smile:
You get a single column of a table via . - e.g. table.column.
What if you want multiple columns? That's easy... use . with a tuple of multiple symbols:
table.(:column1, :column2)
This can be achieved via a nasty hack of broadcasted. Here's a not-so-type-stable example:
julia> Base.Broadcast.broadcasted(t::Union{Table, FlexTable}, s::Symbol...) = Table(NamedTuple{s}(map(name -> TypedTables.GetProperty(name)(t), s)))
julia> t = Table(name = ["Alice", "Bob", "Charlie"], age = [25, 42, 37])
Table with 2 columns and 3 rows:
name age
┌─────────────
1 │ Alice 25
2 │ Bob 42
3 │ Charlie 37
julia> t.(:age, :name)
Table with 2 columns and 3 rows:
age name
┌─────────────
1 │ 25 Alice
2 │ 42 Bob
3 │ 37 Charlie
Since NamedTuples aren't Callable we could do the same thing with those, I suppose.
Totally nasty, though.
I... I'm not sure what to say about that
It's clearly a semantic abuse, but is it worse than the getindex hack in base for typed comprehensions?
~~Presumably t[(:age, :name)] would make more sense and be almost as easy to type?~~
> Presumably t[(:age, :name)] would make more sense
No, just ignore that :-) My head is still in DataFrames land.
So... how do we feel about @Select and the backend GetProperties? Should we close this issue, or would something additional be wanted?
I guess I don't know why no one answered this one. Yes, you can do it:
@Select(col1, col2, col3)(mytable)
This returns a new Table containing just those columns.
However, the columns in the new table alias the columns of the original: both names refer to the same arrays in memory.
So, as with any aliased array, updating a value in the selected columns also updates the "original" larger table, i.e. the columns themselves are the same.
You have created two different names for two collections of the same columnar data.
And this happens very quickly indeed.
julia> t1 = Table(a=collect(1:5), b=collect(11:15), c=collect(21:25), d=collect(31:35))
Table with 4 columns and 5 rows:
a b c d
┌──────────────
1 │ 1 11 21 31
2 │ 2 12 22 32
3 │ 3 13 23 33
4 │ 4 14 24 34
5 │ 5 15 25 35
julia> t2 = @Select(a,c)(t1)
Table with 2 columns and 5 rows:
a c
┌──────
1 │ 1 21
2 │ 2 22
3 │ 3 23
4 │ 4 24
5 │ 5 25
julia> t2.a[1] = 50
50
julia> t2
Table with 2 columns and 5 rows:
a c
┌───────
1 │ 50 21
2 │ 2 22
3 │ 3 23
4 │ 4 24
5 │ 5 25
julia> t1
Table with 4 columns and 5 rows:
a b c d
┌───────────────
1 │ 50 11 21 31
2 │ 2 12 22 32
3 │ 3 13 23 33
4 │ 4 14 24 34
5 │ 5 15 25 35
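The same aliasing can be checked directly with `===`. The sketch below uses a plain NamedTuple of vectors as a stand-in for a table's columns (an assumption for illustration; no TypedTables calls are involved) and shows one way to get an independent copy when sharing is not wanted.

```julia
# Stand-in for a table's columns: a NamedTuple of vectors.
cols = (a = collect(1:5), b = collect(11:15), c = collect(21:25))

sub = NamedTuple{(:a, :c)}(cols)   # selects columns without copying the vectors
sub.a === cols.a                   # true: the same array object

independent = map(copy, sub)       # copies each selected column
independent.a[1] = 50
cols.a[1]                          # still 1, the original is untouched
```

Copying costs time and memory proportional to the selected columns, which is why the alias-by-default behavior is so fast.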