JuliaDB.jl `groupby(identity, ...)` and `groupjoin` should create table of tables


This seems to be more useful generally! I think the cost incurred in the extra allocation is negligible.

Each group table should stay indexed only by those columns for which doing so requires no further reordering (i.e. the trailing indexed columns that appear in the groups).

A possible optimization here would be storing data for all the tables in a single long Columns object and constructing the table with views of this big Columns object.
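A rough sketch of that idea in plain Julia, using only Base and no JuliaDB types. The helper `group_views` and the sample columns are hypothetical names made up for illustration; the sketch assumes the rows are already sorted by the grouping key, so each group is a contiguous range of the one long column.

```julia
# Hypothetical helper, not part of JuliaDB: given a key vector that is already
# sorted and a NamedTuple of equally long columns, return one NamedTuple of
# views per group. All views share the storage of the original columns.
function group_views(ks::AbstractVector, cols::NamedTuple)
    groups = Pair{eltype(ks),Any}[]
    isempty(ks) && return groups
    start = firstindex(ks)
    for i in eachindex(ks)
        # A group ends at the last row, or where the next key differs.
        if i == lastindex(ks) || ks[i + 1] != ks[i]
            rows = start:i
            # `view` wraps the parent vector; no data is copied.
            push!(groups, ks[start] => map(c -> view(c, rows), cols))
            start = i + 1
        end
    end
    return groups
end

# Sample data (made up for this sketch), sorted by the grouping column.
name_col = ["Anne", "Boris", "Boris", "Cindy", "Devinder"]
cols = (StudentId = ["S1", "S2", "S5", "S3", "S4"],)

groups = group_views(name_col, cols)
```

Each entry of `groups` pairs a key with columns that are `SubArray`s into `cols`, so building the nested tables allocates only the small view wrappers rather than copies of the data.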

Mar 10 '18 13:03 shashi

In case it is of interest, here's an example of an experimental implementation in Python called "Dee" (which is trying to implement something called The Third Manifesto).

http://www.quicksort.co.uk/DeeDoc.html#grouping-and-ungrouping-group-ungroup

I simply like the ASCII printing of the nested tables here (possibly unwieldy with more data...), something appealed to me about it:

>>> A = GROUP(IS_CALLED, ['StudentId'], 'StudentIds')
>>> print A
+----------+---------------+
| Name     | StudentIds    |
+==========+===============+
| Anne     | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S1        | |
|          | +-----------+ |
| Boris    | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S2        | |
|          | | S5        | |
|          | +-----------+ |
| Cindy    | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S3        | |
|          | +-----------+ |
| Devinder | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S4        | |
|          | +-----------+ |
+----------+---------------+

Mar 11 '18 10:03 andyferris

After thinking a bit more, I think it's a very good idea: I often end up calling table anyway inside the groupby function as I want to do something that is not implemented for Columns (and to a completely inexperienced user it may be a tougher problem to overcome). Plus, right now we sometimes end up losing some sortedness information for no reason.

Mar 11 '18 18:03 piever

One thing that I just ran into: an advantage of Columns over tables is that tables do not allow df[I] where I is a vector of booleans or an unsorted vector. We should probably fix the first case and allow the second if df does not have any primary keys.

Mar 16 '18 16:03 piever

> I often end up calling table anyway inside the groupby function as I want to do something that is not implemented for Columns (and to a completely inexperienced user it may be a tougher problem to overcome)

~~How exactly do you convert a group into a table (inexperienced user here)?~~ (figured it out)

I currently do `x = groupby(identity, ...` and then iterate over the groups in `x` with `for gr in x`. I then need to further group `gr` on a secondary grouping factor, and loop over those...
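For reference, the nested loop described above looks roughly like this in plain Julia. This is only a sketch with Base data structures, not the JuliaDB API; the helper `group_rows` and the columns `site`, `trial` and `value` are hypothetical names invented for the example.

```julia
# Hypothetical helper, not a JuliaDB function: map each distinct key to the
# positions (within `ks`) of the rows that carry it.
function group_rows(ks::AbstractVector)
    rows = Dict{eltype(ks),Vector{Int}}()
    for (i, k) in pairs(ks)
        push!(get!(() -> Int[], rows, k), i)
    end
    return rows
end

# Made-up columns: a primary factor, a secondary factor, and a value.
site  = ["a", "a", "b", "a", "b"]
trial = [1, 2, 1, 1, 2]
value = [10.0, 20.0, 30.0, 40.0, 50.0]

totals = Dict{Tuple{String,Int},Float64}()
for (s, outer) in group_rows(site)
    # Group again, but only within the rows of the current outer group.
    for (t, inner) in group_rows(view(trial, outer))
        # `inner` indexes into `outer`, so map back to the original rows.
        totals[(s, t)] = sum(value[outer[inner]])
    end
end
```

If every group were itself a table, the inner step could presumably be an ordinary grouping call on that table instead of manual index bookkeeping like `outer[inner]`.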

Feb 20 '19 10:02 yakir12

btw, the groupby function in DataFrames.jl is much faster than JuliaDB.jl's. Why?

May 16 '19 08:05 norci
