JuliaDB.jl

`groupby(identity, ...)` and `groupjoin` should create table of tables

Open shashi opened this issue 7 years ago • 5 comments

This seems to be more useful generally! I think the cost incurred in the extra allocation is negligible.

We should keep each group's table indexed only by those columns for which doing so incurs no further reordering (i.e. the trailing indexed columns that appear in the groups).

A possible optimization here would be storing data for all the tables in a single long Columns object and constructing the table with views of this big Columns object.
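To make the view idea concrete, here is a minimal sketch in plain Julia (Base only, not the JuliaDB API). It assumes the data is already sorted by the grouping key; `group_ranges`, `keys_col`, `ids_col` and the toy data are hypothetical names used only for illustration, and plain vectors plus named tuples stand in for a `Columns` object and the per-group tables.

```julia
# Toy data, stored once as long flat columns and already sorted by key.
keys_col = ["Anne", "Boris", "Boris", "Cindy", "Devinder"]
ids_col  = ["S1", "S2", "S5", "S3", "S4"]

# Hypothetical helper: contiguous row range of each group in a sorted key vector.
function group_ranges(sorted_keys::AbstractVector)
    ranges = UnitRange{Int}[]
    isempty(sorted_keys) && return ranges
    start = 1
    for i in 2:length(sorted_keys)
        if !isequal(sorted_keys[i], sorted_keys[i - 1])
            push!(ranges, start:i-1)   # close the previous group
            start = i
        end
    end
    push!(ranges, start:length(sorted_keys))  # close the last group
    return ranges
end

ranges = group_ranges(keys_col)

# Each "inner table" is a named tuple of views into the shared column,
# so building the groups copies no row data.
groups = [(Name = keys_col[first(r)],
           StudentIds = (StudentId = view(ids_col, r),)) for r in ranges]

for g in groups
    println(g.Name, " => ", collect(g.StudentIds.StudentId))
end
```

Because every group only holds a `view`, the extra allocation per group is a small wrapper rather than a copy of the rows. A real implementation would presumably wrap the views in tables instead of named tuples.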

shashi, Mar 10 '18 13:03

In case it is of interest, here's an example of an experimental implementation in Python called "Dee" (which is trying to implement something called The Third Manifesto).

http://www.quicksort.co.uk/DeeDoc.html#grouping-and-ungrouping-group-ungroup

I simply like the ASCII printing of the nested tables here (possibly unwieldy with more data...); something about it appealed to me:

```
>>> A = GROUP(IS_CALLED, ['StudentId'], 'StudentIds')
>>> print A
+----------+---------------+
| Name     | StudentIds    |
+==========+===============+
| Anne     | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S1        | |
|          | +-----------+ |
| Boris    | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S2        | |
|          | | S5        | |
|          | +-----------+ |
| Cindy    | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S3        | |
|          | +-----------+ |
| Devinder | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S4        | |
|          | +-----------+ |
+----------+---------------+
```

andyferris, Mar 11 '18 10:03

After thinking a bit more, I think it's a very good idea: I often end up calling `table` anyway inside the `groupby` function, as I want to do something that is not implemented for `Columns` (and for a completely inexperienced user this may be a tougher problem to overcome). Plus, right now we sometimes end up losing some sortedness information for no reason.

piever, Mar 11 '18 18:03

One thing that I just ran into: an advantage of `Columns` is that it supports `df[I]` where `I` is a vector of booleans or an unsorted vector, whereas tables allow neither. We should probably fix the first case for tables, and allow the second if `df` does not have any primary keys.

piever, Mar 16 '18 16:03

> I often end up calling table anyway inside the groupby function as I want to do something that is not implemented for Columns (and to a completely inexperienced user it may be a tougher problem to overcome)

~~How exactly do you convert a group into a table (inexperienced user here)?~~ (figured it out)

I currently do `x = groupby(identity, ...` and then iterate over the groups in `x` with `for gr in x`. I then need to further group `gr` on a secondary grouping factor, and loop over those...

yakir12, Feb 20 '19 10:02

Btw, the `groupby` function in DataFrames.jl is much faster than JuliaDB.jl's. Why?

norci, May 16 '19 08:05