JuliaDB.jl
`groupby(identity, ...)` and `groupjoin` should create a table of tables
This seems to be more useful generally! I think the cost incurred in the extra allocation is negligible.
We should keep the table indexed only by those columns for which doing so incurs no further reordering (i.e. the trailing indexed columns appear in the groups).
A possible optimization here would be storing data for all the tables in a single long Columns object and constructing the table with views of this big Columns object.
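A minimal sketch of that idea in plain Julia, using only Base. The NamedTuple `cols` is a hypothetical stand-in for a sorted `Columns` object, and `group_ranges` is an illustrative helper, not part of JuliaDB. The data is borrowed from the Dee example further down the thread.

```julia
# Hypothetical stand-in for a key-sorted Columns object:
# a NamedTuple of equal-length vectors (one big shared storage).
cols = (name = ["Anne", "Boris", "Boris", "Cindy", "Devinder"],
        id   = ["S1", "S2", "S5", "S3", "S4"])

# Return the contiguous index range of each run of equal keys.
# Assumes `sorted_keys` is already sorted, so equal keys are adjacent.
function group_ranges(sorted_keys::AbstractVector)
    ranges = UnitRange{Int}[]
    start = firstindex(sorted_keys)
    for i in eachindex(sorted_keys)
        if i == lastindex(sorted_keys) || sorted_keys[i + 1] != sorted_keys[i]
            push!(ranges, start:i)
            start = i + 1
        end
    end
    return ranges
end

# Each group holds views into the shared vectors, so no column data is copied.
groups = [(key = cols.name[first(r)], rows = map(c -> view(c, r), cols))
          for r in group_ranges(cols.name)]

groups[2].key       # "Boris"
groups[2].rows.id   # view over ["S2", "S5"]
```

Because every group is a `view`, `parent(groups[2].rows.id) === cols.id` holds: the only per-group cost is the small wrapper object, not a copy of the data.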
In case it is of interest, here's an example of an experimental implementation in Python called "Dee" (which is trying to implement something called The Third Manifesto).
http://www.quicksort.co.uk/DeeDoc.html#grouping-and-ungrouping-group-ungroup
I simply like the ASCII printing of the nested tables here (possibly unwieldy with more data...), something appealed to me about it:
>>> A = GROUP(IS_CALLED, ['StudentId'], 'StudentIds')
>>> print A
+----------+---------------+
| Name     | StudentIds    |
+==========+===============+
| Anne     | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S1        | |
|          | +-----------+ |
| Boris    | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S2        | |
|          | | S5        | |
|          | +-----------+ |
| Cindy    | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S3        | |
|          | +-----------+ |
| Devinder | +-----------+ |
|          | | StudentId | |
|          | +===========+ |
|          | | S4        | |
|          | +-----------+ |
+----------+---------------+
After thinking a bit more, I think it's a very good idea: I often end up calling `table` anyway inside the `groupby` function, as I want to do something that is not implemented for `Columns` (and for a completely inexperienced user it may be a tougher problem to overcome). Plus, right now we sometimes end up losing some sortedness information for no reason.
One thing that I just ran into: an advantage of `Columns` over tables is that tables do not allow `df[I]` where `I` is a vector of booleans or an unsorted vector. We should probably fix the first case, and allow the second if `df` does not have any primary keys.
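To illustrate why the two cases differ, here is a small sketch in plain Julia (Base only). The vector `pkey` is a hypothetical sorted primary-key column, and `preserves_order` is an illustrative helper, not an existing JuliaDB function.

```julia
# Hypothetical primary-key column of a table, kept sorted.
pkey = [10, 20, 30, 40, 50]

mask = [true, false, true, false, true]   # boolean selector
perm = [4, 1, 3]                          # unsorted integer selector

# A boolean mask selects the same rows as the ascending index vector
# findall(mask), so the selected keys remain sorted.
idx = findall(mask)         # [1, 3, 5]
issorted(pkey[idx])         # true

# An unsorted index vector reorders the rows, which breaks the key order.
issorted(pkey[perm])        # false

# Illustrative check a table could run before accepting a selector.
preserves_order(I::AbstractVector{Bool}) = true
preserves_order(I::AbstractVector{<:Integer}) = issorted(I)
```

Under these assumptions a boolean `I` is always safe for a keyed table, while an unsorted `I` is only safe when there is no key order to preserve.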
> I often end up calling `table` anyway inside the `groupby` function as I want to do something that is not implemented for `Columns` (and for a completely inexperienced user it may be a tougher problem to overcome)
~~How exactly do you convert a group into a table (inexperienced user here)?~~ (figured it out)
I currently do `x = groupby(identity, ...` and then iterate over the groups in `x` with `for gr in x`. I then need to further group `gr` on a secondary grouping factor, and loop over those...
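The nested pattern described here can be sketched in plain Julia without any JuliaDB calls. Everything below is an assumption made for illustration: the `rows` data, the `site` and `year` factors, and the `groupdict` helper are hypothetical and only mimic the shape of the loop.

```julia
# Hypothetical rows with a primary factor (site) and a secondary factor (year).
rows = [(site = "A", year = 2017, x = 1.0),
        (site = "A", year = 2017, x = 2.0),
        (site = "A", year = 2018, x = 3.0),
        (site = "B", year = 2017, x = 4.0)]

# Illustrative helper: collect rows into a Dict keyed by f(row).
function groupdict(f, rows)
    groups = Dict{Any, Vector{eltype(rows)}}()
    for r in rows
        push!(get!(() -> Vector{eltype(rows)}(), groups, f(r)), r)
    end
    return groups
end

# Nested version: group on the primary factor, then regroup each group.
totals = Dict{Tuple{String, Int}, Float64}()
for (site, gr) in groupdict(r -> r.site, rows)
    for (year, sub) in groupdict(r -> r.year, gr)
        totals[(site, year)] = sum(r.x for r in sub)
    end
end

# Flat version: group once on the pair of factors.
flat = Dict(k => sum(r.x for r in v)
            for (k, v) in groupdict(r -> (r.site, r.year), rows))
```

Both versions produce the same per-(site, year) sums here, so when the inner loop does not need the intermediate per-site group, grouping once on both factors avoids the second pass.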
Btw, the `groupby` function in DataFrames.jl is much faster than JuliaDB.jl's. Why?