DTables.jl
DTable TODO/Ideas
Tables.jl interface:
- [x] DTable as a source
- [x] DTable constructors utilizing the interface better if possible - check what is possible and what can be improved
- [x] Schema usage/handling
Table operations:
- [x] select (can be done with map)
- [x] transform (can be done with map)
- [x] groupby (discrete values or function input for continuous values) https://github.com/JuliaParallel/Dagger.jl/pull/275
- [x] per group reductions
- [ ] apply per group
- [x] join (leftjoin and innerjoin done for DTable and input)
- [x] specialized joins for: DTable and DTable input
- [ ] indexing and make it work with other ops? (grouped index already there, but some general indexing would be nice as well)
- [x] shuffle (internal for groupbys and indexing)
Convenience functions:
- [x] length
- [x] column names
- [ ] first, last
- [ ] describe
- [ ] partition information
- [ ] row distribution across partitions
Docs:
- [ ] Best practices on partitioning the table
- [ ] OnlineStats examples once this is merged https://github.com/joshday/OnlineStatsBase.jl/pull/26
Just FYI: we have used JuliaDB (NDSparse and IndexedTables) for many years. Our biggest issue with using it in a production application is that every combination of columns, filters, and aggregates triggers a new "compile" (due to passing named tuples as arguments). The first "query" of any combination can take several minutes (and there are 10,000s of combinations); the next one takes only a few seconds. We have tried @nospecialize and data type hiding. I think DataFrames has it right. I was hoping there would be a "Distributed DataFrames" in the future.
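To illustrate the mechanism described in this comment, here is a minimal sketch in plain Julia (no JuliaDB involved). The function `total` and the column names are hypothetical; the point is only that the field names of a named tuple are part of its type, so each distinct column combination is a distinct argument type and gets its own compiled specialization.

```julia
# Two column selections expressed as NamedTuples.
# The field names (:a, :b) vs (:a, :c) are encoded in the type itself.
sel_ab = (a = [1, 2], b = [3, 4])
sel_ac = (a = [1, 2], c = [5, 6])

# Hypothetical query function. Julia compiles a separate specialization
# of it for every distinct argument type it is called with.
total(cols::NamedTuple) = sum(sum, values(cols))

# Same element types, different names: the two types are not equal.
same_type = typeof(sel_ab) == typeof(sel_ac)

total_ab = total(sel_ab)  # compiled for the (:a, :b) type
total_ac = total(sel_ac)  # compiled again for the (:a, :c) type
```

A table type that keeps column names as runtime data rather than type parameters avoids this per-combination compilation, at the cost of less type information inside kernels.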
Hoping the progress status can be updated.
Is `transform` implemented in https://github.com/JuliaParallel/DTables.jl/pull/22?
Hello. I'm trying to implement a few convenience functions, but I had a small question: are `first` and `last` supposed to take in a `DTable`, and return a `DTable` with the first and last chunks respectively?
Also, what is `describe` supposed to do on a `DTable`?
cc @krynju @jpsamaroo
@codetalker7 yes, I'd think it would be called like `first(tbl, 5)` and return a new `DTable` with the same `chunksize` as the input `tbl`. It may return one or more chunks, depending on how many are needed to fulfill the user's requested number of rows.
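As a rough sketch of that chunk-selection logic, assuming each chunk is just a plain `Vector` of rows (a stand-in, not a real `DTable` partition) and using a hypothetical helper name `first_rows`:

```julia
# Hypothetical helper, not part of DTables.jl.
# `chunks` stands in for the partitions of a table; each chunk is a Vector of rows.
function first_rows(chunks::Vector{<:AbstractVector}, n::Integer)
    n >= 0 || throw(ArgumentError("n must be non-negative"))
    out = similar(chunks, 0)      # empty container with the same chunk type
    remaining = n
    for chunk in chunks
        remaining == 0 && break   # enough rows collected, skip later chunks
        take = min(remaining, length(chunk))
        take == 0 && continue     # ignore empty chunks
        push!(out, chunk[1:take]) # whole chunk, or a truncated final chunk
        remaining -= take
    end
    return out
end

chunks = [[1, 2, 3], [4, 5, 6], [7, 8]]
result = first_rows(chunks, 5)
```

Here the first chunk is kept whole and the second is truncated, so the result preserves the input's chunk boundaries and never touches the third chunk.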
For `describe`, it's probably better to just show what we want:

    julia> using DataFrames

    julia> df = DataFrame(a=[1,2], b=[3,4])
    2×2 DataFrame
     Row │ a      b
         │ Int64  Int64
    ─────┼──────────────
       1 │     1      3
       2 │     2      4

    julia> describe(df)
    2×7 DataFrame
     Row │ variable  mean     min    median   max    nmissing  eltype
         │ Symbol    Float64  Int64  Float64  Int64  Int64     DataType
    ─────┼──────────────────────────────────────────────────────────────
       1 │ a             1.5      1      1.5      2         0  Int64
       2 │ b             3.5      3      3.5      4         0  Int64

We basically want to emulate what DataFrames does for most of these things.
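For reference, the per-column statistics in that output can be sketched on a plain `NamedTuple` of column vectors. This is only an illustration of the target result: `describe_columns` is a hypothetical helper, it is not distributed, and it is not how DataFrames or DTables implement `describe`.

```julia
using Statistics  # standard library: mean, median

# Hypothetical helper, not part of DTables.jl or DataFrames.jl.
# `tbl` is a NamedTuple of column vectors standing in for a table.
function describe_columns(tbl::NamedTuple)
    return map(collect(keys(tbl))) do name
        col = tbl[name]
        vals = skipmissing(col)  # statistics ignore missing values
        (
            variable = name,
            mean     = mean(vals),
            min      = minimum(vals),
            median   = median(vals),
            max      = maximum(vals),
            nmissing = count(ismissing, col),
            eltype   = eltype(col),
        )
    end
end

summary_rows = describe_columns((a = [1, 2], b = [3, 4]))
```

On a chunked table, `min`, `max`, `nmissing`, and `mean` (via per-chunk sums and counts) can be combined from per-chunk results, while an exact `median` cannot be derived from per-chunk medians alone and would need a different strategy.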