DTables.jl

DTable TODO/Ideas

Open krynju opened this issue 3 years ago • 5 comments

Tables.jl interface:

  • [x] DTable as a source
  • [x] DTable constructors utilizing the interface better if possible - check what is possible and what can be improved
  • [x] Schema usage/handling

Table operations:

  • [x] select (can be done with map)
  • [x] transform (can be done with map)
  • [x] groupby (discrete values or function input for continuous values) https://github.com/JuliaParallel/Dagger.jl/pull/275
  • [x] per group reductions
  • [ ] apply per group
  • [x] join (leftjoin and innerjoin done for DTable and input)
  • [x] specialized joins for: DTable and DTable input
  • [ ] indexing and make it work with other ops? (grouped index already there, but some general indexing would be nice as well)
  • [x] shuffle (internal for groupbys and indexing)
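
As a rough illustration of the "can be done with map" notes above, here is a minimal sketch in plain Julia. It assumes a plain vector of NamedTuple rows as a stand-in for a DTable (that stand-in is an assumption for illustration, not the DTables.jl API); the same row-wise functions are the kind of thing one would pass to a row-wise `map`.

```julia
# Stand-in for a table: a plain vector of NamedTuple rows (an assumption for
# illustration; a real DTable is partitioned and lazy).
rows = [(a = 1, b = 3), (a = 2, b = 4)]

# "select": keep only column `a` by returning a narrower row.
selected = map(row -> (a = row.a,), rows)

# "transform": keep existing columns and add a derived column `c`.
transformed = map(row -> merge(row, (c = row.a + row.b,)), rows)
```

Both operations are expressed purely as row-to-row functions, which is why a single `map` primitive is enough to cover them.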

Convenience functions:

  • [x] length
  • [x] column names
  • [ ] first, last
  • [ ] describe
  • [ ] partition information
  • [ ] row distribution across partitions

Docs:

  • [ ] Best practices on partitioning the table
  • [ ] OnlineStats examples once this is merged https://github.com/joshday/OnlineStatsBase.jl/pull/26

krynju avatar Aug 29 '21 09:08 krynju

Just FYI: we have used JuliaDB (NDSparse and IndexedTables) for many years. Our biggest issue with using it in a production application is that every combination of columns, filters, and aggregates triggers a new compile (due to passing named tuples as arguments). The first "query" of any combination can take several minutes (and there are tens of thousands of combinations); the next one takes only a few seconds. We have tried @nospecialize and data type hiding. I think DataFrames has it right. I was hoping there would be a "Distributed DataFrames" in the future.
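
To make the mechanism behind this concrete, here is a small sketch in plain Julia. It only shows why each column combination is a new type to the compiler; the names `total` and `total_generic` are made up for illustration, and this is not a reproduction of the JuliaDB workload described above.

```julia
# Two rows with different column names.
row_ab = (a = 1, b = 2.0)
row_ac = (a = 1, c = 2.0)

# Field names are part of a NamedTuple's type, so these types differ.
same_type = typeof(row_ab) == typeof(row_ac)

# A generic function is compiled separately for each concrete argument type,
# so every new column combination pays a fresh compilation cost on first call.
total(row) = sum(values(row))
t_ab = total(row_ab)
t_ac = total(row_ac)

# `@nospecialize` is a hint asking the compiler not to specialize on this argument.
function total_generic(@nospecialize(row))
    return sum(values(row))
end
t_generic = total_generic(row_ab)
```

DataFrames avoids this by keeping column names and types out of the table's type, trading some per-call speed for far fewer compilations.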

cwiese avatar Dec 10 '21 14:12 cwiese

Hoping for an update on the progress status.

zsz00 avatar Jul 07 '22 04:07 zsz00

Was transform implemented in https://github.com/JuliaParallel/DTables.jl/pull/22?

jpsamaroo avatar Mar 03 '23 15:03 jpsamaroo

Hello. I'm trying to implement a few convenience functions, but I had a small question: are first and last supposed to take in a DTable, and return a DTable with the first and last chunks respectively?

Also, what is describe supposed to do on a DTable?

cc @krynju @jpsamaroo

codetalker7 avatar Aug 08 '23 12:08 codetalker7

@codetalker7 yes, I'd think it would be called like first(tbl, 5) and return a new DTable with the same chunksize as the input tbl. It may return one or more chunks, depending on how many are needed to fulfill the user's requested number of rows.
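
A minimal sketch of those semantics in plain Julia, assuming a table is modeled as a vector of chunks and each chunk is a vector of rows. `first_rows` is a hypothetical name for illustration, not an existing DTables.jl function, and a real implementation would work on lazy chunks rather than materialized vectors.

```julia
# Hypothetical stand-in for a DTable: a vector of chunks, each a vector of rows.
function first_rows(chunks::Vector{<:AbstractVector}, n::Integer, chunksize::Integer)
    taken = eltype(eltype(chunks))[]
    for chunk in chunks
        length(taken) >= n && break              # stop once enough rows are collected
        needed = n - length(taken)
        append!(taken, chunk[1:min(needed, length(chunk))])
    end
    # Re-partition the collected rows using the input's chunksize.
    return [taken[i:min(i + chunksize - 1, end)] for i in 1:chunksize:length(taken)]
end

chunks = [[(a = 1,), (a = 2,)], [(a = 3,), (a = 4,)], [(a = 5,)]]
result = first_rows(chunks, 3, 2)
```

Here three rows are requested from a table with chunksize 2, so the result spans two chunks: one full and one partial.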

For describe, it's probably better to just show what we want:

julia> using DataFrames
julia> df = DataFrame(a=[1,2], b=[3,4])
2×2 DataFrame
 Row │ a      b
     │ Int64  Int64
─────┼──────────────
   1 │     1      3
   2 │     2      4

julia> describe(df)
2×7 DataFrame
 Row │ variable  mean     min    median   max    nmissing  eltype
     │ Symbol    Float64  Int64  Float64  Int64  Int64     DataType
─────┼──────────────────────────────────────────────────────────────
   1 │ a             1.5      1      1.5      2         0  Int64
   2 │ b             3.5      3      3.5      4         0  Int64

We basically want to emulate what DataFrames does for most of these things.
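
One possible way to get there on partitioned data, sketched in plain Julia: compute partial statistics per chunk and merge them. This assumes each chunk is a NamedTuple of column vectors with at least one non-missing value per column; `chunk_stats`, `merge_stats`, and `describe_chunks` are hypothetical names, not DTables.jl API. The median is left out because it does not merge from per-chunk partials this simply.

```julia
# Partial statistics for one column of one chunk.
chunk_stats(col) = (sum = sum(skipmissing(col)), n = count(!ismissing, col),
                    min = minimum(skipmissing(col)), max = maximum(skipmissing(col)),
                    nmissing = count(ismissing, col))

# Combine the partial statistics of two chunks.
merge_stats(x, y) = (sum = x.sum + y.sum, n = x.n + y.n,
                     min = min(x.min, y.min), max = max(x.max, y.max),
                     nmissing = x.nmissing + y.nmissing)

# One summary row per column, reduced across all chunks.
function describe_chunks(chunks)
    map(keys(first(chunks))) do name
        s = mapreduce(c -> chunk_stats(c[name]), merge_stats, chunks)
        (variable = name, mean = s.sum / s.n, min = s.min, max = s.max, nmissing = s.nmissing)
    end
end

# Same data as the DataFrame above, split into two single-row chunks.
chunks = [(a = [1], b = [3]), (a = [2], b = [4])]
stats = describe_chunks(chunks)
```

For the example data this gives the same mean, min, max, and nmissing values as the `describe(df)` output shown above.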

jpsamaroo avatar Aug 08 '23 13:08 jpsamaroo