DataFrames.jl
DataFrames.jl copied to clipboard
Functions/structs to move out of DataFrames to DataAPI (DataFrames syntax for DTable)
I'm wrapping up https://github.com/JuliaParallel/Dagger.jl/pull/344 and I managed to narrow down the list of things I need to import from DataFrames to get the DataFrames-style select somewhat working
Here's the list:
Index
ColumnIndex
MultiColumnIndex
normalize_selection
ByRow
AsTable
With these in a common package I could drop the DataFrames dependency altogether to get the select
syntax working
I think with some more work broadcast_pair
could be moved out as well and then one could technically have DataFrames select
syntax working just by implementing these two functions for their own structure:
function _manipulate(df::DataFrame, normalized_cs::Vector{Any}, copycols::Bool, keeprows::Bool)
function manipulate(dt::DataFrame, args::AbstractVector{Int}; copycols::Bool, keeprows::Bool, renamecols::Bool)
We can do it relatively easily, but I think it should be a separate interface package, not DataAPI.jl.
@nalimilan - what do you think?
Why not DataAPI.jl? It seems like it would be fine there.
Because these are not only signatures, but also quite complex logic (probably in total it is ~1000LOC - I have not counted). I thought we want to keep DataAPI.jl lightweight so that it does not add compilation latency to packages that depend on it.
Yes, DataAPI should remain really lightweight. We could include empty definitions there (or basic constructors) but not big implementations. These would sound more suited to something like TableTransforms, but I'm not sure they are generic enough for that. Does the Dagger implementation support any kind of Tables.jl object that may be wrapped by a DTable
?
Otherwise we can create another package. We talked about having DataFramesBase in the past (https://github.com/JuliaData/DataFrames.jl/issues/1764). Though the drawback is that it can make it more difficult to develop DataFrames. We would have to define more precisely what is the API and try to keep it stable.
We could include empty definitions there
There is no benefit from this. The point is to have a shared implementation so that both packages process operation specification language in the same way.
they are generic enough
They are not
Does the Dagger implementation support any kind of Tables.jl object
DTable
is not DataFrames.jl specific, but this is irrelevant. The functions tht @krynju proposes to move are not related to DataFrames.jl implementation details of an AbstractDataFrame
object - they only serve as pre-processing of operation specification language to cannonical form.
We would have to define more precisely what is the API and try to keep it stable.
As commented above - the objective of the functionalities that @krynju proposes to extract out is narrower than full implementation of common data frame operations. It is just pre-processing of operation specification language (and this is something essentially independent from AbstractDataFrame
except for the fact that it accepts both Symbol
and string as column name, but I think it is currently a common consensus that this is a desirable way to look at column names)
Yes, this is basically all about parsing the args...
provided by the user into a format that the end structure (DataFrames or DTable) can take and manipulate the data with it.
Let's take this piece of code as an example:
select(df::DTable, @nospecialize(args...); copycols::Bool=true, renamecols::Bool=true) =
manipulate(df, map(x -> broadcast_pair(df, x), args)...,
copycols=copycols, keeprows=true, renamecols=renamecols)
I think the ideal setup for now would be to:
- move
broadcast_pair
to an external package - move the
Index
related things mentioned in the original post to an external package - pinpoint the bare minimum interface to use
broadcast_pair
(likenames
,index
etc.) - leave the structure to implement it's own
manipulate
set of functions to handle the output ofmap(x -> broadcast_pair(df, x), args)...
- put bare definitions of
select
,transform
,combine
etc. intoDataAPI
, so that other structures can extend them (like the*join
methods)
Even if this args
parsing part of DataFrames is moved out I think it will be pretty simple to maintain compatibility as this frontend part of DataFrames is pretty stable in general
This all makes sense to me; thanks for the explanations. Sounds like a great plan.
OK. How should we call it then? If it's not specific to DataFrames the name should be relatively general, as this kind of syntax could be useful for other tables.
I mark this for 1.5 release milestone (maybe it will be OK to make it in 1.4.x patch release as it should be internal). The reason is that 1.4 release will be made soon and I do not want this discussion to delay it.
@krynju - I am clearing the milestone from this PR. Given our earlier discussions - is this PR needed in the end?