DataFrames.jl icon indicating copy to clipboard operation
DataFrames.jl copied to clipboard

Functions/structs to move out of DataFrames to DataAPI (DataFrames syntax for DTable)

Open krynju opened this issue 2 years ago • 10 comments

I'm wrapping up https://github.com/JuliaParallel/Dagger.jl/pull/344 and I managed to narrow down the list of things I need to import from DataFrames to get the DataFrames-style select somewhat working

Here's the list:

Index
ColumnIndex
MultiColumnIndex
normalize_selection
ByRow
AsTable

With these in a common package I could drop the DataFrames dependency altogether to get the select syntax working

I think with some more work broadcast_pair could be moved out as well and then one could technically have DataFrames select syntax working just by implementing these two functions for their own structure:

function _manipulate(df::DataFrame, normalized_cs::Vector{Any}, copycols::Bool, keeprows::Bool)
function manipulate(dt::DataFrame, args::AbstractVector{Int}; copycols::Bool, keeprows::Bool, renamecols::Bool)

krynju avatar Jun 12 '22 17:06 krynju

We can do it relatively easily, but I think it should be a separate interface package, not DataAPI.jl.

@nalimilan - what do you think?

bkamins avatar Jun 12 '22 18:06 bkamins

Why not DataAPI.jl? It seems like it would be fine there.

quinnj avatar Jun 12 '22 22:06 quinnj

Because these are not only signatures, but also quite complex logic (probably in total it is ~1000LOC - I have not counted). I thought we want to keep DataAPI.jl lightweight so that it does not add compilation latency to packages that depend on it.

bkamins avatar Jun 13 '22 05:06 bkamins

Yes, DataAPI should remain really lightweight. We could include empty definitions there (or basic constructors) but not big implementations. These would sound more suited to something like TableTransforms, but I'm not sure they are generic enough for that. Does the Dagger implementation support any kind of Tables.jl object that may be wrapped by a DTable?

Otherwise we can create another package. We talked about having DataFramesBase in the past (https://github.com/JuliaData/DataFrames.jl/issues/1764). Though the drawback is that it can make it more difficult to develop DataFrames. We would have to define more precisely what is the API and try to keep it stable.

nalimilan avatar Jun 16 '22 12:06 nalimilan

We could include empty definitions there

There is no benefit from this. The point is to have a shared implementation so that both packages process operation specification language in the same way.

they are generic enough

They are not

Does the Dagger implementation support any kind of Tables.jl object

DTable is not DataFrames.jl specific, but this is irrelevant. The functions tht @krynju proposes to move are not related to DataFrames.jl implementation details of an AbstractDataFrame object - they only serve as pre-processing of operation specification language to cannonical form.

We would have to define more precisely what is the API and try to keep it stable.

As commented above - the objective of the functionalities that @krynju proposes to extract out is narrower than full implementation of common data frame operations. It is just pre-processing of operation specification language (and this is something essentially independent from AbstractDataFrame except for the fact that it accepts both Symbol and string as column name, but I think it is currently a common consensus that this is a desirable way to look at column names)

bkamins avatar Jun 16 '22 14:06 bkamins

Yes, this is basically all about parsing the args... provided by the user into a format that the end structure (DataFrames or DTable) can take and manipulate the data with it.

Let's take this piece of code as an example:

select(df::DTable, @nospecialize(args...); copycols::Bool=true, renamecols::Bool=true) =
    manipulate(df, map(x -> broadcast_pair(df, x), args)...,
        copycols=copycols, keeprows=true, renamecols=renamecols)

I think the ideal setup for now would be to:

  1. move broadcast_pair to an external package
  2. move the Index related things mentioned in the original post to an external package
  3. pinpoint the bare minimum interface to use broadcast_pair (like names, index etc.)
  4. leave the structure to implement it's own manipulate set of functions to handle the output of map(x -> broadcast_pair(df, x), args)...
  5. put bare definitions of select, transform, combine etc. into DataAPI, so that other structures can extend them (like the *join methods)

Even if this args parsing part of DataFrames is moved out I think it will be pretty simple to maintain compatibility as this frontend part of DataFrames is pretty stable in general

krynju avatar Jun 16 '22 15:06 krynju

This all makes sense to me; thanks for the explanations. Sounds like a great plan.

quinnj avatar Jun 16 '22 17:06 quinnj

OK. How should we call it then? If it's not specific to DataFrames the name should be relatively general, as this kind of syntax could be useful for other tables.

nalimilan avatar Jun 16 '22 19:06 nalimilan

I mark this for 1.5 release milestone (maybe it will be OK to make it in 1.4.x patch release as it should be internal). The reason is that 1.4 release will be made soon and I do not want this discussion to delay it.

bkamins avatar Jun 19 '22 09:06 bkamins

@krynju - I am clearing the milestone from this PR. Given our earlier discussions - is this PR needed in the end?

bkamins avatar Dec 25 '22 17:12 bkamins