DataFrames.jl

Formalize API for column vectors

Open pdeffebach opened this issue 3 years ago • 4 comments

I feel like the question about data frames with distributed arrays comes up a lot. My impression is that we don't know for sure whether a Dagger array, etc., can "just work" as a column in a DataFrame.

I think I might try to write a custom vector type and then put it in a data frame and see how many functions I can call for it before it becomes a normal vector. Then we can assess to what extent DataFrames can support Dask-like operations just by changing the vector type.
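
A minimal sketch of that experiment might look like the following. `TrackedVector` is a hypothetical name made up for illustration, not something DataFrames.jl provides. It implements only the basic `AbstractVector` methods and deliberately defines no `similar` method, so any operation that allocates its result through Base's default `similar` should come back as a plain `Vector`.

```julia
using DataFrames

# Hypothetical wrapper type (not part of DataFrames.jl): implements only the
# minimal AbstractVector interface and defines no `similar` method.
struct TrackedVector{T} <: AbstractVector{T}
    data::Vector{T}
end

Base.size(v::TrackedVector) = size(v.data)
Base.IndexStyle(::Type{<:TrackedVector}) = IndexLinear()
Base.getindex(v::TrackedVector, i::Int) = v.data[i]
Base.setindex!(v::TrackedVector, x, i::Int) = (v.data[i] = x; v)

v = TrackedVector([1, 2, 3, 4])

# copycols=false asks the constructor to store the column without copying it.
df = DataFrame(x = v, copycols = false)

# Record which operations keep the wrapper type and which do not.
results = [
    "df.x"       => df.x isa TrackedVector,
    "df[!, :x]"  => df[!, :x] isa TrackedVector,
    "df[:, :x]"  => df[:, :x] isa TrackedVector,
    "copy(df).x" => copy(df).x isa TrackedVector,
    "df.x .+ 1"  => (df.x .+ 1) isa TrackedVector,
]

for (op, kept) in results
    println(rpad(op, 12), kept ? "keeps TrackedVector" : "returns another type")
end
```

Extending the `results` list with joins, `groupby`/`combine`, sorting, and so on would give a rough map of where the custom type survives. Defining `Base.similar` for the wrapper is the obvious next variation to try.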

pdeffebach • Nov 30 '20 00:11

This is a great idea; in particular, it would be great to document which functions/methods are expected to work along with how they're used in DataFrames in different operations. Happy to help with this effort.

quinnj • Nov 30 '20 04:11

The first candidates that would break are fast aggregations like combine(gdf, :x => sum). In general, whenever DataFrames.jl "internally" creates a column, it is likely to assume that it is a "standard" vector. Similarly, in many operations DataFrames.jl internally creates Vectors for processing data (see e.g. the GroupedDataFrame struct definition).
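
As a rough illustration, the sketch below stores a non-`Vector` column (a `SubArray` created with `view`, chosen only because it needs no extra packages) and runs the same aggregation twice: once passing `sum` directly, which is the form that can use the optimized aggregation, and once through an anonymous function, which goes through the generic per-group code. The column name `:total` and the sample data are made up for the example.

```julia
using DataFrames

# Store a SubArray as the `x` column; copycols=false keeps it as passed in.
raw = [10, 20, 30, 40, 50]
df = DataFrame(g = [1, 1, 2, 2, 3], x = view(raw, :), copycols = false)
gdf = groupby(df, :g)

# Passing `sum` itself lets DataFrames.jl pick its optimized aggregation.
fast = combine(gdf, :x => sum => :total)

# An anonymous function computes the same values through the generic
# per-group code path.
generic = combine(gdf, :x => (x -> sum(x)) => :total)

println("input column type:  ", typeof(df.x))
println("output column type: ", typeof(fast.total))
println("same values:        ", fast.total == generic.total)
```

In both cases the result column is allocated by DataFrames.jl, so it does not inherit the container type of the input column.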

Having said that, I think it should be doable to add "distributed" support to DataFrames.jl in the long run. However, we would probably need some API that tells DataFrames.jl how the distribution is performed (since with distributed vectors you most likely want to process them in a way that takes this into account).

bkamins • Nov 30 '20 08:11

Yeah, I have no idea how distributed computing works, or threading for that matter. Still, I will put this on the to-do list for winter break / procrastination from school.

pdeffebach • Nov 30 '20 14:11

Somewhat related is whether we preserve the container types of input columns: https://github.com/JuliaData/DataFrames.jl/issues/2569

I don't think DataFrames has very specific requirements for columns: apart from the issue of one-based indexing, which we should investigate if somebody cares, things should work as long as the AbstractArray interface is implemented. It probably won't be fast for distributed arrays, though, since we use for i in eachindex(col) loops a lot.
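
To make the performance concern concrete, here is a small sketch using only Base. `CountingVector` and `loop_sum` are hypothetical names invented for the example. The type implements the read-only part of the `AbstractArray` interface and counts scalar `getindex` calls. On a distributed array, each of those calls could mean fetching a single element from another process.

```julia
# Hypothetical column type that counts scalar reads.
struct CountingVector{T} <: AbstractVector{T}
    data::Vector{T}
    reads::Base.RefValue{Int}
end
CountingVector(data::Vector{T}) where {T} = CountingVector{T}(data, Ref(0))

# Minimal read-only AbstractArray interface: size, IndexStyle, getindex.
Base.size(v::CountingVector) = size(v.data)
Base.IndexStyle(::Type{<:CountingVector}) = IndexLinear()
function Base.getindex(v::CountingVector, i::Int)
    v.reads[] += 1
    return v.data[i]
end

# The kind of loop referred to above: one scalar getindex per element.
function loop_sum(col::AbstractVector)
    total = zero(eltype(col))
    for i in eachindex(col)
        total += col[i]
    end
    return total
end

col = CountingVector(collect(1:1000))
total = loop_sum(col)
println("sum = ", total, ", scalar reads = ", col.reads[])
```

The loop is correct for any one-based `AbstractVector`, but it touches every element individually, which is why a chunk-aware code path would be needed for distributed columns to be fast.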

nalimilan • Dec 01 '20 14:12