Tabular data container (data frames)

Open burner opened this issue 6 years ago • 5 comments

Pandas, R, and Julia have made data frames very popular. As D is attracting more interest from data scientists (e.g. at eBay or AdRoll), it would be very beneficial to use one language for the entire data analysis pipeline, especially since D, in contrast to popular languages like Python, R, or Julia, is compiled to native machine code and, when built with LDC, is optimized by the sophisticated LLVM backend.

Minimum requirements:

  • conversion to and from CSV
  • multi-indexing
  • column binary operations, e.g. `column1 * column2`
  • group-by on an arbitrary number of columns
  • column/group aggregations
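
To make the list above concrete, here is a minimal sketch of most of these operations using only Phobos. It is not a proposed design: the `Sale` struct and the functions `fromCsv`, `revenue`, `sumBy`, and `toCsv` are hypothetical names invented for illustration, the column layout is fixed at compile time, and multi-indexing is not covered.

```d
import std.algorithm : map;
import std.array : array;
import std.csv : csvReader;
import std.format : format;

// Hypothetical row layout, used only for this sketch.
struct Sale
{
    string region;
    double price;
    double quantity;
}

// CSV -> rows. The named columns are assigned to the struct fields in order.
Sale[] fromCsv(string text)
{
    return text.csvReader!Sale(["region", "price", "quantity"]).array;
}

// Column binary operation: element-wise price * quantity.
double[] revenue(Sale[] rows)
{
    auto price = rows.map!(r => r.price).array;
    auto quantity = rows.map!(r => r.quantity).array;
    auto result = new double[rows.length];
    result[] = price[] * quantity[];
    return result;
}

// Group-by on one column with a sum aggregation.
double[string] sumBy(Sale[] rows, double[] values)
{
    double[string] totals;
    foreach (i, row; rows)
        totals[row.region] = totals.get(row.region, 0.0) + values[i];
    return totals;
}

// Rows -> CSV. No quoting or escaping, and %s is not a lossless
// way to print doubles, so this is only good enough for a demo.
string toCsv(Sale[] rows)
{
    string text = "region,price,quantity";
    foreach (row; rows)
        text ~= format("\n%s,%s,%s", row.region, row.price, row.quantity);
    return text;
}

unittest
{
    auto rows = fromCsv("region,price,quantity\nnorth,2.0,3\nsouth,1.5,4\nnorth,4.0,1");
    auto rev = revenue(rows);
    assert(rev == [6.0, 6.0, 4.0]);

    auto totals = sumBy(rows, rev);
    assert(totals["north"] == 10.0 && totals["south"] == 6.0);

    assert(fromCsv(toCsv(rows)) == rows);
}
```

The `unittest` block runs with `dmd -unittest -main`. Grouping on several columns could use a tuple of the key columns as the associative array key, but a real data frame would need runtime column types rather than a fixed struct.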

burner avatar May 11 '19 12:05 burner

This is being worked on by Prateek Nayak during GSoC 2019.

burner avatar May 11 '19 12:05 burner

CC @Kriyszig

wilzbach avatar May 12 '19 11:05 wilzbach

Yes, I will be working on this project.

So far I have contacted the mentors and am exploring ndslice in mir-algorithm, while also looking into displaying the dataframe on the terminal with properly aligned columns. I'm a bit tight on time until this weekend because of final examinations, but after that I'll be working at full capacity on the project. We still need to discuss the structure of the index used to represent multi-indexed dataframes, after which I'll move on to parsing CSV files into dataframes. At that point the dataframes will support adding multi-indexed data, parsing from files, and writing to CSV. Next I will deal with element access and column binary ops.

I'm mostly looking at Pandas and its implementation of dataframes, mainly because I have worked quite extensively with Python in the past. I'll update the issue with any and all progress made on the dataframe project.
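
As a rough illustration of the aligned terminal display mentioned above, the sketch below pads every cell to the width of the widest cell in its column. The function name `render` and the `string[][]` table representation are assumptions made for this example, not part of the project; it also assumes a non-empty rectangular table with ASCII cells.

```d
import std.algorithm : max;
import std.array : join;
import std.string : leftJustify, stripRight;

// Hypothetical helper: lays out a table of strings (first row is the
// header) with left-aligned columns separated by two spaces.
string render(string[][] table)
{
    // Widest cell per column.
    auto widths = new size_t[table[0].length];
    foreach (row; table)
        foreach (i, cell; row)
            widths[i] = max(widths[i], cell.length);

    string[] lines;
    foreach (row; table)
    {
        string[] cells;
        foreach (i, cell; row)
            cells ~= cell.leftJustify(widths[i]);
        // Drop the padding after the last column.
        lines ~= cells.join("  ").stripRight;
    }
    return lines.join("\n");
}

unittest
{
    auto text = render([["id", "name"], ["100", "x"]]);
    assert(text == "id   name\n100  x");
}
```

Numeric columns would usually be right-aligned instead, which `std.string.rightJustify` could handle in the same loop.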

Kriyszig avatar May 12 '19 11:05 Kriyszig

Interop with pandas via JSON and msgpack might be quite helpful. I have written a streaming msgpack decoder (using msgpack-d) to work with our own simple data frame implementation, and there is some old code for reading and writing to hdf5 too.
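
For the JSON route, one possible shape is the array-of-objects layout that pandas emits with `DataFrame.to_json(orient="records")`. The sketch below reads that layout into per-column arrays with `std.json`. The function name `columnsFromRecords` is hypothetical, all values are assumed to be numeric, and it assumes a Phobos version that provides `JSONValue.get`.

```d
import std.json : parseJSON;

// Hypothetical helper: turns records-oriented JSON such as
//   [{"a": 1, "b": 2.5}, {"a": 3, "b": 4.5}]
// into one double[] per column name. Non-numeric values throw.
double[][string] columnsFromRecords(string text)
{
    double[][string] columns;
    foreach (record; parseJSON(text).array)
        foreach (name, value; record.object)
            columns[name] ~= value.get!double;
    return columns;
}

unittest
{
    auto columns = columnsFromRecords(`[{"a": 1, "b": 2.5}, {"a": 3, "b": 4.5}]`);
    assert(columns["a"] == [1.0, 3.0]);
    assert(columns["b"] == [2.5, 4.5]);
}
```

This loads the whole document into memory, so it is not a substitute for the streaming decoder described above.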

Laeeth avatar May 25 '19 02:05 Laeeth

Initial support for dataframe has been added to mir-algorithm. Only allocation and labels access for now.

@safe pure unittest
{
    import mir.ndslice.slice;
    import mir.ndslice.allocation: slice;

    import std.datetime.date;

    auto dataframe = slice!(double, Date, string)(4, 3);
    assert(dataframe.length == 4);
    assert(dataframe.length!1 == 3);
    assert(dataframe.elementCount == 4 * 3);

    static assert(is(typeof(dataframe) ==
        Slice!(double*, 2, Contiguous, Date*, string*)));

    // Dataframe labels are contiguous 1-dimensional slices.

    // Fill row labels
    dataframe.label[] = [
        Date(2019, 1, 24),
        Date(2019, 2, 2),
        Date(2019, 2, 4),
        Date(2019, 2, 5),
    ];

    assert(dataframe.label!0[2] == Date(2019, 2, 4));

    // Fill column labels
    dataframe.label!1[] = ["income", "outcome", "balance"];

    assert(dataframe.label!1[2] == "balance");

    // Change label element
    dataframe.label!1[2] = "total";
    assert(dataframe.label!1[2] == "total");

    // Attach a newly allocated label
    dataframe.label!1 = ["Income", "Outcome", "Balance"].sliced;

    assert(dataframe.label!1[2] == "Balance");
}

9il avatar May 26 '19 01:05 9il