datatable icon indicating copy to clipboard operation
datatable copied to clipboard

Feature Request : Reshape Data within DataTable

Open samukweku opened this issue 5 years ago • 24 comments

  • convert data from wide to long, and vice versa (similar to melt/dcast in rdatatable or melt/pivot.pd.wide_to_long in pandas)

Example :

df = dt.Frame({"A":['a','b','c'], "B":[1,3,5],"C":[2,4,6]})

Transform from wide to long :

# new shape:

A variable value 0 a B 1 1 b B 3 2 c B 5 3 a C 2 4 b C 4 5 c C 6

Transform from long to wide : A B C 0 a 1 2 1 b 3 4 2 c 5 6

It would also be beneficial if there was flexibility in the reshaping, something similar to patterns in rdatatable that allows for regular expressions.

I currently hardcode the reshaping process, by building a list of dataframes, then rbind or cbind as the case may be.

Note : rbind does not accept a generator expression, only lists. dt.rbind([df[:,f[0].extend({"var":name, "val":f[name]})] for name in df.names[1:]])

Thanks!

samukweku avatar Jun 27 '20 05:06 samukweku

I am working on this issue, kindly assign it to me, will be creating PR in few days.

aniket328 avatar Mar 13 '21 12:03 aniket328

I am working with R. I want to contribute to this task.

danijak avatar Mar 14 '21 10:03 danijak

@aniket328 I've assigned this issue to you.

@danijak Please coordinate with @aniket328 who seems to work on this issue already.

Thanks you both for your contributions!

oleksiyskononenko avatar Mar 16 '21 18:03 oleksiyskononenko

Sure @danijak, let's collaborate. :)

aniket328 avatar Mar 16 '21 21:03 aniket328

Hey @aniket328, @danijak -- are you coming from GSoC by any chance?

st-pasha avatar Mar 16 '21 23:03 st-pasha

@st-pasha, I am GSoc'2021 Aspirant. I am looking to contribute in this issue

danijak avatar Mar 16 '21 23:03 danijak

Yes @st-pasha , I will be submitting a proposal for working in the R-Project for GSoC 2021

aniket328 avatar Mar 16 '21 23:03 aniket328

@aniket328 I've assigned this issue to you.

@danijak Please coordinate with @aniket328 who seems to work on this issue already. Thanks you both for your contributions!

Thanks @oleksiyskononenko, I will check other issues as well. If there are any urgent issues to be resolved please do tell me.

danijak avatar Mar 16 '21 23:03 danijak

Hey @st-pasha @samukweku @oleksiyskononenko I would love to work on this issue for the summer of 2021. I've experience with the data.table R package and have myself used melt extensively. This is the link to the easy tests demonstrating my familiarity with the data.table R package.

theadityasam avatar Apr 12 '21 19:04 theadityasam

Dear @theadityasam, it seems that there are already two people working on it — see comments above.

oleksiyskononenko avatar Apr 20 '21 00:04 oleksiyskononenko

Will try my hands on the issue in my own fork meanwhile since I don't see any development yet.

theadityasam avatar Apr 26 '21 20:04 theadityasam

@theadityasam you're welcome. I don't see any developments either.

oleksiyskononenko avatar Apr 26 '21 20:04 oleksiyskononenko

Hi @danijak, @aniket328 -- please let us know if you guys still working on this project?

st-pasha avatar Apr 26 '21 21:04 st-pasha

@st-pasha I am not working on this project for now.

aniket328 avatar Apr 27 '21 16:04 aniket328

Since no one is working on this, I'll be working on implementing the reshape functionalities in the coming weeks.

theadityasam avatar May 09 '21 09:05 theadityasam

Hey everyone, in what order should the melt be performed? In this discussion, @st-pasha had mentioned that the R data.table takes the column-wise approach. Should that be followed over here too?

theadityasam avatar May 09 '21 17:05 theadityasam

Hi @theadityasam. Since there is no clearly preferred alternative there, the best approach is to have a parameter in melt() parameter that would control whether to do it one way or the other.

The other question then becomes what should be the default way for that parameter. As mentioned in #2677, different packages have implemented it differently, and so we can't really rely on their wisdom here. Surely, the fact that both pandas and R data.table have implemented it column-wise is a strong indicator, but the fact that more mature and professional packages such as SQL, SAS, or Stata prefer row-wise approach is also a strong indicator.

So, going forward we can do two things: 1) find the issue/PR in pandas/data.table where those features were implemented, and see whether there was a thorough discussion of the merits of different approaches; 2) ask for an opinion on relevant discussion forums, such as StackOverflow, Reddit, etc.

st-pasha avatar May 10 '21 02:05 st-pasha

Understood, the defaults can be discussed later on. Quick question, I've read the docs and came across "FExpr". In R, the transfer of data from R to C++ and vice-versa is handled through SEXP pointers. Is FExpr analogous to that? From what I understood after reading the API reference, for creating the melt feature, I need to inherit FExpre base class in the newly created melt class, add the melt code and finally and FExpr type will be returned? All the conversions between Python and C++ datatypes will be handled by this mechanism? Also, @st-pasha will you be available to mentor for this project for GSOC'21? It would be really helpful to work on this project over a rigid timeline under the guidance of mentors.

theadityasam avatar May 10 '21 16:05 theadityasam

AFAIK, R's SEXP corresponds to python's PyObject* (which we wrap into a py::oobj for automatic reference counting and more convenient API). The FExpr class, however, is more specialized: it serves as a stored expression tree. In R the arguments to functions are always passed as unevaluated expression trees, so there is no need to disambiguate.

When it comes to returning values to python, we need to wrap C++ classes/primitives into the appropriate wrapper objects. For example, in order to return an integer, we'd write return py::oint(1), a string: return py::ostring("ok"), etc. (The "o" in the name indicates that the object owns its pointer). Returning a Frame works the same: return py::Frame::oframe(new DataTable(...)) (note that C++ equivalent to python's Frame is called DataTable, for historical reasons).

st-pasha avatar May 11 '21 19:05 st-pasha

If you ever get around to implementing wide-to-long data reshape function (melt) then it would be great if you could support something like the new measure() function which was recently merged into R data.table, https://github.com/Rdatatable/data.table/pull/4731

> DT <- data.table(id=1, a_1=10, b_2=21, a_2=20)
> melt(DT, measure.vars=measure(letter, number, pattern="(.*)_(.*)"))
      id letter number value
   <num> <char> <char> <num>
1:     1      a      1    10
2:     1      b      2    21
3:     1      a      2    20
> melt(DT, measure.vars=measure(letter, number=as.integer, pattern="(.*)_(.*)"))
      id letter number value
   <num> <char>  <int> <num>
1:     1      a      1    10
2:     1      b      2    21
3:     1      a      2    20
> melt(DT, measure.vars=measure(value.name, number=as.integer, pattern="(.*)_(.*)"))
      id number     a     b
   <num>  <int> <num> <num>
1:     1      1    10    NA
2:     1      2    20    21

There seems to be no equivalent in pandas, so implementing this feature could give datatable an advantage. The closest analog is pandas.wide_to_long but it is not as flexible, see https://pandas.pydata.org/docs/user_guide/reshaping.html#reshaping-by-melt

tdhock avatar May 28 '21 14:05 tdhock

@tdhock not in pandas but there is sth similar in pyjanitor which builds off ideas in R's pivot_longer and data.table.

However, I would like to see the implementation that @st-pasha has in mind, where melt serves as the building block for more transformations.

At the moment, datatable has sparse support for string operations ... I'm not sure how patterns will be implemented.

Waiting patiently to see the implementation, with arguments that python users can grok easily

samukweku avatar May 28 '21 14:05 samukweku

Hey @tdhock @samukweku @st-pasha I'll try to implement this function however I would require some guidance on how things work in datatable. Can we schedule a 30 min call wherein someone can help me with the workings of the python package, some sort of a knowledge transfer? That would make it incredibly simple for me and I would be able to work on it faster.

theadityasam avatar May 30 '21 19:05 theadityasam