d3-array

Data wide/long reshape functions

Open mhkeller opened this issue 4 years ago • 13 comments

What are your thoughts on adding data reshape functions similar to the melt and wide_to_long functions in pandas or pivoting, gather and spread in the tidyverse?

It's a very common pattern when loading data for charts, such as in the multiline example. I find myself frequently writing these reshape functions in each project and they're often some of the least literate parts of my code. They're especially distracting when trying to teach people chart concepts and they hit a big speed bump right off the bat.

Anyway, it would be a great addition to the JavaScript world. If there are other packages that have already done this that I missed, let me know. I've seen a few "let's rewrite pandas/dplyr in js" packages over the years but none ever gets completed, let alone maintained. Happy to be wrong, though, if someone has broken off these functions somewhere!

mhkeller avatar Apr 09 '20 00:04 mhkeller

Sounds interesting? I’d love to see a sketch of what these might look like. Perhaps some combination of array.flatMap and d3.group?

mbostock avatar Apr 09 '20 01:04 mbostock

Using tidyr's pivoting examples I started putting together some ideas here: https://github.com/mhkeller/pivoting.

I think the most readable is 2b but that's an older style.

mhkeller avatar Apr 11 '20 03:04 mhkeller

I’ve dropped 1 and 2a into Observable notebooks for easy tinkering:

https://observablehq.com/d/41bc065377cb7e36 https://observablehq.com/d/7021f34babd6fbf6

mbostock avatar Apr 11 '20 04:04 mbostock

If I were to write the relig_income example in vanilla JavaScript, I’d probably use array.flatMap like so:

data.columns.slice(1).flatMap(income => data.map(({religion, [income]: count}) => ({religion, income, count})))

Here’s another take on your first pivot function:

function pivot(data, columns, name, value) {
  const keep = data.columns.filter(c => !columns.includes(c));
  return data.flatMap(d => {
    const base = keep.map(k => [k, d[k]]);
    return columns.map(c => {
      return Object.fromEntries([
        ...base,
        [name, c],
        [value, d[c]]
      ]);
    });
  });
}
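
One caveat with the function above: `data.columns` is only present on arrays returned by the d3-dsv parsers (d3.csvParse and friends). A minimal sketch of a variant that also accepts plain arrays of objects might look like the following. The name `pivotLonger`, the fallback to the first row's keys, and the sample rows are all assumptions for illustration, not part of d3.

```javascript
// Hypothetical helper name; a variant of the pivot function above.
function pivotLonger(data, columns, name, value) {
  // d3-dsv attaches `columns` to parsed arrays; plain arrays lack it,
  // so fall back to the keys of the first row (assumes uniform rows).
  const all = data.columns ?? Object.keys(data[0] ?? {});
  const keep = all.filter(c => !columns.includes(c));
  return data.flatMap(d =>
    columns.map(c => ({
      ...Object.fromEntries(keep.map(k => [k, d[k]])),
      [name]: c,
      [value]: d[c]
    }))
  );
}

// Made-up sample rows, not the real relig_income data.
const wideSample = [
  {religion: "A", "<$10k": 27, "$10-20k": 34},
  {religion: "B", "<$10k": 12, "$10-20k": 27}
];

// One output row per (input row, pivoted column) pair.
const longSample = pivotLonger(wideSample, ["<$10k", "$10-20k"], "income", "count");
console.log(longSample);
```

Each of the two input rows yields two output rows here, with the kept `religion` key first, followed by `income` and `count`.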

I haven’t evaluated the performance of any approach yet.

https://observablehq.com/d/3ea8d446f5ba96fe

Not directly related to this issue, but I’m also interested in making columnar data easier to use in JavaScript, since that should offer better performance. A column-oriented data structure is typically what I think of as a “data frame”.

mbostock avatar Apr 11 '20 04:04 mbostock

Very neat destructuring in the vanilla JS example. The question that came up while working through number two was: what's the best API for handling transform arguments? What I did with nested arrays felt a bit unwieldy, and I hadn't yet gotten to implementing all of the features, such as names_pattern.

An alternative would be to limit the scope of this function and say it doesn't handle column name cleaning or casting (although I could see something like names_pattern being useful). The full workflow for someone doing the multiline example would then be something like:

  1. pivot the raw data
  2. clean with a forEach, map or filter
  3. group or rollup
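
The three steps above could be sketched roughly as follows. The sample rows and field names are made up, and a plain Map stands in for d3.group so the snippet runs without d3; it is an illustration of the workflow, not a proposed API.

```javascript
// Made-up wide rows with string values, as a CSV parser would produce.
const rawRows = [
  {date: "2020-01", apples: "3", pears: "5"},
  {date: "2020-02", apples: "4", pears: ""}
];
const seriesNames = ["apples", "pears"];

// 1. Pivot the raw data from wide to long.
const pivoted = seriesNames.flatMap(name =>
  rawRows.map(({date, [name]: value}) => ({date, name, value}))
);

// 2. Clean: drop empty cells, then cast strings to numbers.
const cleaned = pivoted
  .filter(d => d.value !== "")
  .map(d => ({...d, value: Number(d.value)}));

// 3. Group by series name; a plain Map stands in for d3.group here.
const grouped = new Map();
for (const d of cleaned) {
  if (!grouped.has(d.name)) grouped.set(d.name, []);
  grouped.get(d.name).push(d);
}

console.log(grouped);
```

Each step is a separate pass over the data, which is the trade-off in question: simpler pieces at the cost of iterating more than once.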

For large datasets, maybe going through the data multiple times is a pain, performance-wise? For the casual user, it can be nice to have just one data transformation step, for sure.

I think my preference would be that if there's a manageable API, it would be handy to do these transformations within pivot, but not at the cost of getting lost in the arguments.

mhkeller avatar Apr 12 '20 01:04 mhkeller

I made pivot 1 as a generator for your amusement https://observablehq.com/d/ac2a320cf2b0adc4
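
For readers who cannot open the notebook, a generator-based pivot could look something like the sketch below. This is not necessarily what the linked notebook does; the name `pivotRows` and the sample row are assumptions for illustration.

```javascript
// Hypothetical name; yields one long row at a time instead of building an array.
function* pivotRows(data, columns, name, value) {
  for (const d of data) {
    // Copy the row, then delete the pivoted columns to keep the rest.
    const base = {...d};
    for (const c of columns) delete base[c];
    for (const c of columns) {
      yield {...base, [name]: c, [value]: d[c]};
    }
  }
}

// Made-up sample row.
const wideRows = [{religion: "A", low: 27, high: 34}];

// Consumers can pull a single row lazily...
const firstRow = pivotRows(wideRows, ["low", "high"], "income", "count").next().value;
// ...or spread the generator into an array.
const allRows = [...pivotRows(wideRows, ["low", "high"], "income", "count")];
console.log(firstRow, allRows.length);
```

The lazy form avoids materializing the whole long table when only part of it is consumed.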

Fil avatar Jun 24 '20 10:06 Fil

See also https://observablehq.com/@tomfevrier/kiwis

Fil avatar Jul 17 '20 07:07 Fil

An example here: https://observablehq.com/@didoesdigital/16-july-2020-data-wrangling-for-population-pyramids. In that case the "best" strategy, it seems, is a flatMap: https://observablehq.com/@didoesdigital/16-july-2020-data-wrangling-for-population-pyramids#pyramid

Fil avatar Jul 17 '20 13:07 Fil

> If I were to write the relig_income example in vanilla JavaScript, I’d probably use array.flatMap like so:
>
> data.columns.slice(1).flatMap(income => data.map(({religion, [income]: count}) => ({religion, income, count})))

Regarding the inverse operation (long to wide), is there a more elegant alternative to using d3.groups, array.map and array.reduce?

d3.groups(data, d => d.religion)
  .map(([religion, x]) => {
    return {
      religion,
      ...x.reduce((acc, { income, count }) => {
        acc[income] = count;
        return acc;
      }, {}),
    };
  });

https://observablehq.com/d/faa7e77aa71c7031

nachocab avatar Jan 17 '21 12:01 nachocab

@nachocab Can you enable link sharing on the notebook so we can see?

mbostock avatar Jan 17 '21 18:01 mbostock

Here’s another take on the inverse operation, replacing array.map with Array.from, and replacing array.reduce with Object.fromEntries:

Array.from(
  d3.group(data, d => d.religion),
  ([religion, group]) => Object.fromEntries(
    [["religion", religion]].concat(
      group.map(d => [d.income, d.count])
    )
  )
)
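
The same long-to-wide idea can be tried without d3 using a plain Map, as in the sketch below. The helper name `pivotWider`, its parameters, and the sample rows are assumptions for illustration only.

```javascript
// Hypothetical helper: long to wide, keyed on a single id column.
function pivotWider(data, id, name, value) {
  const rows = new Map();
  for (const d of data) {
    // Start a new wide row the first time each id value is seen.
    if (!rows.has(d[id])) rows.set(d[id], {[id]: d[id]});
    // Spread the name/value pair out into its own column.
    rows.get(d[id])[d[name]] = d[value];
  }
  return [...rows.values()];
}

// Made-up long rows; note that "B" has no "high" entry.
const longRows = [
  {religion: "A", income: "low", count: 27},
  {religion: "A", income: "high", count: 34},
  {religion: "B", income: "low", count: 12}
];

const wideAgain = pivotWider(longRows, "religion", "income", "count");
console.log(wideAgain);
```

Combinations missing from the long data are simply absent from the wide row rather than filled with a default, so the row for "B" has no `high` key.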

mbostock avatar Jan 17 '21 18:01 mbostock

@mbostock That's beautiful! Thank you for helping me understand those functions more deeply and for pointing out the link sharing bit. I'll remember it for next time. 👍

nachocab avatar Jan 17 '21 19:01 nachocab

Just moved to data visualisation and realised I'm a noob with respect to data manipulation... My conversion from SQLite-based normalised long data to wide was, uhhh... less than optimal (to put it mildly) :-(

Built-in long/wide reshape functions would be very welcome.

Btw. thanks for this incredible library!

pamtbaau avatar Sep 19 '21 15:09 pamtbaau