Is `explode` working properly?

Open jpivarski opened this issue 1 year ago • 16 comments

I'm forwarding this Gitter question from @gozmit97 to make sure that it doesn't scroll away without getting answered. It sounds like it could be an issue.


I'm trying to expand the arrays in the columns of my dataframe, in the style of this stackoverflow page, using explode. I'm using sample code like the following (converted to Awkward arrays to match my data):

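import awkward as ak
import numpy as np
import pandas as pd

# Build a small dataframe whose cells each hold a short array
df = pd.DataFrame.from_records(
    data=ak.Array(np.random.random((2, 3, 2)).round(3)),
).add_prefix("column")

print(df)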

I'm able to explode all my columns using [df[col].explode(ignore_index=True) for col in df] as wanted. However, running it on my own data seems to do nothing. After a bit of snooping around, the only noticeable difference I've been able to find between the series in the sample code above (say, column2) and the series in my data is that my data has dtype=awkward and the sample code has dtype=object. (See below for samples: the first is from the sample code and the second is from my own data, with the values changed.)

0    [0.675, 0.485]
1    [0.317, 0.865]
Name: column2, dtype: object

0  [123.456, 789.000]
Name: A, dtype: awkward

If you have any ideas as to what may be happening here or suggestions on how I might hope to turn each of my rows with arrays into one row for each element in that array in another manner, do let me know. pd.explode worked prior but after changing how I saved and loaded my data it stopped working.
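For reference, the row-per-element reshaping being asked for can be sketched in plain Python, independent of pandas or Awkward. The helper name `explode_rows` is hypothetical and only illustrates the intended result; each element keeps the index of the row it came from, and an empty list is represented here as a single `None`, standing in for the missing value pandas documents for that case.

```python
from typing import Any, Iterable


def explode_rows(rows: Iterable[Any]) -> list[tuple[int, Any]]:
    """Return (original_row_index, element) pairs, one per list element.

    Hypothetical helper for illustration; not part of pandas or awkward.
    """
    exploded = []
    for index, row in enumerate(rows):
        if isinstance(row, (list, tuple)):
            if len(row) == 0:
                # An empty list still produces one row, holding a missing value.
                exploded.append((index, None))
            else:
                exploded.extend((index, item) for item in row)
        else:
            # Scalars pass through unchanged.
            exploded.append((index, row))
    return exploded


column2 = [[0.675, 0.485], [0.317, 0.865]]
for index, value in explode_rows(column2):
    print(index, value)
```

With `ignore_index=True`, as in the list comprehension above, the repeated row index would simply be replaced by a fresh 0..n-1 range.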

jpivarski avatar Aug 18 '23 17:08 jpivarski

Looks like support for steering Series.explode from an extension array was only recently added to pandas! https://github.com/pandas-dev/pandas/pull/53602 Even then it looks like it's not extensively documented and was only added to Arrow types and not the extension array interface in general 🤔

In the latest released version of pandas (2.0.3), the Series.explode method just returns a copy of the series if is_object_dtype(s) returns False (which is the case when s.dtype == "awkward", since that is not object dtype).
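The check described above can be modelled in a few lines of plain Python. This is a simplified paraphrase of the behaviour described in this comment, not the pandas source; `ToySeries` and its string `dtype` field are hypothetical stand-ins used only to show why the call silently does nothing for a non-object dtype.

```python
import copy
from dataclasses import dataclass


@dataclass
class ToySeries:
    """Hypothetical stand-in for a pandas Series: just values and a dtype name."""

    values: list
    dtype: str

    def explode(self) -> "ToySeries":
        # Models the gate described above: anything that is not object
        # dtype is returned as an unchanged copy.
        if self.dtype != "object":
            return copy.deepcopy(self)
        flat = []
        for row in self.values:
            flat.extend(row if isinstance(row, list) else [row])
        return ToySeries(flat, "object")


as_object = ToySeries([[0.675, 0.485], [0.317, 0.865]], "object")
as_awkward = ToySeries([[123.456, 789.0]], "awkward")

print(as_object.explode().values)   # flattened: one entry per element
print(as_awkward.explode().values)  # still nested: the call was a no-op
```

No error is raised in the second case, which matches the "seems to do nothing" symptom in the original question.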

We can add the necessary method (_explode) to our extension array, but Series.explode won't dispatch to it. A workaround would be to convert the awkward type to an arrow type and then do the explode. (As of right now, this would only work if you were working from the HEAD of the pandas repository.)

Finally, we can also submit a patch upstream to pandas to see if we can get Series.explode to support any extension type.

douglasdavis avatar Aug 18 '23 18:08 douglasdavis

An update here: opened https://github.com/pandas-dev/pandas/pull/54834 which has potential to go into pandas 2.2.0

douglasdavis avatar Aug 31 '23 15:08 douglasdavis

Hi @douglasdavis thanks for this. I've run up against the same problem today, with a ROOT dataset imported using uproot. I see:

df['x'] ... Name: x, Length: 100, dtype: awkward

Can you clarify the process to 'convert to an arrow type and then do the explode'? Only some of my columns are awkward arrays...

CrfzdPQM6 avatar Apr 06 '24 14:04 CrfzdPQM6

Here's a trivial example. It starts with a simple root dataset (root -q mymacro.C)

void mymacro() {
  TFile *f = new TFile("myfile.root","recreate");
  TTree *t = new TTree("mytree", "mytree");
  std::vector<float> v_pt({0.1,0.2,0.3});

  auto branch = t->Branch("pt", &v_pt);

  for (int i =0 ; i < 3; i++)
  {
    v_pt.clear();
    v_pt.push_back(0.1);
    v_pt.push_back(0.2);
    v_pt.push_back(0.3);
    t->Fill();
  }
  t->Write();
  f->Close();
}

then try using uproot to read the thing back: [screenshot]

CrfzdPQM6 avatar Apr 08 '24 17:04 CrfzdPQM6

Is there any easy way to make this work? Thanks so much in advance!

CrfzdPQM6 avatar Apr 08 '24 17:04 CrfzdPQM6

I should say, I'm currently on pandas 2.0.0

CrfzdPQM6 avatar Apr 08 '24 17:04 CrfzdPQM6

Can you perhaps phrase this in terms of awkward and pandas alone, so that we can make a simple test function of expected functionality?

Is https://github.com/intake/awkward-pandas/pull/46 perhaps exactly what you need?

martindurant avatar Apr 08 '24 17:04 martindurant

I was just going to add that I see exactly the same behaviour with pandas 2.2.1. I'll have a crack at rephrasing this without ROOT: [screenshot]

CrfzdPQM6 avatar Apr 08 '24 17:04 CrfzdPQM6

@martindurant https://github.com/intake/awkward-pandas/pull/46 looks very relevant, but I'm not really sure how to apply it to my dataframe to enable the explosion

CrfzdPQM6 avatar Apr 08 '24 17:04 CrfzdPQM6

You would need to install from that branch, and it should "just work". At least, I think - from the screenshots, I'm not certain what the expected output would be.

martindurant avatar Apr 08 '24 17:04 martindurant

Interesting. This works fine (with pandas 2.2.1) if I create the dataframe of awkward arrays myself: [screenshot]

CrfzdPQM6 avatar Apr 08 '24 17:04 CrfzdPQM6

But the difference is that pt now gets a dtype object instead of awkward

CrfzdPQM6 avatar Apr 08 '24 17:04 CrfzdPQM6

Given that the output is numbers, its type should probably be int/float. I suppose type awkward would be OK too and maybe the easiest to apply consistently. Can you please include your snippet in the PR, we'll make it a test and make sure we fix it.

martindurant avatar Apr 08 '24 17:04 martindurant

Very sorry to ask a potentially dumb question, @martindurant, but how do I install from that branch? Would that be something like:

pip install git+https://github.com/douglasdavis/awkward-pandas/tree/dev-explode

?

CrfzdPQM6 avatar Apr 08 '24 17:04 CrfzdPQM6

OK, I figured out how to do it (pip install git+https://github.com/douglasdavis/awkward-pandas@dev-explode), and here is the result! [screenshot]

CrfzdPQM6 avatar Apr 08 '24 17:04 CrfzdPQM6

Thanks a lot for the pointers, @martindurant !!!

CrfzdPQM6 avatar Apr 08 '24 17:04 CrfzdPQM6