
Performance issues with concatenate() on large datasets with multiple files


Hi there. I am attempting to pull all branches of a tree in a ROOT dataset of about 20 files, ~6 GB total, into a Pandas DataFrame. The machine I am using has 64 GB of memory, so I assumed it would handle this without a problem. However, I am about 30 minutes into the concatenate command and currently sitting at around ~55 GB of memory used. Do you have any advice for making Uproot more performant when handling large datasets? I am using a chunksize of 1024.
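
Roughly, the call looks like this (the file glob and tree name here are placeholders):

```python
import uproot

# Placeholder file glob and tree name; the real dataset is ~20 files, ~6 GB total.
df = uproot.concatenate("data/*.root:tree", library="pd")
```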

Platform: Arch Linux. Python version: 3.9.7. Uproot version: 4.1.8.

Edit: after around 45 minutes the script exited with SIGKILL (signal 9), so I assume it hit the memory limit and was killed.

KeanuGh · Nov 20 '21

We've struggled with the performance of Pandas conversions before. If the data types are not numeric or singly jagged-numeric, then just the interpretation before getting to Pandas is going to be slow (until we start using AwkwardForth next summer). To see whether it's the interpretation or the Pandas conversion, try reading the data into Awkward Arrays (i.e., without the library option). It's probably not the concatenation, but you can test that, too, by using iterate instead of concatenate. If it's the Pandas conversion, Awkward Array has an ak.to_pandas function that is independent of Uproot's, and it might be faster. (Uproot's can't assume the existence of Awkward Array, which constrains it more.)
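
A minimal sketch of those three tests, assuming Uproot 4 with Awkward Array 1 and hypothetical file paths and tree name:

```python
import uproot
import awkward as ak

files = ["data1.root:tree", "data2.root:tree"]  # hypothetical paths

# Test 1: read into Awkward Arrays (the default), skipping Pandas entirely.
array = uproot.concatenate(files)

# Test 2: swap concatenate for iterate to rule out the concatenation step.
for batch in uproot.iterate(files, step_size="100 MB"):
    pass  # each batch is an Awkward Array

# Test 3: if reading is fast, time the Pandas conversion on its own.
df = ak.to_pandas(array)
```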

I hope these suggestions help! Oh, and using a lot of memory on a machine that has a lot of memory is not unusual: Python won't clean up intermediate objects until it has to, so the memory in use doesn't represent the data that would remain after a garbage-collection pass. However, the fact that there are so many intermediate objects and it's taking this long suggests to me that O(n) Python objects are being created (where n is the length of the array), which only happens if the data type is not one of the "numeric or singly jagged-numeric" types that can be cast from the way ROOT files are structured without intermediate Python objects. The most immediate question is: what's the data type?
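
One way to check the branch types, as a sketch with a hypothetical file and tree name:

```python
import uproot

with uproot.open("data1.root") as f:  # hypothetical file name
    tree = f["tree"]
    tree.show()              # prints each branch's C++ typename and interpretation
    print(tree.typenames())  # the same information as a dict of branch name -> typename
```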

jpivarski · Nov 20 '21

Thank you for the advice! Going from an Awkward Array to Pandas is actually a good 20% faster when not importing all branches. Beyond that, I found that a few of the branches are actually vectors, which can be read into Awkward Arrays without an issue but hit the memory limit when read directly into Pandas. Omitting those branches let me read all the others in with no problem.
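
A sketch of one way to omit such branches, assuming Uproot 4's filter_typename argument and hypothetical file paths:

```python
import uproot

# Skip any branch whose C++ type is a std::vector, reading the rest into Pandas.
df = uproot.concatenate(
    "data*.root:tree",  # hypothetical file glob and tree name
    filter_typename=lambda t: not t.startswith("std::vector"),
    library="pd",
)
```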

KeanuGh · Nov 21 '21

In Pandas, the vectors have to become Python lists, because a Pandas cell has to be something with a NumPy-like dtype (the set of Pandas dtypes is a little broader than NumPy's, but it only adds things like decimal number formats), and that doesn't include nested lists, unless you go to dtype="O", arbitrary Python objects.
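
A minimal illustration of that fallback: a column of lists ends up with object dtype, one Python list per row.

```python
import pandas as pd

# A jagged column can't be a packed numeric buffer, so Pandas stores
# one Python list object per cell, with dtype "O" (object).
df = pd.DataFrame({"vec": [[1, 2], [3], [4, 5, 6]]})
print(df["vec"].dtype)  # object
```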

If the branches have the same jagged structure as each other, the numerical values are placed in the cells and the jagged structure is represented by a MultiIndex. Awkward Array's to_pandas can handle arbitrarily many levels of nesting, but I don't remember whether Uproot's can. More than one level deep is stored in a completely different way in ROOT, a way that isn't favorable to accessing as a column. (That's what "AwkwardForth" is intended to address: it's a fast mini-language for iterating over non-columnar data and turning it into columns. We're preparing for a major project next summer to switch all of Uproot's non-columnar handling to AwkwardForth.)
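
A small sketch of that MultiIndex representation with Awkward Array 1's ak.to_pandas (the array here is made up for illustration):

```python
import awkward as ak

# One level of jaggedness: each entry holds a variable-length list.
jagged = ak.Array([{"x": [1.1, 2.2]}, {"x": [3.3]}, {"x": []}])

# The numeric values land in an ordinary column; the list structure
# becomes an (entry, subentry) MultiIndex on the rows.
df = ak.to_pandas(jagged)
print(df)
```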

jpivarski · Nov 22 '21

Re-reading this two years later, it looks like it's a solved problem, so I'm closing it. Thanks for reporting!

jpivarski · Oct 05 '23