Inconsistent parquet read/write times
Users have noticed that Parquet write times can be wildly inconsistent: the same type of write task sometimes takes a few seconds and sometimes several minutes.

User-provided summary of the problem:
I've noticed that when writing large datasets to Parquet, the per-locale files don't all get written in parallel as I'd expect. Instead, a few of the files are created and grow to their full size, then a few more are created and grow, and so on. The overall write operation takes much longer than it would if all the files were written in parallel.

In Arkouda's Chapel code, the `write1DDistStringsAggregators` function (used for writing string columns) has a "gather" step that copies string data from other nodes into local memory before calling the Apache Arrow library to actually write the Parquet file. I don't know Chapel, but I'm guessing that while a process is running non-Chapel code, such as the Arrow library calls, it might not respond to requests from sibling processes until it returns to Chapel code. If so, that creates a race condition: when a node finishes gathering its string data and calls Arrow to write its file, it stops responding to the other nodes that are still gathering strings and need data from it; those nodes end up waiting until this one finishes writing its files (and returns to Chapel) before they can start writing theirs. I could be completely wrong about this, but it's plausible and would explain the behavior I've observed.

This could be resolved by putting a barrier between the gather and write steps, ensuring all nodes have finished gathering remote string data before any Arrow calls occur. (I assume Chapel supports barriers, since they're a standard concurrency primitive.)

I've also noticed a large variation in read times: reading the same large dataset from Parquet is sometimes much slower than other times, e.g. 7 minutes instead of 1. Maybe something similar is happening in the Arrow calls that load from Parquet?
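If that diagnosis is right, the proposed fix maps directly onto Chapel's barrier support. The sketch below is illustrative only, not Arkouda's actual code: `gatherRemoteStrings` and `writeParquetViaArrow` are hypothetical stand-ins for the gather step and the Arrow write call in `write1DDistStringsAggregators`, and it assumes the `barrier` type from Chapel's `Collectives` module (named `Barriers` before Chapel 1.30).

```chapel
use Collectives;  // provides `barrier`; module was named `Barriers` before Chapel 1.30

// Hypothetical stand-ins for the real gather and Arrow-write steps.
proc gatherRemoteStrings()  { writeln(here.id, ": gathering remote strings"); }
proc writeParquetViaArrow() { writeln(here.id, ": writing parquet via Arrow"); }

proc main() {
  var b = new barrier(numLocales);  // one participant per locale
  coforall loc in Locales do on loc {
    gatherRemoteStrings();   // phase 1: copy remote string data into local memory
    b.barrier();             // wait until every locale has finished gathering
    writeParquetViaArrow();  // phase 2: now safe to block inside the Arrow call
  }
}
```

With the barrier in place, no locale can enter the (possibly unresponsive) Arrow call while a sibling still needs string data from it, which would eliminate the staggered file growth described above.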
Do we know if these were dataframes or individual string columns being written?