
Update I/O for parquet/etc. using pyarrow

Open sidneymau opened this issue 2 months ago • 1 comment

I implemented a Reader that leverages pyarrow datasets. There are a few benefits to this:

  • More direct handling of parquet files than with pandas
  • Potential support for files in Arrow, CSV, JSON, and ORC formats in addition to parquet (I didn't test this, but it should work in principle)
  • Support for "files" that are actually directories full of many parquet files but which you want to treat as one large data source, plus support for directory partitioning (e.g., if there is a directory per healpixel, the directory structure can be parsed as a column for performing selections, etc.). Note that TreeCorr does not otherwise support this at the moment, so some changes would need to be made in catalog.py to leverage it

I mostly copied over the parquet reader tests for the arrow reader tests. Running test_reader.py, I get the following output:

time for test_fits_reader = 0.23
time for test_hdf_reader = 0.03
names =  {'KAPPA', 'GAMMA1', 'MU', 'RA', 'INDEX', 'GAMMA2', 'Z', 'DEC'}
names =  {'KAPPA', 'GAMMA1', 'MU', 'RA', 'INDEX', 'GAMMA2', 'Z', 'DEC'}
time for test_parquet_reader = 0.27
time for test_ascii_reader = 0.01
time for test_pandas_reader = 0.05
names =  {'KAPPA', 'GAMMA1', 'MU', 'RA', 'INDEX', 'GAMMA2', 'Z', 'DEC'}
names =  {'KAPPA', 'GAMMA1', 'MU', 'RA', 'INDEX', 'GAMMA2', 'Z', 'DEC'}
time for test_arrow_reader = 0.03

Comparing the arrow reader to the current parquet reader, the performance appears to be much better, though this is of course not a systematic comparison.

Edit: the performance disparity is in part a result of caching behavior. Running the arrow reader before the parquet reader results in both taking ~0.06 seconds (which does suggest the arrow reader still performs much better on unseen data).

sidneymau avatar Oct 29 '25 19:10 sidneymau