
Improve file metadata lookup in Parquet SDF

Open · clairemcginty opened this issue 10 months ago · 0 comments

When the new Parquet SplittableDoFn implementation reads a large number of files, the file metadata lookup (required to break individual files down into parallelizable row groups) can become a performance bottleneck, because it is effectively single-threaded and sequential: in the worker graph, you'll see a single worker doing nothing but metadata lookups for 10-20 minutes before the actual splitting operations kick in. Enabling the ParquetReadConfiguration.SplitGranularityFile option can mitigate this, but at the cost of available parallelism.
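To make the bottleneck concrete, here is a minimal sketch of the difference between sequential footer reads and fanning them out on a thread pool. The `FileMetadata` case class and `readFooter` function are illustrative stand-ins, not the real API: an actual implementation would read footers via `ParquetFileReader` in parquet-hadoop.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

// Illustrative stand-in for what a Parquet footer read returns; the real
// metadata would come from ParquetFileReader in parquet-hadoop.
case class FileMetadata(numRowGroups: Int, totalByteSize: Long)

// Sequential lookup: one footer read at a time, which is roughly the
// behavior observed today.
def lookupSequential(
  files: Seq[String],
  readFooter: String => FileMetadata
): Map[String, FileMetadata] =
  files.map(f => f -> readFooter(f)).toMap

// Parallel lookup: fan the footer reads out as Futures and wait for all
// of them, so N lookups overlap instead of running back to back.
def lookupParallel(
  files: Seq[String],
  readFooter: String => FileMetadata
)(implicit ec: ExecutionContext): Map[String, FileMetadata] = {
  val futures = files.map(f => Future(f -> readFooter(f)))
  Await.result(Future.sequence(futures), 1.minute).toMap
}
```

Both versions return the same result; only the wall-clock time for the lookup phase changes, bounded by the slowest footer read rather than the sum of all of them.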

Can we improve this? Some ideas:

  1. Simplest -- just perform the file metadata lookups in parallel.
  2. Introduce an option like ParquetReadConfiguration.UseEstimatedRowGroupSize -- instead of reading every file's metadata, sample a few files and use their average row-group count/size to extrapolate the rest.
  3. Write some kind of manifest file/metastore entry that maps individual files --> [# row groups, row-group byte size].
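Idea 2 could be sketched roughly as follows. Everything here is an assumption for illustration: `FileMetadata`, `readFooter`, and the take-the-first-k sampling strategy are hypothetical (a real implementation might sample randomly and would read footers via parquet-hadoop), but the shape of the estimator is the point -- only `sampleSize` footers are actually read.

```scala
// Illustrative stand-in for a Parquet footer read result.
case class FileMetadata(numRowGroups: Int, totalByteSize: Long)

// Read footers for only a small sample of files, then extrapolate the
// average row-group count and byte size to the remaining files.
def estimateMetadata(
  files: Seq[String],
  readFooter: String => FileMetadata,
  sampleSize: Int
): Map[String, FileMetadata] = {
  require(sampleSize > 0 && files.nonEmpty)
  // Hypothetical sampling strategy: just take the first k files.
  val sampled = files.take(sampleSize).map(f => f -> readFooter(f)).toMap
  val avgGroups = math.max(1, sampled.values.map(_.numRowGroups).sum / sampled.size)
  val avgBytes = sampled.values.map(_.totalByteSize).sum / sampled.size
  // Sampled files keep their true metadata; the rest get the estimate.
  files.map(f => f -> sampled.getOrElse(f, FileMetadata(avgGroups, avgBytes))).toMap
}
```

The trade-off is that estimated row-group boundaries may not line up with the real ones, so a reader built on these estimates would still need to tolerate some skew between estimated and actual split sizes.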

clairemcginty · Sep 05 '23 13:09