optd

feat: automatically converting csv to parquet

Open wangpatrick57 opened this issue 9 months ago • 0 comments

Summary: Automatically convert CSV files to Parquet, then generate stats from the Parquet files.

Demo: Screenshot 2024-05-01 at 18 58 27

Details:

  • For robustness, we don't use schema inference. Instead, we build a temporary DataFusion context, create the tables with the DDL statements, and then read each table's schema back from DataFusion.
  • I forked csv2parquet here. One notable change is that it's now a library instead of a binary. Also, we turn empty strings in nullable Utf8 columns into nulls in memory, because Arrow's CSV reader doesn't seem to do this for Utf8 types. This has a huge effect on q-error on JOB.
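
The empty-string handling in the second bullet can be sketched roughly as follows. This is a minimal illustration using only the standard library, not the code in the csv2parquet fork: `Utf8Column` and `empty_strings_to_nulls` are hypothetical names, and a `Vec<Option<String>>` stands in for an Arrow string array, with `None` playing the role of a null.

```rust
/// Hypothetical stand-in for one Utf8 column read from a CSV file.
/// This is not a csv2parquet or Arrow type; `None` models a null value.
#[derive(Debug, Clone, PartialEq)]
pub struct Utf8Column {
    pub name: String,
    pub nullable: bool,
    pub values: Vec<Option<String>>,
}

/// Turn empty strings into nulls, but only for nullable columns.
/// Non-nullable columns are returned unchanged, since they cannot hold nulls.
pub fn empty_strings_to_nulls(col: Utf8Column) -> Utf8Column {
    if !col.nullable {
        return col;
    }
    let Utf8Column { name, nullable, values } = col;
    let values = values
        .into_iter()
        // Keep existing nulls as nulls; drop any value that is the empty string.
        .map(|v| v.filter(|s| !s.is_empty()))
        .collect();
    Utf8Column { name, nullable, values }
}
```

Under these assumptions, a nullable column holding `"a"`, `""`, and a null comes back as `"a"`, null, null, while a non-nullable column keeps its empty strings as they are.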

wangpatrick57 • May 01 '24 22:05