optd
feat: automatically converting csv to parquet
Summary: Automatically convert CSV files to Parquet, then generate stats from the Parquet files.
Demo:
Details:
- For robustness, we don't use schema inference. Instead, we build a temporary DataFusion context, create the tables from the DDL statements, and then read each table's schema back from DataFusion.
- I forked csv2parquet here. One notable change is that it's now a library instead of a binary. Also, we turn empty strings in nullable Utf8 columns into nulls in memory, because arrow's CSV reader doesn't seem to do this for Utf8 types. This has a huge effect on q-error on JOB.
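
To illustrate the empty-string handling described above, here is a minimal sketch using only the Rust standard library. It is not the code in the fork: `Utf8Column` and `empty_strings_to_nulls` are hypothetical names, and a `Vec<Option<String>>` stands in for an Arrow Utf8 array, with `None` playing the role of a null.

```rust
/// Hypothetical stand-in for an Arrow Utf8 column (assumption: the real
/// code operates on Arrow arrays, not on a Vec). `None` represents NULL.
pub struct Utf8Column {
    /// Whether the table's DDL declares this column as nullable.
    pub nullable: bool,
    pub values: Vec<Option<String>>,
}

/// Turn empty strings into nulls, but only for nullable columns.
/// Non-nullable columns are returned unchanged, since they cannot hold nulls.
pub fn empty_strings_to_nulls(col: Utf8Column) -> Utf8Column {
    if !col.nullable {
        return col;
    }
    let values = col
        .values
        .into_iter()
        // `Option::filter` keeps `Some(s)` only when `s` is non-empty;
        // an empty string becomes `None`, and an existing `None` stays `None`.
        .map(|v| v.filter(|s| !s.is_empty()))
        .collect();
    Utf8Column { nullable: true, values }
}
```

Gating on `nullable` matters in this sketch because a NOT NULL column has no null to map an empty string to, so its values are passed through untouched.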