frawk
Support for Parquet files
This might sound crazy, but I'd like to propose a feature request: support for Parquet files.
You might ask why. Parquet files are becoming more widespread and might even be considered "the new CSV". There are specialized tools such as DuckDB that run SQL queries on them, but I haven't come across any awk-like utility that can process Parquet files.
IMHO, Parquet support in frawk would be a huge win for the "data analysis at the command line" camp.
I don't think this is unreasonable at all (which isn't to say it will be easy :))! I think something like Parquet would be pretty interesting to support with frawk. I may repurpose this issue for general "supporting complex datatypes with arbitrary nesting", but I think Parquet should definitely be in the picture, along with Arrow and maybe JSON eventually. This will take some time, and I will probably start on some "easier" work around fixing the parser first, but I will definitely keep this issue open. Thanks for the suggestion.
I'm sure the "nested data" part will be a headache; maybe only non-nested files could be supported initially.
Time flies, and now I use DuckDB to convert Parquet to CSV:
$ duckdb -c "COPY (select * from 'family.parquet') TO 'query.csv' (FORMAT CSV)"
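The conversion above can also be streamed straight into frawk without an intermediate file. The following is only a rough sketch, assuming both `duckdb` and `frawk` are on the `PATH`; the function names `parquet_csv_sql` and `parquet_to_frawk` are hypothetical helpers made up for illustration, and the file name must not contain single quotes.

```shell
# Hypothetical helper: build the DuckDB statement that writes a Parquet
# file to stdout as CSV, including a header row.
parquet_csv_sql() {
  printf "COPY (SELECT * FROM '%s') TO '/dev/stdout' (FORMAT CSV, HEADER)" "$1"
}

# Hypothetical helper: run the statement in an in-memory DuckDB session and
# pipe the CSV into frawk, which parses it with its CSV input mode (-i csv).
#   $1 = path to the Parquet file
#   $2 = frawk program to run on each CSV record
parquet_to_frawk() {
  duckdb -c "$(parquet_csv_sql "$1")" | frawk -i csv "$2"
}
```

For example, `parquet_to_frawk family.parquet 'NR > 1 { print $1 }'` would print the first column while skipping the header row. This only covers flat (non-nested) columns, since nested values have no natural CSV representation.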
Lots of tools now support Parquet-to-CSV conversion: Polars, DuckDB, ClickHouse local, GlareDB, etc.