support for parquet files

Open alperyilmaz opened this issue 2 years ago • 3 comments

This might sound crazy, but I'd like to propose a feature request: support for Parquet files.

You might ask, why? Parquet files are becoming more widespread and might even be considered "the new CSV". There are specialized tools such as DuckDB for running SQL queries on them, but I haven't come across any awk-like utility that can process Parquet files.

IMHO, Parquet support in frawk would be a huge win for the "data analysis at the command line" camp.

alperyilmaz avatar Jun 26 '22 23:06 alperyilmaz

I don't think this is unreasonable at all (which isn't to say it will be easy :))! I think something like Parquet would be pretty interesting to support with frawk. I may repurpose this issue for general "supporting complex datatypes with arbitrary nesting", but I think Parquet should definitely be in the picture, along with Arrow and maybe JSON eventually. This will take some time, and I will probably start on some "easier" work around fixing the parser first, but I will definitely keep this issue open. Thanks for the suggestion.

ezrosent avatar Jun 27 '22 02:06 ezrosent

I'm sure the "nested data" part will be a headache. Maybe only non-nested files could be supported initially.

alperyilmaz avatar Jun 27 '22 09:06 alperyilmaz

Time flies, and now I use DuckDB to convert Parquet to CSV:

$ duckdb -c "COPY (select * from 'family.parquet') TO 'query.csv' (FORMAT CSV)"

Lots of tools now support converting Parquet to CSV: Polars, DuckDB, ClickHouse local, GlareDB, etc.

linux-china avatar Mar 12 '24 03:03 linux-china