Parquet output format

Open igorbrigadir opened this issue 4 years ago • 5 comments

Instead of writing CSVs, append the parsed dataframes to a Parquet file: https://stackoverflow.com/a/47839247/11090908

igorbrigadir avatar Aug 20 '21 18:08 igorbrigadir

Being able to output as parquet would be nice too--even if it's called twarc-csv :-)

edsu avatar Aug 20 '21 20:08 edsu

Yeah I'm actually considering a different command as an alias, just for it to make semantic sense / good docs, so these would be the same:

twarc2 dataframe --output-format parquet input.json output.parquet

twarc2 csv --output-format parquet input.json output.parquet

But not sure how useful that is. It'll purely be an alias for a docs entry and for the command line.

igorbrigadir avatar Aug 21 '21 11:08 igorbrigadir

I was going to say that pandas has many output formats. It might not be hard to add parquet, pickle, hdf, sql, excel, json, html, feather, latex, stata, gbq, markdown, ... :-) but like you said, figuring out the api is the hard part.
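To make that concrete, here is a small sketch of how an output-format option could map onto pandas writers. The names WRITERS and write_dataframe are hypothetical and not part of twarc-csv, only a few formats are listed, and the parquet entry needs pyarrow or fastparquet installed.

```python
import pandas as pd

# Hypothetical mapping from an --output-format value to a pandas writer.
# Every call below is a standard DataFrame method.
WRITERS = {
    "csv": lambda df, path: df.to_csv(path, index=False),
    "json": lambda df, path: df.to_json(path, orient="records", lines=True),
    "parquet": lambda df, path: df.to_parquet(path, index=False),
    "pickle": lambda df, path: df.to_pickle(path),
}


def write_dataframe(df, path, output_format="csv"):
    """Write df to path in the requested format, or raise ValueError."""
    try:
        writer = WRITERS[output_format]
    except KeyError:
        known = ", ".join(sorted(WRITERS))
        raise ValueError(
            f"unknown output format {output_format!r}, expected one of: {known}"
        )
    writer(df, path)
```

Adding another format is then a one-line change to the mapping, which leaves the command-line naming question as the only open part.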

edsu avatar Aug 21 '21 14:08 edsu

Yeah - still figuring out that part!

igorbrigadir avatar Aug 22 '21 05:08 igorbrigadir

Still haven't figured this out, but for now you can use DataFrameConverter to get a pandas DataFrame object, which you can convert to parquet yourself. I'll keep this open for implementing the actual command later.

Maybe an alias?

twarc2 dataframe input.jsonl output.parquet

or

twarc2 dataframe --output-format parquet input.jsonl output.parquet

or

twarc2 csv --output-format parquet input.jsonl output.parquet

igorbrigadir avatar Oct 24 '21 12:10 igorbrigadir