Export Dataset to Parquet

Open UsAndRufus opened this issue 2 years ago • 5 comments

Which package is the feature request for? If unsure which one to select, leave blank

@crawlee/core

Feature

CSV and JSON are not well-defined formats, both in writing and parsing. CSV is a particular problem. Both face issues with interpreting quotes, newlines, and commas. This leads to downstream issues where data is interpreted as being in the wrong column because of format and parser inconsistencies.

Parquet is a great data format that effectively solves these problems. It's also generally a lot smaller than CSV & JSON.

Motivation

I have recently had to fix a load of bugs caused by CSV being inconsistent. The way Crawlee saves CSV is not always immediately interpretable by Spark. I have to handle a lot of edge cases and ensure that newlines are stripped, quote marks are correct, commas are stripped, etc.
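For anyone hitting the same edge cases, here is a minimal sketch of RFC 4180 style quoting, which keeps commas, quotes, and newlines inside a field instead of stripping them. The helper names `escapeCsvField` and `toCsvRow` are made up for illustration; this is not how Crawlee writes CSV, and whether a given reader (Spark included) accepts multi-line quoted fields depends on its own parser settings.

```typescript
// Hypothetical helpers, not part of Crawlee.
// Quote one CSV field following the RFC 4180 conventions:
// wrap in double quotes when the field contains a comma, quote, CR or LF,
// and double any embedded quotes.
function escapeCsvField(value: unknown): string {
    const text = value === null || value === undefined ? '' : String(value);
    if (/[",\r\n]/.test(text)) {
        return `"${text.replace(/"/g, '""')}"`;
    }
    return text;
}

// Join already-escaped fields into a single CSV record (no trailing newline).
function toCsvRow(fields: unknown[]): string {
    return fields.map(escapeCsvField).join(',');
}
```

With this approach the data survives a round trip through a compliant parser, but it does not help when the reader and writer disagree on the dialect, which is the underlying complaint here.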

Ideal solution or implementation, and any additional constraints

Dataset.exportToParquet would be ideal

Alternative solutions or implementations

Converting to CSV/JSON first, then using an npm module to convert to Parquet. Or iterating through a dataset and adding to Parquet row-by-row.

Other context

No response

UsAndRufus avatar Feb 03 '23 13:02 UsAndRufus

Interesting, first time hearing about parquet. I am not sure if we want to support more formats natively, but we could make the exports extensible so you could do it yourself. I tried to look at the tooling for a package that is maintained or dependency-free, but I couldn't find anything, so that would be another problem.

Btw I'd be curious to hear what problems you have with JSON, as what you described (quotes, newlines, and commas) is just about correct escaping. JSON has its cons, but the ones you mentioned seem off. I can relate to CSV issues, but first time hearing someone complaining about JSON this way :]

B4nan avatar Feb 03 '23 14:02 B4nan

pyarrow has Parquet support bundled.

LeMoussel avatar Feb 04 '23 07:02 LeMoussel

@B4nan thanks for looking at the request! I agree a lot of the issues with CSV are resolved by JSON. I think on this project we may switch from CSV to JSON until a Parquet solution is feasible. It's not ideal as it's not tabular, but we can have our ingestion code handle that.

I have still had issues with parsers not interpreting JSON properly, though. Nested objects in particular can be an issue, as are quotes. Of course, it depends on the quality of the code doing the JSON writes and reads, but IME some of the edge cases are poorly-defined. JSON files are also text-based so generally a lot larger for transport, which can be an issue.

Re tooling for Parquet, thanks for the suggestion @LeMoussel. It is a newer format, and more oriented towards the ML/data science world. I have only recently become acquainted with it, but now I seem to see it everywhere. It seems like a good fit for Crawlee, as I imagine many people are doing what we're doing and pushing their scrape results into some kind of data science process.

I think making the export process extensible would be a great first step. Then we could look at building out some custom exporters. If those do well, are well maintained, and fit into the toolchain well, then Crawlee could look at integrating them into the main package :)
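To make the idea concrete, here is a rough sketch of what an extensible export could look like. Everything in it is hypothetical: `DatasetExporter`, `NdjsonExporter`, and `exportWith` are names invented for this example and do not exist in Crawlee. A Parquet exporter would implement the same interface and delegate to whichever writer library turns out to be usable.

```typescript
// Hypothetical shapes for illustration only; none of these exist in Crawlee.
type Item = Record<string, unknown>;

// An exporter receives rows one at a time and returns the serialized output.
interface DatasetExporter {
    begin(): void;
    writeRow(item: Item): void;
    end(): string;
}

// Simple newline-delimited JSON exporter, standing in for a custom format.
class NdjsonExporter implements DatasetExporter {
    private lines: string[] = [];

    begin(): void {
        this.lines = [];
    }

    writeRow(item: Item): void {
        // JSON.stringify escapes quotes and newlines inside values.
        this.lines.push(JSON.stringify(item));
    }

    end(): string {
        return this.lines.join('\n');
    }
}

// Drive any exporter over the items, row by row.
function exportWith(items: Iterable<Item>, exporter: DatasetExporter): string {
    exporter.begin();
    for (const item of items) {
        exporter.writeRow(item);
    }
    return exporter.end();
}
```

The row-by-row shape is deliberate: it matches the "iterate through a dataset" alternative from the original request and avoids holding a second full copy of the data in an intermediate format.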

UsAndRufus avatar Feb 06 '23 15:02 UsAndRufus

I'm using csv-writer to export my data in CSV. I'm also interested in Parquet and tried all the Parquet writers I could find on npm... for some reason the output is waaay bigger than CSV or JSON, and they don't have any compression settings available. For the moment I haven't found anything easy to implement. One solution would be trying to implement part of the Node version of polars.

dragospopa420 avatar Mar 24 '23 11:03 dragospopa420

I also believe parquet is the optimal format to export datasets. It is smaller in size and solves all the issues mentioned above related to CSV. I would be happy to contribute to making this available once the export is extensible.

etiennecl avatar Jun 30 '23 12:06 etiennecl