crawlee
Export Dataset to Parquet
Which package is the feature request for? If unsure which one to select, leave blank
@crawlee/core
Feature
CSV and JSON are not well-defined formats, both in writing and parsing. CSV is a particular problem. Both face issues with interpreting quotes, newlines, and commas. This leads to downstream issues where data is interpreted as being in the wrong column because of format and parser inconsistencies.
Parquet is a great data format that effectively solves these problems. It's also generally a lot smaller than CSV & JSON.
Motivation
I have recently had to fix a load of bugs caused by CSV being inconsistent. The way Crawlee saves CSV is not always immediately interpretable by Spark. I have to handle a lot of edge cases and ensure that newlines are stripped, quote marks are correct, commas are stripped, etc.
Ideal solution or implementation, and any additional constraints
Dataset.exportToParquet would be ideal.
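For illustration only, a hypothetical call shape (neither exportToParquet nor its options exist in crawlee today; Dataset.open is the only real API used here):

```ts
import { Dataset } from 'crawlee';

const dataset = await Dataset.open();

// Hypothetical method and option bag, shown purely as a proposal:
// write the whole dataset out as a single Parquet file.
await dataset.exportToParquet('results', { compression: 'SNAPPY' });
```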
Alternative solutions or implementations
Converting to CSV/JSON first, then using an npm module to convert to Parquet. Or iterating through a dataset and adding to Parquet row by row (a rough sketch follows).
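A minimal sketch of that row-by-row alternative, assuming the community parquetjs package and a flat example item shape (url and title fields); this is not an official Crawlee integration:

```ts
import { Dataset } from 'crawlee';
// parquetjs is a community package and ships no TypeScript types.
// @ts-ignore
import parquet from 'parquetjs';

// Parquet needs a schema declared up front; the fields here are just an
// example item shape, adjust to match whatever your crawler pushes.
const schema = new parquet.ParquetSchema({
  url: { type: 'UTF8' },
  title: { type: 'UTF8', optional: true },
});

const dataset = await Dataset.open();
const writer = await parquet.ParquetWriter.openFile(schema, 'dataset.parquet');

// Iterate the dataset and append items one row at a time.
await dataset.forEach(async (item) => {
  await writer.appendRow({ url: item.url, title: item.title });
});

await writer.close();
```

The obvious downside is having to know the schema ahead of time, which is exactly what a built-in or extensible exporter could handle by inferring it from the stored items.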
Other context
No response
Interesting, first time hearing about parquet. I am not sure if we want to support more formats natively, but we could make the exports extensible so you could do it yourself. I tried looking at the tooling and haven't found a single package that is maintained or dependency-free, so that would be another problem.
Btw I'd be curious to hear what problems you have with JSON, as what you described (quotes, newlines, and commas) is just about correct escaping. JSON has its cons, but the ones you mentioned seem off. I can relate to CSV issues, but it's the first time I've heard someone complaining about JSON this way :]
pyarrow has Parquet support bundled.
@B4nan thanks for looking at the request! I agree a lot of the issues with CSV are resolved by JSON. I think on this project we may switch from CSV to JSON until a Parquet solution is feasible. It's not ideal as it's not tabular, but we can have our ingestion code handle that.
I have still had issues with parsers not interpreting JSON properly, though. Nested objects in particular can be an issue, as are quotes. Of course, it depends on the quality of the code doing the JSON writes and reads, but IME some of the edge cases are poorly-defined. JSON files are also text-based so generally a lot larger for transport, which can be an issue.
Re tooling for Parquet, thanks for the suggestion @LeMoussel. It is a newer format, and more oriented towards the ML/data science world. I have only recently become acquainted with it, but now I seem to see it everywhere. It seems like a good fit for Crawlee, as I imagine many people are doing what we're doing and pushing their scrape results into some kind of data science process.
I think making the export process extensible would be a great first step. Then we could look at building out some custom exporters. If those do well, are well maintained, and fit into the toolchain well, then Crawlee could look at integrating them into the main package :)
I'm using csv-writer to export my data in CSV. I'm also interested in Parquet and tried all the possible Parquet writers from npm... for some reason the output is waaay bigger than CSV or JSON, and they don't have any compression settings available. For the moment I haven't found anything easy to implement. One solution would be to implement part of the Node version of Polars.
I also believe Parquet is the optimal format to export datasets. It is smaller in size and solves all the issues mentioned above related to CSV. I would be happy to contribute to making this available once the export is extensible.