
incorporate parquet files?

Open wibeasley opened this issue 2 years ago • 2 comments

Has there been any discussion of using parquet files at some level of Dataverse? (I see it mentioned in only one issue.)

Apache Parquet is an open source, column-oriented data file format designed for efficient data storage and retrieval. It provides efficient data compression and encoding schemes with enhanced performance to handle complex data in bulk. Apache Parquet is designed to be a common interchange format for both batch and interactive workloads.

I've used them some, and I love how well they work with R, Python, DuckDB, Spark, and others.

Several R programmers (like @kuriwaki) have advocated for rds files over RData files. From my recent experience with parquet files, they have all the advertised advantages of rds files (e.g., compression, strong typing, and factor levels), plus the appeal of interoperability with other platforms.
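To make that concrete, here's a rough sketch of the round trip I have in mind, using the {arrow} package (file names are just placeholders):

```r
# Rough sketch, assuming the {arrow} package is installed.
# rds and parquet both preserve types and factor levels; parquet is
# additionally readable from Python, DuckDB, Spark, etc.
library(arrow)

df <- data.frame(
  state = factor(c("OK", "TX", "OK")),  # factor levels travel with the data
  pop   = c(4e6, 3e7, 4e6)
)

# rds: R-only round trip
saveRDS(df, "example.rds")
identical(df, readRDS("example.rds"))

# parquet: same round trip, but the file is an open interchange format
write_parquet(df, "example.parquet")
df2 <- read_parquet("example.parquet")
levels(df2$state)  # factors come back via parquet's dictionary encoding
```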

I haven't thought much beyond this. But when I read about problems with RData files and the messiness of Rserve described by @landreev, I see parquet as an improvement for many reasons, not least the ability to replace a flaky remote instance with a local parquet library.

cc: @pdurbin

wibeasley avatar Sep 08 '23 19:09 wibeasley

Yes! Recently 2020 data from the US Census was published in Harvard Dataverse in parquet format: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/5LAVKV

(2010 data was published as well: https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/1OR2A6 )

This work with the US Census is ongoing and is being tracked here:

  • https://github.com/IQSS/dataverse.harvard.edu/issues/218

That said, no, Dataverse doesn't have any particular support for parquet files. In the examples above the parquet files are in a zip file. Here's a preview of the 2020 zip:

[Screenshot: preview of the file hierarchy inside the 2020 zip]
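For anyone who wants to work with these files locally, something along these lines should work once you have the zip (the URL, file id, and paths below are placeholders, not the real ones for this dataset):

```r
# Rough sketch: download the zip, unzip it locally, and read one of the
# parquet files with {arrow}. The file id and paths are placeholders.
library(arrow)

zip_path <- "census2020.zip"
download.file(
  "https://dataverse.harvard.edu/api/access/datafile/FILE_ID",  # placeholder id
  destfile = zip_path, mode = "wb"
)
unzip(zip_path, exdir = "census2020")

parquet_files <- list.files("census2020", pattern = "\\.parquet$",
                            recursive = TRUE, full.names = TRUE)
df <- read_parquet(parquet_files[1])
```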

pdurbin avatar Sep 08 '23 19:09 pdurbin

One note here is that Dataverse does not seem to unzip a compressed parquet collection in a way that respects the file hierarchy. In this example I just made, it says it "failed to unzip the file properly". The file itself is still intact: the user can unzip it themselves after downloading, and Phil's screenshot above shows that the file hierarchy can be viewed as metadata, which may be the best way forward. But just a note.

https://github.com/IQSS/dataverse/assets/21006/8e308a48-26d3-4bd8-8ae5-5d87a2f0d6e9

kuriwaki avatar Jul 02 '24 15:07 kuriwaki

The parquet dataset I'm working with now does expand the partitions (subfolders) properly, but Dataverse converts the "=" in the subfolder name to "." and won't let me change it back to an "=" (error message below). This leads to a corrupt parquet file when someone downloads it, because the subdirectory name must contain the "=". I wonder if such an exception can be allowed?

[Screenshot: error message when attempting to rename the subfolder back to use "="]
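To make the problem concrete, here's a small sketch of the layout that {arrow} writes and expects back (column and folder names are just illustrative):

```r
# Hive-style partition folders encode "column=value"; readers rebuild the
# partition column from the folder name, so the "=" is load-bearing.
library(arrow)

df <- data.frame(year = c(2010, 2020, 2020), x = 1:3)

# write_dataset() creates subfolders like year=2010/ and year=2020/
write_dataset(df, "mydata", partitioning = "year")
list.files("mydata", recursive = TRUE)
#> e.g. "year=2010/part-0.parquet" "year=2020/part-0.parquet"

# open_dataset() parses those folder names to recover the `year` column.
# If the folders come back as year.2010/ and year.2020/, that parsing
# fails and the partition column is lost.
ds <- open_dataset("mydata")
ds$schema  # includes `year`, recovered from the folder names
```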

kuriwaki avatar Oct 09 '24 03:10 kuriwaki

@kuriwaki that's unfortunate! Would you be able to create a dedicated issue for this (about = changing to . in the file path) and upload a small parquet file we can test with?

pdurbin avatar Oct 09 '24 14:10 pdurbin