contentUrl for each format of a file (original proprietary vs archival)

Open pdurbin opened this issue 1 year ago • 4 comments

For some proprietary file formats such as Excel, Stata, and SPSS, Dataverse creates non-proprietary formats (TSV and RData) of uploaded files for archival purposes. In addition, a TSV version might be good enough for a researcher who doesn't have the proprietary software installed.

Our Croissant output is still a work in progress but for now I'm favoring the original, proprietary format under contentUrl like this:

{
    "@type": "cr:FileObject",
    "@id": "stata13-auto.dta",
    "name": "stata13-auto.dta",
    "encodingFormat": "application/x-stata-13",
    "md5": "7b1201ce6b469796837a835377338c5a",
    "contentSize": "6443 B",
    "contentUrl": "http://localhost:8080/api/access/datafile/6?format=original"
}

However, if I wanted to advertise that non-proprietary formats (TSV and RData) are available as well, what's the best practice in Croissant?

Would each format be another FileObject? In our UI (below), we show a single file with multiple download options but maybe from the Croissant perspective these formats would be better represented as different files? They would have different checksums and sizes, after all. 🤔

[Screenshot, 2024-04-29: Dataverse UI showing a single file with multiple download options]

pdurbin avatar Apr 29 '24 20:04 pdurbin

I would lean towards representing them as separate FileObjects. If we want to preserve the connection between them, we could specify sc:encoding pointing from the original file to each of the alternative formats. Would that make sense?
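A minimal sketch of how that suggestion might look, building the JSON-LD in Python. Everything beyond the original Stata entry is an assumption made for illustration: the `@id` values for the derived files, their MIME types, and the `contentUrl` query parameters are hypothetical, and `md5` and `contentSize` are left out because each format would have its own values.

```python
import json

# Base access URL taken from the example earlier in this issue.
BASE_URL = "http://localhost:8080/api/access/datafile/6"


def file_object(file_id, encoding_format, content_url):
    """Build a bare-bones cr:FileObject (md5 and contentSize omitted)."""
    return {
        "@type": "cr:FileObject",
        "@id": file_id,
        "name": file_id,
        "encodingFormat": encoding_format,
        "contentUrl": content_url,
    }


original = file_object(
    "stata13-auto.dta", "application/x-stata-13", f"{BASE_URL}?format=original"
)
# Hypothetical ids, MIME types, and URLs for the archival formats.
tsv = file_object("stata13-auto.tab", "text/tab-separated-values", BASE_URL)
rdata = file_object(
    "stata13-auto.RData", "application/x-rlang-transport", f"{BASE_URL}?format=RData"
)

# Link the original file to each alternative format via sc:encoding.
original["sc:encoding"] = [{"@id": tsv["@id"]}, {"@id": rdata["@id"]}]

distribution = [original, tsv, rdata]
print(json.dumps(distribution, indent=2))
```

Each format stays a separate FileObject with its own checksum and size, while `sc:encoding` keeps the link back to the file the user actually uploaded.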

benjelloun avatar May 17 '24 16:05 benjelloun

@benjelloun sort of? Is there a concrete example among the datasets at https://github.com/mlcommons/croissant/tree/v1.0.5/datasets/1.0 ? I looked quickly but couldn't find one.

Either way, it sounds like this could be something we add later. I think most of our users will be happy with the original file.

pdurbin avatar May 17 '24 18:05 pdurbin

"encodingFormat": "application/x-stata-13",

While testing https://huggingface.co/spaces/JoaquinVanschoren/croissant-checker as part of https://github.com/IQSS/dataverse/issues/11462 I realized that Stata files are not supported. Here's a comment to focus on: https://github.com/IQSS/dataverse/issues/11462#issuecomment-2894744955

If I go to https://github.com/mlcommons/croissant/blob/v1.0.17/python/mlcroissant/mlcroissant/_src/operation_graph/operations/read.py#L124 I see support for CSV, TSV, JSON, Parquet, etc., but no Stata.

From a quick test, Stata is supported by Pandas:

import pandas as pd

# Path to a local Stata file, e.g. the one from the example above
file_path = "stata13-auto.dta"

df = pd.read_stata(file_path)
print(df.head())
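To illustrate the idea, here is a rough sketch of choosing a pandas reader based on `encodingFormat`. This is not mlcroissant's actual code; the `READERS` mapping and the `read_by_encoding_format` helper are hypothetical names, and only `pd.read_csv` and `pd.read_stata` are real pandas calls.

```python
import pandas as pd

# Hypothetical mapping from an encodingFormat string to a pandas reader.
READERS = {
    "text/csv": pd.read_csv,
    "text/tab-separated-values": lambda path: pd.read_csv(path, sep="\t"),
    "application/x-stata-13": pd.read_stata,
}


def read_by_encoding_format(path, encoding_format):
    """Read a file into a DataFrame using the reader registered for its format."""
    try:
        reader = READERS[encoding_format]
    except KeyError:
        raise ValueError(f"Unsupported encodingFormat: {encoding_format}") from None
    return reader(path)
```

With something like this, `read_by_encoding_format("stata13-auto.dta", "application/x-stata-13")` would return the same DataFrame as calling `pd.read_stata` directly.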

This is a bit off topic for this issue but I thought I'd mention it, at least! 😄

pdurbin avatar May 21 '25 20:05 pdurbin