contentUrl for each format of a file (original proprietary vs archival)
For some proprietary file formats such as Excel, Stata, and SPSS, Dataverse creates non-proprietary formats (TSV and RData) of uploaded files for archival purposes. In addition, a TSV version might be good enough for a researcher who doesn't have the proprietary software installed.
Our Croissant output is still a work in progress, but for now I'm favoring the original, proprietary format under contentUrl, like this:
{
  "@type": "cr:FileObject",
  "@id": "stata13-auto.dta",
  "name": "stata13-auto.dta",
  "encodingFormat": "application/x-stata-13",
  "md5": "7b1201ce6b469796837a835377338c5a",
  "contentSize": "6443 B",
  "contentUrl": "http://localhost:8080/api/access/datafile/6?format=original"
}
However, if I wanted to advertise that non-proprietary formats (TSV and RData) are available as well, what's the best practice in Croissant?
Would each format be another FileObject? In our UI, we show a single file with multiple download options, but maybe from the Croissant perspective these formats would be better represented as different files? They would have different checksums and sizes, after all. 🤔
I would lean towards representing them as separate FileObjects. If we want to preserve the connection between them, we could specify sc:encoding pointing from the original file to each of the alternative formats. Would that make sense?
@benjelloun sort of? Is there a concrete example in the examples at https://github.com/mlcommons/croissant/tree/v1.0.5/datasets/1.0 ? I looked quickly but couldn't find one.
Either way, it sounds like this could be something we add later. I think most of our users will be happy with the original file.
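For reference, here is a rough sketch of what the "separate FileObjects linked by sc:encoding" idea from the comment above might look like, written as Python that builds the JSON. Only the original file's entry comes from this thread. The @id values, encodingFormat strings, and contentUrl values for the TSV and RData versions are assumptions for illustration, and md5/contentSize are left out for them because they would have to come from the real derived files. Whether Croissant validators accept sc:encoding used this way has not been checked.

```python
import json

# Base download URL from the example in this issue.
BASE = "http://localhost:8080/api/access/datafile/6"

# The original, proprietary file, exactly as shown earlier in the thread.
original = {
    "@type": "cr:FileObject",
    "@id": "stata13-auto.dta",
    "name": "stata13-auto.dta",
    "encodingFormat": "application/x-stata-13",
    "md5": "7b1201ce6b469796837a835377338c5a",
    "contentSize": "6443 B",
    "contentUrl": BASE + "?format=original",
}

# ASSUMPTION: ids, media types, and URLs below are placeholders, not
# confirmed Dataverse output. Checksums and sizes are omitted on purpose.
tsv = {
    "@type": "cr:FileObject",
    "@id": "stata13-auto.tab",
    "name": "stata13-auto.tab",
    "encodingFormat": "text/tab-separated-values",
    "contentUrl": BASE,
}
rdata = {
    "@type": "cr:FileObject",
    "@id": "stata13-auto.RData",
    "name": "stata13-auto.RData",
    "encodingFormat": "application/x-rlang-transport",
    "contentUrl": BASE + "?format=RData",
}

# Link the original to its alternative encodings by @id reference.
original["sc:encoding"] = [{"@id": tsv["@id"]}, {"@id": rdata["@id"]}]

distribution = [original, tsv, rdata]
print(json.dumps(distribution, indent=2))
```

Keeping the link on the original file only means a consumer that ignores sc:encoding still sees three ordinary FileObjects.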
"encodingFormat": "application/x-stata-13",
While testing https://huggingface.co/spaces/JoaquinVanschoren/croissant-checker as part of https://github.com/IQSS/dataverse/issues/11462 I realized that Stata files are not supported. Here's a comment to focus on: https://github.com/IQSS/dataverse/issues/11462#issuecomment-2894744955
If I go to https://github.com/mlcommons/croissant/blob/v1.0.17/python/mlcroissant/mlcroissant/_src/operation_graph/operations/read.py#L124 I see support for CSV, TSV, JSON, Parquet, etc., but no Stata.
From a quick test, Stata is supported by Pandas:
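import pandas as pd
# Path to a local Stata file, e.g. the one from this issue
file_path = "stata13-auto.dta"
df = pd.read_stata(file_path)
print(df.head())  # first five rows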
This is a bit off topic for this issue but I thought I'd mention it, at least! 😄
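To make the pandas point a little more concrete, here is a minimal sketch of picking a reader based on encodingFormat. This is not mlcroissant's actual code. The READERS table and the read_file helper are hypothetical names made up for illustration, and the only encodingFormat value taken from this thread is application/x-stata-13.

```python
import pandas as pd

# Hypothetical lookup table: encodingFormat -> pandas reader.
# Not taken from mlcroissant; for illustration only.
READERS = {
    "text/csv": pd.read_csv,
    "text/tab-separated-values": lambda path: pd.read_csv(path, sep="\t"),
    "application/x-stata-13": pd.read_stata,
}


def read_file(path, encoding_format):
    """Read a file into a DataFrame using the reader for its encodingFormat."""
    try:
        reader = READERS[encoding_format]
    except KeyError:
        raise ValueError(f"Unsupported encodingFormat: {encoding_format}") from None
    return reader(path)
```

With something like this, read_file("stata13-auto.dta", "application/x-stata-13") would go through pd.read_stata, and an unknown format would fail with a clear error instead of silently falling through.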