1.0 as a string should be a valid version for a dataset
Dataverse uses MAJOR.MINOR for dataset versions like 1.0, 1.1, 2.0. I believe this should be valid. I'm aware that the validator wants MAJOR.MINOR.PATCH (output below) and that under the Version section, the Croissant 1.0 spec says the same (Semantic Versioning 2.0.0).
I'm also aware that 1.0 as number instead of a string is considered valid (thanks @goeffthomas for pointing this out), but I don't see this in the spec and I think versions should be strings (1.0.0 isn't a number anyway). I see from #593 and #594 that just 1 as a number is also valid but this isn't a solution for Dataverse.
Below is the warning I see as of 1.0.3 from the validator when testing my work-in-progress Croissant file (also below). The relevant line in the JSON is, of course, the following:
"version": "1.0",
So! Can we consider 1.0 valid?
Thanks for your consideration!
$ mlcroissant validate --jsonld croissant.json
W0315 16:45:48.010244 140704470599424 datasets.py:32] Found the following 1 warning(s) during the validation:
- [Metadata(Cars)] Version doesn't follow MAJOR.MINOR.PATCH: 1.0. For more information refer to: https://semver.org/spec/v2.0.0.html
I0315 16:45:48.010346 140704470599424 validate.py:53] Done.
$
$ cat croissant.json
{
"@context": {
"@language": "en",
"@vocab": "https://schema.org/",
"citeAs": "cr:citeAs",
"column": "cr:column",
"conformsTo": "dct:conformsTo",
"cr": "http://mlcommons.org/croissant/",
"data": {
"@id": "cr:data",
"@type": "@json"
},
"dataBiases": "cr:dataBiases",
"dataCollection": "cr:dataCollection",
"dataType": {
"@id": "cr:dataType",
"@type": "@vocab"
},
"dct": "http://purl.org/dc/terms/",
"extract": "cr:extract",
"field": "cr:field",
"fileProperty": "cr:fileProperty",
"fileObject": "cr:fileObject",
"fileSet": "cr:fileSet",
"format": "cr:format",
"includes": "cr:includes",
"isEnumeration": "cr:isEnumeration",
"jsonPath": "cr:jsonPath",
"key": "cr:key",
"md5": "cr:md5",
"parentField": "cr:parentField",
"path": "cr:path",
"personalSensitiveInformation": "cr:personalSensitiveInformation",
"recordSet": "cr:recordSet",
"references": "cr:references",
"regex": "cr:regex",
"repeated": "cr:repeated",
"replace": "cr:replace",
"sc": "https://schema.org/",
"separator": "cr:separator",
"source": "cr:source",
"subField": "cr:subField",
"transform": "cr:transform",
"wd": "https://www.wikidata.org/wiki/"
},
"@type": "sc:Dataset",
"conformsTo": "http://mlcommons.org/croissant/1.0",
"name": "Cars",
"version": "1.0",
"citeAs": "https://doi.org/10.5072/FK2/EKY1NP",
"distribution": [
{
"@type": "cr:FileObject",
"name": "stata13-auto.dta",
"encodingFormat": "application/x-stata-13",
"md5": "7b1201ce6b469796837a835377338c5a",
"sha256": "see md5",
"contentUrl": "stata13-auto.dta"
}
],
"license": "http://creativecommons.org/publicdomain/zero/1.0",
"datePublished": "2024-03-13",
"dateModified": "2024-03-13"
}
Over at https://github.com/gdcc/dataverse-exporters/commit/3e95c977722e1b83a3b584bbecc4e77bc628c60f I pushed a commit to simply append ".0" to the version of Dataverse datasets, but I'm pretty grumpy about it because it's really suboptimal for us. Here's the comment I left in the code:
- We append ".0" to our version string to make the Croissant
- validator happy and to comply with the Croissant spec but we
- would rather not.
- Here's what the spec says:
- "The recommended versioning scheme to use for datasets is
- MAJOR.MINOR.PATCH, following Semantic Versioning 2.0.0." --
- https://mlcommons.github.io/croissant/docs/croissant-spec.html#version
- If a Dataverse user sees 1.0.0 they will hopefully know that they
- must remove the final ".0" to see the actual version of the
- dataset they want. Note that as of Dataverse 6.1, if you navigate
- to 1.0.0 you will be redirected to the latest published version,
- which is possibly quite different than version 1.0 of the
- dataset! For example you will be redirected to version 4.0 (as of
- this writing) if you try to navigate to
- https://dataverse.harvard.edu/dataset.xhtml?persistentId=doi:10.7910/DVN/TJCLKP&version=1.0.0
- This is suboptimal, of course. The API will simply show the error
- "Illegal version identifier '1.0.0'" if you try to access version
- 1.0.0 of a dataset via API. Here's an example
- https://dataverse.harvard.edu/api/datasets/:persistentId/versions/1.0.0?persistentId=doi:10.7910/DVN/TJCLKP
- Either way, a version of 1.0.0 is not helpful to a Dataverse
- user. The GUI may send you to an old version. The API will bark
- at you that version is illegal.
- We opened https://github.com/mlcommons/croissant/issues/609 to
- argue that 1.0 is a perfect valid version for a dataset.
- Hopefully in the future we can stop appending ".0"!
- By the way, it's technically possible to make the Croissant
- validator happy by passing 1.0 as a float. However, this is not
- compliant with the Croissant spec which (again) wants 1.0.0. We
- are sticking with strings because they are a more sane way to
- represent a version and they're what SemVer uses, even allowing
- "-alpha", etc.
Today I played around with where to put the real version of a Dataverse dataset (e.g. "1.0") but I can't find a good place without triggering schema.org violations.
Again, users need to real version number if they want to use Dataverse APIs successfully, as described in my comment above.
Right now my Croissant implementation appends an extra ".0" but I'd rather just put "1.0" and be able to configure the Croissant validator to ignore the warning, so I just opened this related issue:
- #643
@ccl-core Can you please take a look at this issue?
Hi @pdurbin , thank you for reporting this issue!
Just to clarify, 1.0 is a valid version: the validation script is just throwing a warning.
The mlcroissant library takes already care of the version casting, so you should be able to use your croissant file with 1.0 version.
Unless your issue is about the warning itself: do you think that the warning message is wrong or incomplete?
@ccl-core thanks for getting back to me.
First, in terms of validity, the spec implies that ANY string is valid, right? "foobar" and "supercalifragilisticexpialidocious" are both valid versions. This is because version in schema.org can be "Number" or "Text".
The table below summarizes my understanding of the how the validator and the spec treat the following values. (I put strings in quotes and left quotes off numbers.)
Current state
| Value | Validator | Spec |
|---|---|---|
| "1.0.0" | no warnings ✅ | recommended ✅ |
| 1.0 or 1 | no warnings ✅ | valid 🤷♀️ |
| "1.0" | warning ❌ | valid 🤷♀️ |
| "foobar" | warning ❌ | valid 🤷♀️ |
This feels off to me. I think a step in the right direction would be to allow the validator to not throw a warning for "1.0" (as a string), just like how it doesn't throw a warning for 1 or 1.0 (as numbers). Further, perhaps the spec should be updated to mention that 1 and 1.0 (as numbers) get special treatment (at least by the validator). And perhaps "1.0" (as a string) would get the same special treatment in the future. Here's an updated table:
Option 1: treat "1.0" like 1.0
| Value | Validator | Spec |
|---|---|---|
| "1.0.0" | no warnings ✅ | recommended ✅ |
| 1.0 or 1 | no warnings ✅ | special treatment mentioned 👀 |
| "1.0" | no warnings ✅ | special treatment mentioned 👀 |
| "foobar" | warning ❌ | valid 🤷♀️ |
If you're thinking, "just use 1.0 as a number instead of a string", the reason I'm reluctant to do this is that if Dataverse ever does switch from MAJOR.MINOR to MAJOR.MINOR.PATCH (as Croissant recommends) it would mean switching the type from number to string, which would be a breaking change for our users. I'd rather use strings in our Croissant export from the start so we can more easily switch from "1.0" to "1.0.0" in the future. I think we all know that in the real world, versions are usually strings, not numbers. SemVer is a good example.
Option 2: pass a flag to ignore version warnings
Another perfectly valid option could be to allow people like me who don't want to see the warning about version to pass a flag to the validator to suppress the "Version doesn't follow MAJOR.MINOR.PATCH" warning. I'm happy to see you're working on this (thanks!):
- #658
I imagine it will resolve the issue I opened:
- #643
With this option, I still think it would be good to update the spec to explain that 1 and 1.0 are given special treatment (no warnings shown):
| Value | Validator | Spec |
|---|---|---|
| "1.0.0" | no warnings ✅ | recommended ✅ |
| 1.0 or 1 | no warnings ✅ | special treatment mentioned 👀 |
| "1.0" | warning ❌ | valid 🤷♀️ |
| "foobar" | warning ❌ | valid 🤷♀️ |
I hope this helps! Thanks again!
Drive-by comment: I dislike the fact that versions can be numbers... As numbers, there is no difference between 1 and 1.0, even though we may consider them to be different values in terms of versions.
For the next version of Croissant, I wonder if we should constrain version to always be a string and recommend that it follows one of the patterns: "x", "x.y" or "x.y.z", where x y and z are integers, and maybe allow other strings but issue a (suppressible) warning if it doesn't follow one of these patterns.
What do you all think?
@benjelloun yeah, it sounds like we're in agreement that versions should be strings, not numbers.
In the schema.org world, I'm not familiar with how best to constrain a field like version that allows both Text and Number. I suppose we could have cr:version that has its own rules.
If we want to stick with SemVer, please keep in mind that it's not just integers and dots (periods). You'll see examples like "1.0.0-alpha" and "1.0.0+21AF26D3" in the spec (under "MAY" for pre-release and build metadata).
We can constrain sc:version by changing its documentation in our spec, and validate the field for datasets that specify that they comply with Croissant 1.0.
Indeed, for SemVer we would need some regexp like https://semver.org/#is-there-a-suggested-regular-expression-regex-to-check-a-semver-string
@ccl-core @benjelloun thanks for fixing this issue! ❤