Overly restrictive validation for content_size property
With the metadata object below, running the validation command fails:
mlcroissant validate --jsonld croissant_minimal.jsonld
E0210 11:11:52.866241 126556827741056 validate.py:55] Found the following 1 error(s) during the validation:
- `content_size` should have type https://schema.org/Text, but got int
Integers are a reasonable type for describing file sizes, and many systems output them as such. Text makes sense when a unit needs to be provided, but the spec states that bytes should be assumed when no unit is given, which should cover the use of ints:
File size in (mega/kilo/…)bytes. Defaults to bytes if a unit is not specified
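To illustrate what a more lenient reading could look like, here is a minimal sketch of a size parser that accepts either an int or a string and falls back to bytes when no unit is present. The function name `content_size_to_bytes`, the unit table, and the decimal multipliers are all assumptions made for illustration; none of this is taken from mlcroissant or from the spec.

```python
import re

# Hypothetical unit table. The decimal multipliers are an assumption,
# not something the spec text quoted above defines.
_UNITS = {"": 1, "B": 1, "KB": 10**3, "MB": 10**6, "GB": 10**9}


def content_size_to_bytes(value):
    """Return a size in bytes from an int or a string such as "7915913" or "2 kB"."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError("contentSize must be an int or a string, got bool")
    if isinstance(value, int):
        # A bare integer carries no unit, so it is read as bytes.
        return value
    if not isinstance(value, str):
        raise TypeError(f"contentSize must be an int or a string, got {type(value).__name__}")

    match = re.fullmatch(r"\s*(\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*", value)
    if match is None:
        raise ValueError(f"cannot parse contentSize: {value!r}")
    number, unit = match.groups()
    unit = unit.upper()
    if unit not in _UNITS:
        raise ValueError(f"unknown unit in contentSize: {value!r}")

    # Keep integer strings exact; only go through float for fractional values.
    amount = float(number) if "." in number else int(number)
    return int(amount * _UNITS[unit])
```

With a helper along these lines, `7915913` and `"7915913"` would both resolve to the same number of bytes, which is the behaviour being asked for here.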
{
  "@context": {
    "@language": "en",
    "@vocab": "https://schema.org/",
    "citeAs": "cr:citeAs",
    "column": "cr:column",
    "conformsTo": "dct:conformsTo",
    "cr": "http://mlcommons.org/croissant/",
    "rai": "http://mlcommons.org/croissant/RAI/",
    "data": {
      "@id": "cr:data",
      "@type": "@json"
    },
    "dataType": {
      "@id": "cr:dataType",
      "@type": "@vocab"
    },
    "dct": "http://purl.org/dc/terms/",
    "examples": {
      "@id": "cr:examples",
      "@type": "@json"
    },
    "extract": "cr:extract",
    "field": "cr:field",
    "fileProperty": "cr:fileProperty",
    "fileObject": "cr:fileObject",
    "fileSet": "cr:fileSet",
    "format": "cr:format",
    "includes": "cr:includes",
    "isLiveDataset": "cr:isLiveDataset",
    "jsonPath": "cr:jsonPath",
    "key": "cr:key",
    "md5": "cr:md5",
    "parentField": "cr:parentField",
    "path": "cr:path",
    "recordSet": "cr:recordSet",
    "references": "cr:references",
    "regex": "cr:regex",
    "repeated": "cr:repeated",
    "replace": "cr:replace",
    "sc": "https://schema.org/",
    "separator": "cr:separator",
    "source": "cr:source",
    "subField": "cr:subField",
    "transform": "cr:transform"
  },
  "@type": "sc:Dataset",
  "name": "minimal_example_with_recommended_fields",
  "description": "This is a minimal example, including the required and the recommended fields.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "url": "https://example.com/dataset/recipes/minimal-recommended",
  "distribution": [
    {
      "@type": "sc:FileObject",
      "@id": "minimal.csv",
      "name": "minimal.csv",
      "contentUrl": "data/minimal.csv",
      "encodingFormat": "text/csv",
      "contentSize": 7915913,
      "sha256": "48a7c257f3c90b2a3e529ddd2cca8f4f1bd8e49ed244ef53927649504ac55354"
    }
  ]
}
Will this fix it?
"contentSize": "7915913",
@bact yes, sure, quoting the value will make the validation error go away. I'm just suggesting that ints could be allowed to make it a bit easier for implementers.
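Until integers are accepted, the quoting workaround can be applied programmatically before validation. The following is a minimal sketch using only the standard `json` module; the helper name `quote_content_sizes` is hypothetical, and it assumes every entry under `distribution` is a JSON object, as in the example above.

```python
import json


def quote_content_sizes(metadata):
    """Return a copy of the metadata with integer contentSize values turned into strings."""
    # Round-tripping through JSON is a cheap deep copy for JSON-compatible data.
    patched = json.loads(json.dumps(metadata))
    for item in patched.get("distribution", []):
        size = item.get("contentSize")
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(size, int) and not isinstance(size, bool):
            item["contentSize"] = str(size)
    return patched


# Trimmed-down stand-in for the metadata object from this issue.
example = {"distribution": [{"@id": "minimal.csv", "contentSize": 7915913}]}
print(json.dumps(quote_content_sizes(example), indent=2))
```

The patched object can then be written back to a `.jsonld` file and passed to the validation command as before.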
I agree, we should support integers as a type for contentSize.