
Overly restrictive validation for content_size property

Open amercader opened this issue 11 months ago • 3 comments

With the metadata object below, running the validation command fails:

mlcroissant validate --jsonld croissant_minimal.jsonld
E0210 11:11:52.866241 126556827741056 validate.py:55] Found the following 1 error(s) during the validation:
  -  `content_size` should have type https://schema.org/Text, but got int

Integers are a reasonable type for describing file sizes, and many systems output them as such. Text makes sense when a unit needs to be provided, but the spec states that bytes should be assumed if no unit is given, so that should cover the use of ints:

File size in (mega/kilo/…)bytes. Defaults to bytes if a unit is not specified

{
  "@context": {
    "@language": "en",
    "@vocab": "https://schema.org/",
    "citeAs": "cr:citeAs",
    "column": "cr:column",
    "conformsTo": "dct:conformsTo",
    "cr": "http://mlcommons.org/croissant/",
    "rai": "http://mlcommons.org/croissant/RAI/",
    "data": {
      "@id": "cr:data",
      "@type": "@json"
    },
    "dataType": {
      "@id": "cr:dataType",
      "@type": "@vocab"
    },
    "dct": "http://purl.org/dc/terms/",
    "examples": {
      "@id": "cr:examples",
      "@type": "@json"
    },
    "extract": "cr:extract",
    "field": "cr:field",
    "fileProperty": "cr:fileProperty",
    "fileObject": "cr:fileObject",
    "fileSet": "cr:fileSet",
    "format": "cr:format",
    "includes": "cr:includes",
    "isLiveDataset": "cr:isLiveDataset",
    "jsonPath": "cr:jsonPath",
    "key": "cr:key",
    "md5": "cr:md5",
    "parentField": "cr:parentField",
    "path": "cr:path",
    "recordSet": "cr:recordSet",
    "references": "cr:references",
    "regex": "cr:regex",
    "repeated": "cr:repeated",
    "replace": "cr:replace",
    "sc": "https://schema.org/",
    "separator": "cr:separator",
    "source": "cr:source",
    "subField": "cr:subField",
    "transform": "cr:transform"
  },
  "@type": "sc:Dataset",
  "name": "minimal_example_with_recommended_fields",
  "description": "This is a minimal example, including the required and the recommended fields.",
  "license": "https://creativecommons.org/licenses/by/4.0/",
  "url": "https://example.com/dataset/recipes/minimal-recommended",
  "distribution": [
    {
      "@type": "sc:FileObject",
      "@id": "minimal.csv",
      "name": "minimal.csv",
      "contentUrl": "data/minimal.csv",
      "encodingFormat": "text/csv",
      "contentSize": 7915913,
      "sha256": "48a7c257f3c90b2a3e529ddd2cca8f4f1bd8e49ed244ef53927649504ac55354"
    }
  ]
}
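
For illustration, the rule quoted from the spec (bytes are assumed when no unit is given) could be applied to both integers and strings roughly as follows. This is only a sketch: `content_size_to_bytes` is a hypothetical helper, not part of mlcroissant, and the decimal unit multipliers are an assumption, since the quoted spec text does not define them.

```python
import re
from decimal import Decimal

# Assumed decimal multipliers; the quoted spec text does not define them.
_UNIT_FACTORS = {"": 1, "B": 1, "KB": 1000, "MB": 1000**2, "GB": 1000**3}

# A non-negative number followed by an optional alphabetic unit.
_SIZE_RE = re.compile(r"^\s*(\d+(?:\.\d+)?)\s*([A-Za-z]*)\s*$")


def content_size_to_bytes(value):
    """Hypothetical helper: normalize a contentSize value to a byte count."""
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool):
        raise TypeError("contentSize must be an int or a string")
    if isinstance(value, int):
        if value < 0:
            raise ValueError("contentSize must not be negative")
        return value  # no unit given, so the value is already in bytes
    if isinstance(value, str):
        match = _SIZE_RE.match(value)
        if match is None:
            raise ValueError(f"cannot parse contentSize: {value!r}")
        number, unit = match.groups()
        factor = _UNIT_FACTORS.get(unit.upper())
        if factor is None:
            raise ValueError(f"unknown unit in contentSize: {unit!r}")
        # Decimal avoids float rounding on large byte counts.
        return int(Decimal(number) * factor)
    raise TypeError("contentSize must be an int or a string")
```

With this approach, `7915913` and `"7915913"` resolve to the same byte count, so accepting integers would not change the meaning of existing string values.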

amercader avatar Feb 10 '25 10:02 amercader

Will this fix it?

"contentSize": "7915913",

bact avatar Mar 06 '25 02:03 bact

@bact yes, sure, quoting the value will make the validation error go away. I'm just suggesting that ints could be allowed to make it a bit easier for implementers.
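
Until integers are accepted, implementers whose systems emit integer sizes could apply the quoting workaround programmatically before validating. The following is a minimal sketch using only the standard library; `quote_content_sizes` is a hypothetical helper name, not an mlcroissant function.

```python
import json


def quote_content_sizes(node):
    """Hypothetical helper: copy a JSON-LD tree, turning integer
    contentSize values into strings."""
    if isinstance(node, dict):
        result = {}
        for key, value in node.items():
            is_int = isinstance(value, int) and not isinstance(value, bool)
            if key == "contentSize" and is_int:
                result[key] = str(value)
            else:
                result[key] = quote_content_sizes(value)
        return result
    if isinstance(node, list):
        return [quote_content_sizes(item) for item in node]
    return node


# Trimmed-down version of the distribution entry from the report above.
metadata = {
    "distribution": [
        {"@type": "sc:FileObject", "name": "minimal.csv", "contentSize": 7915913}
    ]
}
print(json.dumps(quote_content_sizes(metadata), indent=2))
```

The input is left unmodified; the returned copy can be written back to a `.jsonld` file and passed to the validation command.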

amercader avatar Mar 06 '25 13:03 amercader

I agree, we should support integers as a type for contentSize.

benjelloun avatar Mar 06 '25 13:03 benjelloun