
Croissant is a high-level format for machine learning datasets that brings together four rich layers.

Results 226 croissant issues

Currently, we don't have a way to read gzipped files. PR #636 introduces a hack to infer whether a file has to be opened with gzip from its name. Currently,...

enhancement
good first issue
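
As an illustration of the gzip problem described in this issue, here is a minimal Python sketch that decides whether to decompress by inspecting the file's first two bytes instead of its name. It is not the approach taken in PR #636 (which, per the issue, infers this from the file name), and the helper names `is_gzipped` and `open_maybe_gzip` are hypothetical, not part of `mlcroissant`.

```python
import gzip

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def is_gzipped(path):
    """Return True if the file at `path` starts with the gzip magic number."""
    with open(path, "rb") as f:
        return f.read(2) == GZIP_MAGIC


def open_maybe_gzip(path, encoding="utf-8"):
    """Open `path` as text, decompressing on the fly if it is gzipped.

    Hypothetical helper for illustration only; the decision is based on
    file content, so a misleading or missing extension does not matter.
    """
    if is_gzipped(path):
        return gzip.open(path, "rt", encoding=encoding)
    return open(path, "r", encoding=encoding)
```

With this sketch, a gzipped file named `data.csv` and a plain file named `data.csv.gz` would both be read correctly, which a name-based check cannot guarantee.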

For some proprietary file formats such as Excel, Stata, and SPSS, Dataverse creates non-proprietary formats (TSV and RData) of uploaded files for archival purposes. In addition, a TSV version might...

enhancement

Dataverse uses MAJOR.MINOR for dataset versions like 1.0, 1.1, 2.0. I believe this should be valid. I'm aware that the validator wants MAJOR.MINOR.PATCH (output below) and that under the [Version](https://mlcommons.github.io/croissant/docs/croissant-spec.html#version)...
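
To make the distinction in this issue concrete, the following Python sketch classifies a version string as MAJOR.MINOR.PATCH, MAJOR.MINOR, or neither. It does not reproduce the `mlcroissant` validator's logic; the function name `classify_version` and the exact patterns are assumptions made for illustration.

```python
import re

# A numeric component without leading zeros, e.g. "0", "1", "12".
_NUM = r"(?:0|[1-9]\d*)"

# Assumed patterns for this sketch, not taken from the validator.
MAJOR_MINOR_PATCH = re.compile(rf"{_NUM}\.{_NUM}\.{_NUM}")
MAJOR_MINOR = re.compile(rf"{_NUM}\.{_NUM}")


def classify_version(version):
    """Return which of the two version shapes `version` has, if any."""
    if MAJOR_MINOR_PATCH.fullmatch(version):
        return "major.minor.patch"
    if MAJOR_MINOR.fullmatch(version):
        return "major.minor"
    return "other"
```

Under these assumptions, Dataverse-style versions such as `1.0` or `2.0` fall into the `major.minor` bucket, while `1.0.0` falls into `major.minor.patch`.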

Add a field for the metadata's license. See #544

enhancement

`mlcroissant validate --jsonld croissant.json` is extremely helpful but I would like to configure it to ignore certain warnings. Perhaps we could add an `--ignore` flag like `flake8` has. Here's an...

Hi, I recently started working with Croissant on creating a dataset for semantic segmentation. The dataset has images and labels, both in *.tif format. There were no errors while programmatically...

https://github.com/mlcommons/croissant/blob/main/docs/croissant.ttl:
```
croissant:ContentExtractionEnumeration a rdf:Class ;
  rdfs:label "ContexExtractionEnumeration" ;
  rdfs:comment "Specifies which content to extract from a file. One of \"all\", \"lines\", or \"lineNumbers\"." ;
  rdfs:subClassOf schema:Enumerations .
```
But...

Schema.org datatypes are not good:
- they go against standard XSD datatypes that are the foundation of both XML and RDF.
- they are tentative (don't specify a lexical representation),...

1.1

It's great that you reuse Schema.org. But please also consider reusing these:
- [CSVW](https://csvw.org/standards.html) is for describing the semantics of CSV, or even exposing CSV tables as RDF. I see...