croissant issues

Read gzipped files.

Currently, we don't have a way to read gzipped files. PR #636 introduces a hack to infer whether a file has to be opened with gzip from its name. Currently,...

ccl-core

enhancement

good first issue
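
As a point of comparison for the gzip issue above, one alternative to inferring compression from the file name is to check the two-byte gzip magic number (`0x1f 0x8b`) at the start of the file. The following is only a minimal sketch under that assumption, not the approach taken in PR #636; `open_maybe_gzip` is a hypothetical helper and is not part of `mlcroissant`.

```python
import gzip
from pathlib import Path
from typing import BinaryIO, Union

# Every gzip stream starts with these two bytes (RFC 1952).
GZIP_MAGIC = b"\x1f\x8b"


def open_maybe_gzip(path: Union[str, Path]) -> BinaryIO:
    """Open `path` for binary reading, decompressing if the content is gzip.

    Hypothetical helper: detection relies on the file's first two bytes,
    not on its name or extension.
    """
    path = Path(path)
    # Peek at the header, then reopen with the appropriate reader.
    with path.open("rb") as f:
        is_gzip = f.read(2) == GZIP_MAGIC
    if is_gzip:
        return gzip.open(path, "rb")
    return path.open("rb")
```

Sniffing the content handles gzipped files whose names lack a `.gz` suffix, at the cost of one extra open per file; it would not work unchanged for non-seekable remote streams.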

Rerun Croissant Health reports for Hugging Face and OpenML

1

marcenacp

contentUrl for each format of a file (original proprietary vs archival)

4

For some proprietary file formats such as Excel, Stata, and SPSS, Dataverse creates non-proprietary formats (TSV and RData) of uploaded files for archival purposes. In addition, a TSV version might...

pdurbin

enhancement

1.0 as a string should be a valid version for a dataset

8

Dataverse uses MAJOR.MINOR for dataset versions like 1.0, 1.1, 2.0. I believe this should be valid. I'm aware that the validator wants MAJOR.MINOR.PATCH (output below) and that under the [Version](https://mlcommons.github.io/croissant/docs/croissant-spec.html#version)...

pdurbin
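
To make the MAJOR.MINOR versus MAJOR.MINOR.PATCH distinction in the issue above concrete, here is a minimal sketch of a producer-side workaround that pads a two-part version with a zero patch component. It is an illustration only: `normalize_version` is a hypothetical helper, the patterns are simplified (no pre-release or build metadata), and treating `1.0` as `1.0.0` is an assumption rather than anything the validator or the spec prescribes.

```python
import re

# Simplified numeric patterns; leading zeros such as "01" are rejected.
_NUM = r"(?:0|[1-9]\d*)"
MAJOR_MINOR = re.compile(rf"{_NUM}\.{_NUM}")
MAJOR_MINOR_PATCH = re.compile(rf"{_NUM}\.{_NUM}\.{_NUM}")


def normalize_version(version: str) -> str:
    """Return a MAJOR.MINOR.PATCH string, padding MAJOR.MINOR with ".0".

    Hypothetical helper: the padding rule is an assumption for illustration.
    """
    if MAJOR_MINOR_PATCH.fullmatch(version):
        return version
    if MAJOR_MINOR.fullmatch(version):
        return f"{version}.0"
    raise ValueError(f"Unsupported version string: {version!r}")
```

For example, `normalize_version("1.0")` yields `"1.0.0"`, while `"2.1.3"` passes through unchanged. Padding alters the string a repository such as Dataverse displays, which is why the issue asks for `1.0` to be accepted as-is instead.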

Add schema.org sdLicense

6

Add a field for the metadata's license. See #544.

mkuchnik

enhancement

add flag to validator to ignore certain warnings

`mlcroissant validate --jsonld croissant.json` is extremely helpful but I would like to configure it to ignore certain warnings. Perhaps we could add an `--ignore` flag like `flake8` has. Here's an...

pdurbin

"images/filename" should have an attribute "@type": "https://schema.org/Text". Got http://mlcommons.org/croissant/Field instead.

4

Hi, I recently started working with Croissant on creating a dataset for semantic segmentation. The dataset has images and labels, both in *.tif format. There were no errors while programmatically...

venkanna37

`schema:Enumerations` does not exist

2

https://github.com/mlcommons/croissant/blob/main/docs/croissant.ttl:

```
croissant:ContentExtractionEnumeration a rdf:Class ;
  rdfs:label "ContexExtractionEnumeration" ;
  rdfs:comment "Specifies which content to extract from a file. One of \"all\", \"lines\", or \"lineNumbers\"." ;
  rdfs:subClassOf schema:Enumerations .
```

But...

VladimirAlexiev

use XSD datatypes not schema.org datatypes

7

Schema.org datatypes are not good: - they go against standard XSD datatypes that are the foundation of both XML and RDF. - they are tentative (don't specify a lexical representation),...

VladimirAlexiev

1.1

consider reusing CSVW and DQV

1

It's great that you reuse Schema.org. But please also consider reusing these: - [CSVW](https://csvw.org/standards.html) is for describing the semantics of CSV, or even exposing CSV tables as RDF. I see...

VladimirAlexiev
