Michael Kuchnik

7 issues by Michael Kuchnik

Add a field for the metadata's license. See #544.

enhancement

ImageNet is a common workload for benchmarking. This PR adds a recipe for running ImageNet in PyTorch with a Croissant loader, which can be useful for both system and ML...

enhancement
WIP

The [introduction notebook](https://github.com/mlcommons/croissant/blob/main/python/mlcroissant/recipes/introduction.ipynb) creates a dataset that differs from the [GPT-3 dataset](https://github.com/mlcommons/croissant/blob/main/datasets/1.0/gpt-3/metadata.json). For example, the notebook uses `record_set = "jsonl"` while the GPT-3 metadata uses `record_set = "default"`.

good first issue

Unfortunately, the [official MIME type registry](https://www.iana.org/assignments/media-types/media-types.xhtml) has no entry for tar formats. Croissant uses "application/x-tar", since "x-" is reserved for experimental types. MIME types [do support](https://www.rfc-editor.org/rfc/rfc6713.html) compression methods, such as gzip and zlib, yielding...

invalid
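To make the tar and compression distinction in the MIME type issue above concrete, here is a minimal sketch using Python's standard `mimetypes` module. It shows only how the standard library reports the archive type and the compression separately; it is not a description of how Croissant resolves encoding formats. The helper name `describe_archive` and the file names are hypothetical.

```python
import mimetypes


def describe_archive(filename):
    """Return (media_type, compression) as guessed from the file name alone."""
    # A fresh MimeTypes instance uses only the module's built-in tables,
    # so the result does not depend on system-wide mime.types files.
    db = mimetypes.MimeTypes()
    media_type, compression = db.guess_type(filename)
    return media_type, compression


if __name__ == "__main__":
    # Hypothetical file names, used only for illustration.
    for name in ("images.tar", "images.tar.gz"):
        print(name, describe_archive(name))
```

The standard library also uses the unregistered "application/x-tar" for tar archives and reports gzip as a separate encoding rather than folding it into the media type.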

A mechanism such as https://schema.org/inLanguage can be used to label languages used in a dataset. For example, for translation tasks, there may be hundreds of languages represented in the dataset....

enhancement
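As a rough illustration of the language-labeling proposal above, the sketch below attaches a schema.org `inLanguage` property to a plain JSON-LD dataset description. It assumes languages are given as BCP 47 codes, which is what schema.org recommends for this property; the dataset name, the language list, and the helper `with_languages` are hypothetical, and this is not the Croissant specification's own syntax.

```python
import json


def with_languages(metadata, languages):
    """Return a copy of the metadata with a schema.org inLanguage list added."""
    labeled = dict(metadata)
    # Assumption: one BCP 47 code per language present in the dataset.
    labeled["inLanguage"] = list(languages)
    return labeled


# Hypothetical dataset description, used only for illustration.
base = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "example-translation-corpus",
}

labeled = with_languages(base, ["en", "fr", "de"])

if __name__ == "__main__":
    print(json.dumps(labeled, indent=2))
```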

Croissant files may have a license different from that of the original dataset. A discussion with @benjelloun and Elena suggested https://schema.org/sdLicense as a starting point.

enhancement
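The distinction in the license issue above can be sketched as two separate properties on one description: schema.org `license` for the data and schema.org `sdLicense` for the structured metadata itself. This is only an illustration under assumptions; the dataset name, the particular license URLs, and the helper `licenses_of` are hypothetical choices, and the issue does not say which form Croissant would adopt.

```python
def licenses_of(metadata):
    """Return (data_license, metadata_license); either may be None if absent."""
    data_license = metadata.get("license")
    metadata_license = metadata.get("sdLicense")
    return data_license, metadata_license


# Hypothetical description: the data and the metadata file carry different licenses.
example = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "Dataset",
    "name": "example-dataset",
    # License of the underlying data (illustrative choice).
    "license": "https://creativecommons.org/licenses/by-nc/4.0/",
    # License of the structured metadata itself (illustrative choice).
    "sdLicense": "https://creativecommons.org/publicdomain/zero/1.0/",
}

if __name__ == "__main__":
    print(licenses_of(example))
```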

For peak performance, each loader may have its own way of achieving certain operations. It would be useful to offer an intermediate representation that can be "lowered" to each respective...

enhancement