Standardize MIME Types for Compressed tar Files

Open mkuchnik opened this issue 1 year ago • 0 comments

MIME types unfortunately don't officially support tar formats: there is no registered type for tar, so Croissant uses "application/x-tar", where the "x-" prefix marks an unregistered (experimental) type. Compression formats such as gzip and zlib do have registered types, "application/gzip" and "application/zlib", respectively. Since tar composes with compression (e.g., .tar.gz), a single MIME type cannot describe both layers, and the combination commonly causes confusion.

Croissant should provide a recommendation for what to do in such cases. In the loader, there is an implicit conversion from ".tar.gz" to ".tar" here. Meanwhile, the editor classifies ".tar.gz" files as ".gz" (based on extensions and file headers). Thus, the same file gets a different type depending on the implementation. For example, a user may see the following error when using a ".tar.gz" file in the editor, even though the same extension is already used in the flores-200 dataset:

NotImplementedError: File type FileType(name='GZIP', encoding_format='application/gzip', extensions=['gz']) is not supported. Please, open an issue on GitHub: https://github.com/mlcommons/croissant/issues/new
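
To make the mismatch concrete, here is a minimal sketch of the two classification strategies described above. It is not Croissant's actual code: the function names and lookup tables are hypothetical, and the "editor-like" function looks only at the extension, not at file headers.

```python
from pathlib import PurePosixPath

# Hypothetical lookup tables for illustration; not Croissant's real ones.
COMPRESSION_SUFFIXES = {".gz", ".bz2", ".xz"}
TYPE_BY_SUFFIX = {
    ".tar": "application/x-tar",
    ".gz": "application/gzip",
}


def last_suffix_type(filename):
    """Editor-like behavior: classify by the final extension only."""
    suffix = PurePosixPath(filename).suffix.lower()
    return TYPE_BY_SUFFIX.get(suffix)


def stripped_suffix_type(filename):
    """Loader-like behavior: drop one compression suffix, then classify."""
    path = PurePosixPath(filename)
    inner = path.with_suffix("")
    # Only strip the compression suffix if another extension sits under it.
    if path.suffix.lower() in COMPRESSION_SUFFIXES and inner.suffix:
        path = inner
    return TYPE_BY_SUFFIX.get(path.suffix.lower())


name = "dataset.tar.gz"
print(last_suffix_type(name))      # application/gzip
print(stripped_suffix_type(name))  # application/x-tar
```

Both answers are defensible in isolation, which is why a written recommendation is needed, not just a fix in one of the two tools.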

This can be fixed if the right approach is formalized. Using "application/x-tar+gzip" or similar (the spec doc is using "application/x-gzip") would prevent confusion, especially when FileObjects may be confused with FileSets. Otherwise, Croissant can stick to one (e.g., "application/gzip") and use implicit behavior to attempt a recovery for common but unspecified cases (e.g., .tar.gz).
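
As a rough sketch of the first option, Python's standard `mimetypes.guess_type` already separates the content type from the compression encoding, so a composed string could be derived from the pair. The helper name `composed_encoding_format` and the gzip-only fallback table are assumptions for illustration, and "application/x-tar+gzip" is the unregistered string proposed in this issue, not an existing standard.

```python
import mimetypes

# Assumption: only gzip has a standalone type in this sketch.
STANDALONE_COMPRESSION_TYPES = {"gzip": "application/gzip"}


def composed_encoding_format(filename):
    """Return a single string that describes both archive and compression."""
    media_type, encoding = mimetypes.guess_type(filename)
    if media_type and encoding:
        # e.g. ("application/x-tar", "gzip") -> "application/x-tar+gzip"
        return f"{media_type}+{encoding}"
    if encoding:
        # A bare compressed file with no recognizable inner type.
        return STANDALONE_COMPRESSION_TYPES.get(encoding)
    return media_type


for name in ("flores.tar.gz", "data.tar", "data.gz"):
    print(name, "->", composed_encoding_format(name))
```

The second option (stick to "application/gzip" and recover implicitly) would instead keep the string unchanged and have consumers inspect the inner extension or file header for the ".tar.gz" case.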

mkuchnik, Feb 21 '24 04:02