autoconverted parquet file has too big cells

Open severo opened this issue 1 year ago • 18 comments

See https://huggingface.co/datasets/imvladikon/hebrew_speech_coursera/discussions/1#6523d448b623a04e6c2f118a

From the logs I see this error

TooBigRows: Rows from parquet row groups are too big to be read: 313.33 MiB (max=286.10 MiB)

It looks like an issue on our side: the row groups in the parquet files at https://huggingface.co/datasets/imvladikon/hebrew_speech_coursera/tree/refs%2Fconvert%2Fparquet/default/train are too big to be read by the api. We'll investigate this, thanks for reporting

severo avatar Oct 10 '23 12:10 severo

Launched the recreation of imvladikon/hebrew_speech_coursera.

severo avatar Nov 06 '23 17:11 severo

-> JobManagerCrashedError 😮

severo avatar Nov 06 '23 17:11 severo

UnexpectedApiError for https://huggingface.co/datasets/danielz01/landmarks

libcommon.parquet_utils.TooBigRows: Rows from parquet row groups are too big to be read: 958.13 MiB (max=286.10 MiB)

lhoestq avatar Nov 23 '23 18:11 lhoestq

Note that the issue is that the cells are too big (in bytes) and it's not related to the row groups (I was mistaken in the title)

severo avatar Nov 23 '23 18:11 severo
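For illustration only, a size guard of the kind that produces the error above could look like the sketch below. This is not the libcommon code: the function names, the exception class and the 300,000,000-byte limit are assumptions (that limit was picked because it formats as 286.10 MiB, the maximum shown in the logs). It sums the byte sizes of the cells it would have to read and refuses when the total exceeds the limit, which is why a handful of very large cells is enough to trigger it regardless of how many rows are involved.

```python
# Assumed limit: 300_000_000 bytes, which formats as 286.10 MiB.
MAX_BYTES = 300_000_000


class TooBigRowsSketch(Exception):
    """Hypothetical stand-in for the TooBigRows error seen in the logs."""


def mib(num_bytes: int) -> str:
    """Format a byte count in MiB with two decimals."""
    return f"{num_bytes / 2**20:.2f} MiB"


def check_rows_size(cell_sizes: list[int], max_bytes: int = MAX_BYTES) -> int:
    """Return the total size in bytes, or raise if it exceeds max_bytes."""
    total = sum(cell_sizes)
    if total > max_bytes:
        raise TooBigRowsSketch(
            "Rows from parquet row groups are too big to be read: "
            f"{mib(total)} (max={mib(max_bytes)})"
        )
    return total
```

Calling `check_rows_size([200_000_000, 200_000_000])` raises, because two 200 MB cells already exceed the assumed limit.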

Same UnexpectedApiError for https://huggingface.co/datasets/osunlp/Mind2Web, row group is 564MB for 100 rows

lhoestq avatar Nov 27 '23 13:11 lhoestq

Quoting lhoestq: "row group is 564MB for 100 rows"

The issue is that we don't allow big "cells". What should we do? Improve the error message? Allow big cells? Truncate?

severo avatar Nov 28 '23 11:11 severo

For the UI, the best option is to truncate, and a bonus would be to let the user click to expand a row.

lhoestq avatar Nov 28 '23 14:11 lhoestq

So I think we should add a query parameter, such as "full: boolean" or "truncate: boolean", to /rows, /search and /filter.

severo avatar Nov 28 '23 14:11 severo
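A minimal sketch of the truncation idea discussed above, assuming a per-cell byte limit and a boolean `full` flag that disables truncation. None of these names come from the datasets-server code: `truncate_cell`, `truncate_row`, `max_cell_bytes` and the returned list of truncated column names are all hypothetical. Only strings and raw bytes are shortened here; other values pass through unchanged.

```python
from typing import Any


def truncate_cell(value: Any, max_bytes: int) -> tuple[Any, bool]:
    """Return (possibly shortened value, whether it was truncated)."""
    if isinstance(value, str):
        raw = value.encode("utf-8")
        if len(raw) > max_bytes:
            # errors="ignore" drops a multi-byte character cut in half.
            return raw[:max_bytes].decode("utf-8", errors="ignore"), True
    elif isinstance(value, (bytes, bytearray)):
        if len(value) > max_bytes:
            return bytes(value[:max_bytes]), True
    return value, False


def truncate_row(
    row: dict[str, Any], max_cell_bytes: int = 100, full: bool = False
) -> tuple[dict[str, Any], list[str]]:
    """Truncate every cell of a row unless full=True (hypothetical flag)."""
    if full:
        return dict(row), []
    out: dict[str, Any] = {}
    truncated: list[str] = []
    for column, value in row.items():
        out[column], was_cut = truncate_cell(value, max_cell_bytes)
        if was_cut:
            truncated.append(column)
    return out, truncated
```

Returning the names of the truncated columns would let a UI mark those cells and request the full row only when the user clicks to expand it.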

Also reported here: https://huggingface.co/datasets/UmaDiffusion/ULTIMA/discussions/1

severo avatar Dec 11 '23 22:12 severo

Somewhat related: https://huggingface.co/datasets/mikehemberger/inat_2021_train_mini_plantae

We should truncate more aggressively, even for /first-rows

severo avatar Dec 15 '23 08:12 severo

Hi, thanks for bringing this up. You are probably aware of this, but once I click on the "Viewer", the data is visible there. Best, [attached screenshots: IMG_5035, IMG_5036]

mikehemberger avatar Dec 15 '23 10:12 mikehemberger

Here is another "raw" image dataset that I've uploaded via the web interface (assuming it was faster than pushing it from a notebook). Hope this helps. Best, M https://huggingface.co/datasets/mikehemberger/medicinal-plants/discussions/2#657c317f1953a4194ad0952d

mikehemberger avatar Dec 15 '23 11:12 mikehemberger

The issue for https://huggingface.co/datasets/mikehemberger/inat_2021_train_mini_plantae is about first_rows truncation, not about autoconverted parquet files, no?

Maybe open a separate issue.

lhoestq avatar Dec 15 '23 11:12 lhoestq

Yes, I brought the discussion here, but you're right: the issue is only somewhat related. Maybe we can fix both at the same time, though.

severo avatar Dec 15 '23 11:12 severo

Created https://github.com/huggingface/datasets-server/issues/2215 for https://huggingface.co/datasets/mikehemberger/inat_2021_train_mini_plantae

severo avatar Dec 18 '23 13:12 severo

See here too: https://huggingface.co/datasets/ideepankarsharma2003/MidjourneV6_Image_small/discussions/1

severo avatar Feb 05 '24 10:02 severo

Another one: https://huggingface.co/datasets/Libertify/stock-sight/discussions/3

severo avatar Feb 09 '24 11:02 severo

was there any action on this?

twobob avatar Jul 17 '24 09:07 twobob