dataset-viewer autoconverted parquet file has too big cells

See https://huggingface.co/datasets/imvladikon/hebrew_speech_coursera/discussions/1#6523d448b623a04e6c2f118a

From the logs, I see this error:

TooBigRows: Rows from parquet row groups are too big to be read: 313.33 MiB (max=286.10 MiB)

It looks like an issue on our side: the row groups in the parquet files at https://huggingface.co/datasets/imvladikon/hebrew_speech_coursera/tree/refs%2Fconvert%2Fparquet/default/train are too big to be read by the API. We'll investigate this. Thanks for reporting.
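
For readers unfamiliar with the error, here is a minimal sketch of the kind of size guard that could produce such a message. It is not the libcommon implementation: the class, the function names, and the 300,000,000-byte limit are assumptions for illustration (the limit is only inferred from the fact that 300,000,000 bytes is about 286.10 MiB).

```python
# Illustrative sketch only; NOT the actual libcommon.parquet_utils code.
# Assumption: the limit is 300,000,000 bytes, which prints as 286.10 MiB.
MAX_BYTES = 300_000_000


class TooBigRows(Exception):
    """Hypothetical stand-in for the error seen in the logs."""


def to_mib(num_bytes: int) -> float:
    """Convert a byte count to mebibytes (1 MiB = 1024 * 1024 bytes)."""
    return num_bytes / (1024 * 1024)


def check_rows_size(num_bytes: int, max_bytes: int = MAX_BYTES) -> None:
    """Raise TooBigRows when the data to read exceeds the byte budget."""
    if num_bytes > max_bytes:
        raise TooBigRows(
            "Rows from parquet row groups are too big to be read: "
            f"{to_mib(num_bytes):.2f} MiB (max={to_mib(max_bytes):.2f} MiB)"
        )
```

Calling `check_rows_size(328_550_000)` raises the hypothetical `TooBigRows`, while a small value such as `check_rows_size(1000)` returns without error.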

Oct 10 '23 12:10 severo

Launched the recreation of imvladikon/hebrew_speech_coursera.

Nov 06 '23 17:11 severo

-> JobManagerCrashedError 😮

Nov 06 '23 17:11 severo

UnexpectedApiError for https://huggingface.co/datasets/danielz01/landmarks

libcommon.parquet_utils.TooBigRows: Rows from parquet row groups are too big to be read: 958.13 MiB (max=286.10 MiB)

Nov 23 '23 18:11 lhoestq

Note that the issue is that the cells are too big (in bytes); it is not related to the row groups (I was mistaken in the title).

Nov 23 '23 18:11 severo

Same UnexpectedApiError for https://huggingface.co/datasets/osunlp/Mind2Web, row group is 564MB for 100 rows

Nov 27 '23 13:11 lhoestq

"row group is 564MB for 100 rows"

The issue is that we don't allow big "cells". What should we do? Improve the error message? Allow big cells? Truncate?

Nov 28 '23 11:11 severo

For the UI, the best option is to truncate; a bonus would be to let the user click to expand a row.

Nov 28 '23 14:11 lhoestq

So I think we should add a query parameter, such as "full: boolean" or "truncate: boolean", to /rows, /search, and /filter.
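
A rough sketch of what an optional truncation flag could look like on the server side. Everything here is hypothetical: the function names, the `max_cell_bytes` budget, and the response shape (a `row` plus a list of `truncated_cells`) are assumptions for illustration, not the existing API.

```python
from typing import Any


def truncate_cell(value: Any, max_bytes: int) -> tuple[Any, bool]:
    """Return (possibly shortened value, whether it was truncated).

    Only str and bytes cells are shortened; other types pass through.
    """
    if isinstance(value, bytes):
        if len(value) <= max_bytes:
            return value, False
        return value[:max_bytes], True
    if isinstance(value, str):
        encoded = value.encode("utf-8")
        if len(encoded) <= max_bytes:
            return value, False
        # errors="ignore" drops a multi-byte character cut in half.
        return encoded[:max_bytes].decode("utf-8", errors="ignore"), True
    return value, False


def format_rows(
    rows: list[dict[str, Any]],
    truncate: bool = True,       # hypothetical query parameter
    max_cell_bytes: int = 10_000,  # hypothetical per-cell budget
) -> list[dict[str, Any]]:
    """Apply per-cell truncation and record which columns were cut."""
    formatted = []
    for row in rows:
        new_row: dict[str, Any] = {}
        truncated_cells: list[str] = []
        for column, value in row.items():
            if truncate:
                value, was_truncated = truncate_cell(value, max_cell_bytes)
                if was_truncated:
                    truncated_cells.append(column)
            new_row[column] = value
        formatted.append({"row": new_row, "truncated_cells": truncated_cells})
    return formatted
```

With `truncate=False` the rows are returned unchanged, which would correspond to the "full" behavior; the UI could use `truncated_cells` to decide where to offer a click-to-expand control.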

Nov 28 '23 14:11 severo

Also reported here: https://huggingface.co/datasets/UmaDiffusion/ULTIMA/discussions/1

Dec 11 '23 22:12 severo

Somewhat related: https://huggingface.co/datasets/mikehemberger/inat_2021_train_mini_plantae

We should truncate more aggressively, even for /first-rows

Dec 15 '23 08:12 severo

Hi, thanks for bringing this up. You are probably aware of this, but once I click on the "Viewer", the data is visible there. Best,

Dec 15 '23 10:12 mikehemberger

Here is another "raw" image dataset that I've uploaded via the web interface (assuming it was faster than pushing it from a notebook). Hope this helps. Best, M. https://huggingface.co/datasets/mikehemberger/medicinal-plants/discussions/2#657c317f1953a4194ad0952d

Dec 15 '23 11:12 mikehemberger

The issue for https://huggingface.co/datasets/mikehemberger/inat_2021_train_mini_plantae is about first_rows truncation, not about autoconverted parquet files, no?

Maybe open a separate issue.

Dec 15 '23 11:12 lhoestq

Yes, I brought the discussion here, but you're right, the issue is only somewhat related. Maybe we can fix both at the same time though.

Dec 15 '23 11:12 severo

Created https://github.com/huggingface/datasets-server/issues/2215 for https://huggingface.co/datasets/mikehemberger/inat_2021_train_mini_plantae

Dec 18 '23 13:12 severo

See here too: https://huggingface.co/datasets/ideepankarsharma2003/MidjourneV6_Image_small/discussions/1

Feb 05 '24 10:02 severo

Another one: https://huggingface.co/datasets/Libertify/stock-sight/discussions/3

Feb 09 '24 11:02 severo

Was there any action on this?

Jul 17 '24 09:07 twobob