dataset-viewer
autoconverted parquet file has too big cells
See https://huggingface.co/datasets/imvladikon/hebrew_speech_coursera/discussions/1#6523d448b623a04e6c2f118a
From the logs I see this error:
TooBigRows: Rows from parquet row groups are too big to be read: 313.33 MiB (max=286.10 MiB)
It looks like an issue on our side: the row groups in the parquet files at https://huggingface.co/datasets/imvladikon/hebrew_speech_coursera/tree/refs%2Fconvert%2Fparquet/default/train are too big to be read by the API. We'll investigate this, thanks for reporting.
Launched the recreation of imvladikon/hebrew_speech_coursera.
-> JobManagerCrashedError 😮
UnexpectedApiError for https://huggingface.co/datasets/danielz01/landmarks:
libcommon.parquet_utils.TooBigRows: Rows from parquet row groups are too big to be read: 958.13 MiB (max=286.10 MiB)
Note that the issue is that the cells are too big (in bytes) and it's not related to the row groups (I was mistaken in the title)
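For reference, the limit shown in the message, 286.10 MiB, is what 300,000,000 bytes looks like when printed in MiB, so the threshold is probably configured in bytes. Here is a minimal sketch of such a size check. It is not the actual libcommon code: `MAX_ROW_GROUP_BYTES`, `format_mib` and `check_row_group_size` are hypothetical names, and the 300,000,000 value is an inference from the log message.

```python
MIB = 1024 * 1024

# Assumption: inferred from "max=286.10 MiB" in the logs, not read from the real config.
MAX_ROW_GROUP_BYTES = 300_000_000


class TooBigRows(Exception):
    """Raised when a row group is too large to be read."""


def format_mib(num_bytes: int) -> str:
    """Format a byte count the same way the error message does."""
    return f"{num_bytes / MIB:.2f} MiB"


def check_row_group_size(num_bytes: int, max_bytes: int = MAX_ROW_GROUP_BYTES) -> None:
    """Raise TooBigRows if the row group size exceeds the limit (hypothetical helper)."""
    if num_bytes > max_bytes:
        raise TooBigRows(
            "Rows from parquet row groups are too big to be read: "
            f"{format_mib(num_bytes)} (max={format_mib(max_bytes)})"
        )


print(format_mib(MAX_ROW_GROUP_BYTES))  # 286.10 MiB
```

Since the check is on total bytes, a row group with only a handful of rows can still fail when each row carries a very large cell (an image or audio file, for instance).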
Same UnexpectedApiError for https://huggingface.co/datasets/osunlp/Mind2Web: the row group is 564MB for 100 rows.
The issue is that we don't allow big "cells". What should we do? Improve the error message? Allow big cells? Truncate?
For the UI, the best option is to truncate, and a bonus would be to let the user click to expand a row.
So I think we should add a query parameter, like "full: boolean" or "truncate: boolean", to /rows, /search and /filter.
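To make the proposal concrete, here is a rough sketch of what a "truncate" flag could do to the rows before they are returned. Everything in it is an assumption for illustration: the function names, the `max_cell_bytes` parameter and the `truncated_cells` key are hypothetical, and only string and binary cells are handled.

```python
from typing import Any


def truncate_cell(value: Any, max_bytes: int) -> tuple[Any, bool]:
    """Return the (possibly shortened) value and whether it was cut."""
    if isinstance(value, str):
        encoded = value.encode("utf-8")
        if len(encoded) <= max_bytes:
            return value, False
        # Drop a trailing partial UTF-8 character, if the cut landed inside one.
        return encoded[:max_bytes].decode("utf-8", errors="ignore"), True
    if isinstance(value, (bytes, bytearray)):
        if len(value) <= max_bytes:
            return value, False
        return bytes(value[:max_bytes]), True
    # Other types (numbers, booleans, None) are left untouched in this sketch.
    return value, False


def render_rows(
    rows: list[dict[str, Any]], truncate: bool = True, max_cell_bytes: int = 10_000
) -> list[dict[str, Any]]:
    """Build the response rows, truncating big cells unless truncate=False."""
    rendered = []
    for row in rows:
        cells: dict[str, Any] = {}
        truncated_cells: list[str] = []
        for name, value in row.items():
            if truncate:
                value, was_cut = truncate_cell(value, max_cell_bytes)
                if was_cut:
                    truncated_cells.append(name)
            cells[name] = value
        rendered.append({"row": cells, "truncated_cells": truncated_cells})
    return rendered
```

With something like this, the UI could call the endpoint with truncation on by default and request the full row only when the user clicks to expand it.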
Also reported here: https://huggingface.co/datasets/UmaDiffusion/ULTIMA/discussions/1
Somewhat related: https://huggingface.co/datasets/mikehemberger/inat_2021_train_mini_plantae
We should truncate more aggressively, even for /first-rows
Hi,
Thanks for bringing this up. You are probably aware of this, but once I click on the "Viewer", the data is visible there.
Best,
Here is another "raw" image dataset that I've uploaded via the web interface (assuming it was faster than pushing it from a notebook): https://huggingface.co/datasets/mikehemberger/medicinal-plants/discussions/2#657c317f1953a4194ad0952d
Hope this helps.
Best, M
The issue for https://huggingface.co/datasets/mikehemberger/inat_2021_train_mini_plantae is about first_rows truncation, not about autoconverted parquet files, no?
maybe open a separate issue
yes, I brought the discussion here, but you're right, the issue is somewhat related. Maybe we can fix both at the same time though.
Created https://github.com/huggingface/datasets-server/issues/2215 for https://huggingface.co/datasets/mikehemberger/inat_2021_train_mini_plantae
See here too: https://huggingface.co/datasets/ideepankarsharma2003/MidjourneV6_Image_small/discussions/1
Another one: https://huggingface.co/datasets/Libertify/stock-sight/discussions/3
Was there any action on this?