Sylvain Lesage
When the number of columns is above 1,000, we don't process the split. See https://github.com/huggingface/datasets-server/issues/1143. Should we instead "truncate", i.e. process only the first 1,000 columns, and give a hint...
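A minimal sketch of the truncation idea, assuming a hypothetical `MAX_COLUMNS` constant and helper name (neither exists in the codebase as described here):

```python
# Hypothetical sketch: instead of refusing to process a split with too
# many columns, keep only the first MAX_COLUMNS and flag the truncation.
MAX_COLUMNS = 1000


def truncate_columns(column_names: list[str]) -> tuple[list[str], bool]:
    """Return the columns to process and whether truncation happened."""
    if len(column_names) <= MAX_COLUMNS:
        return column_names, False
    return column_names[:MAX_COLUMNS], True
```

The boolean flag is what would let the API "give a hint" to the caller that the column list is incomplete.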
We currently return: ``` "sha256": "https://github.com/mlcommons/croissant/issues/80" ``` See https://github.com/mlcommons/croissant/issues/80. cc @marcenacp
e.g. https://datasets-server.huggingface.co/croissant?dataset=mnist&full=true. See https://github.com/mlcommons/croissant/blob/main/docs/howto/specify-splits.md: the splits could be specified at the RecordSet level.
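For reference, a small sketch of how the endpoint URL above is built; the helper name is illustrative, only the base URL and the `dataset`/`full` query parameters come from the example:

```python
from urllib.parse import urlencode

BASE_URL = "https://datasets-server.huggingface.co/croissant"


def croissant_url(dataset: str, full: bool = False) -> str:
    # Hypothetical helper building the Croissant endpoint URL.
    params = {"dataset": dataset}
    if full:
        params["full"] = "true"
    return BASE_URL + "?" + urlencode(params)
```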
In [moon-landing](https://github.com/huggingface/moon-landing/pull/8565#discussion_r1440224297) (internal) we call several endpoints in parallel. We could group them into one call to datasets-server, with all the available information in one response.
And regularly monitor this error (as well as any unexpected error; related: https://github.com/huggingface/datasets-server/issues/1443)
See https://huggingface.co/datasets/imvladikon/hebrew_speech_coursera/discussions/1#6523d448b623a04e6c2f118a > From the logs I see this error: `TooBigRows: Rows from parquet row groups are too big to be read: 313.33 MiB (max=286.10 MiB)` ...
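The kind of check behind this error can be sketched as a size comparison against a configured maximum; the threshold value is taken from the error message above, but the constant and function names here are illustrative, not the actual implementation:

```python
# Hypothetical sketch: a row group whose (estimated) byte size exceeds the
# configured maximum cannot be read, triggering a TooBigRows-style error.
MAX_ROW_GROUP_BYTES = int(286.10 * 1024 * 1024)  # 286.10 MiB, from the error message


def row_group_too_big(total_byte_size: int) -> bool:
    """Return True if a row group exceeds the configured read limit."""
    return total_byte_size > MAX_ROW_GROUP_BYTES
```

In the reported case, the row group weighs 313.33 MiB, which is above the 286.10 MiB limit, hence the failure.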
Do we want to replace the TypedDict objects with dataclasses? If so: note that the objects we serialize should still be serializable by orjson without any change, at the price...
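To illustrate the difference with stdlib `json` (orjson can serialize dataclass instances natively, which is what the constraint above refers to); the `RowItem` names and fields here are illustrative, not the project's actual types:

```python
import json
from dataclasses import asdict, dataclass
from typing import TypedDict


class RowItemDict(TypedDict):
    # Current style: a TypedDict instance is a plain dict at runtime,
    # so it serializes directly.
    row_idx: int
    truncated: bool


@dataclass
class RowItem:
    # Candidate replacement: a dataclass needs asdict() with stdlib json
    # (orjson handles dataclasses without conversion).
    row_idx: int
    truncated: bool


as_typed: RowItemDict = {"row_idx": 0, "truncated": False}
as_dc = RowItem(row_idx=0, truncated=False)
assert json.dumps(as_typed) == json.dumps(asdict(as_dc))
```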
See https://huggingface.co/datasets/HuggingFaceM4/WebSight/discussions/2
Some users are annoyed by the discussions opened by "parquet-convert" bot.
See https://huggingface.co/datasets/1rsh/speech-rj-hi/discussions/2 First rows gives: ``` soundfile.LibsndfileError: Error opening : Format not recognised. ``` In the trace we also see: ``` Decoding failed. ffmpeg return error code: 1. ... moov...