416 when reading from some huggingface datasets

Open universalmind303 opened this issue 4 months ago • 11 comments

Describe the bug

Getting 416 Requested Range Not Satisfiable when trying to read from hf://datasets/<org>/<repo>.

To Reproduce

import daft

df = daft.read_parquet("hf://datasets/laion/conceptual-captions-12m-webdataset")
df.show()


DaftCoreException: DaftError::External Unable to open file https://huggingface.co/api/datasets/laion/conceptual-captions-12m-webdataset/parquet/default/train/0.parquet: reqwest::Error { kind: Status(416, Some(ReasonPhrase(b"Requested Range Not Satisfiable"))), url: "https://huggingface.co/api/datasets/laion/conceptual-captions-12m-webdataset/parquet/default/train/0.parquet" }

Expected behavior

able to read the dataset

Component(s)

Parquet

Additional context

No response

universalmind303 avatar Jul 15 '25 23:07 universalmind303

note: this does work for some datasets.

df = daft.read_parquet("hf://datasets/universalmind303/daft-docs")
df.show()
╭──────────────┬─────────────┬──────────────┬─────────────┬─────────╮
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species │
│ ---          ┆ ---         ┆ ---          ┆ ---         ┆ ---     │
│ Float64      ┆ Float64     ┆ Float64      ┆ Float64     ┆ Utf8    │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 5.1          ┆ 3.5         ┆ 1.4          ┆ 0.2         ┆ setosa  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4.9          ┆ 3           ┆ 1.4          ┆ 0.2         ┆ setosa  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4.7          ┆ 3.2         ┆ 1.3          ┆ 0.2         ┆ setosa  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4.6          ┆ 3.1         ┆ 1.5          ┆ 0.2         ┆ setosa  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 5            ┆ 3.6         ┆ 1.4          ┆ 0.2         ┆ setosa  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 5.4          ┆ 3.9         ┆ 1.7          ┆ 0.4         ┆ setosa  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4.6          ┆ 3.4         ┆ 1.4          ┆ 0.3         ┆ setosa  │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 5            ┆ 3.4         ┆ 1.5          ┆ 0.2         ┆ setosa  │
╰──────────────┴─────────────┴──────────────┴─────────────┴─────────╯
(Showing first 8 rows)

universalmind303 avatar Jul 15 '25 23:07 universalmind303

When I originally introduced https://github.com/Eventual-Inc/Daft/pull/2701, I tested it on several datasets, including some of the popular ones at the time, and it worked on all of them.

I wonder if Hugging Face has changed their APIs since then?

universalmind303 avatar Jul 15 '25 23:07 universalmind303

On further debugging, this looks like a problem with the Hugging Face APIs.

If I run the same query multiple times, it eventually goes through and returns the dataframe, but roughly 90% of the time it fails with a 416.
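Since a rerun eventually succeeds, a blunt stopgap is to retry the whole call. This is only a sketch: `retry_on_416` is a hypothetical helper, not part of Daft, and it assumes the failure can be recognized by "416" appearing in the exception message (as it does in the error above).

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_on_416(action: Callable[[], T], attempts: int = 10, delay_s: float = 1.0) -> T:
    """Run `action`, retrying when it fails with an error that mentions 416.

    Hypothetical helper for illustration only. Assumption: the transient
    failure shows up as an exception whose message contains "416".
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return action()
        except Exception as exc:
            # Re-raise unrelated errors, and give up after the last attempt.
            if "416" not in str(exc) or attempt == attempts:
                raise
            time.sleep(delay_s)
    raise AssertionError("unreachable")


# Example usage (requires daft and network access):
# import daft
# retry_on_416(
#     lambda: daft.read_parquet(
#         "hf://datasets/laion/conceptual-captions-12m-webdataset"
#     ).show()
# )
```

This hides the symptom rather than fixing anything, and matching on the message text is fragile.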

universalmind303 avatar Jul 16 '25 00:07 universalmind303

I have confirmed that this is an issue with the Hugging Face APIs. I suspect a CDN/caching issue, as the failing responses often carry the header x-cache: Error from cloudfront and an unexpected content range such as content-range: bytes */177.

You can reproduce it by performing this request several times. Sometimes it returns 416, other times it's the proper 206.

curl -v -L -H "Range: bytes=217875070-218006142" "https://huggingface.co/api/datasets/HuggingFaceTB/smoltalk2/parquet/Mid/Llama_Nemotron_Post_Training_Dataset_reasoning_r1/0.parquet"
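To put a number on how often this happens, the same request can be repeated from Python and the status codes tallied. A minimal sketch using only the standard library; `fetch_range_status` and `tally_statuses` are names I made up for this example.

```python
from collections import Counter
from typing import Callable
import urllib.error
import urllib.request


def fetch_range_status(url: str, byte_range: str) -> int:
    """Issue one ranged GET and return the final HTTP status code."""
    request = urllib.request.Request(url, headers={"Range": byte_range})
    try:
        # urlopen follows redirects, similar to `curl -L`.
        with urllib.request.urlopen(request, timeout=30) as response:
            return response.status
    except urllib.error.HTTPError as err:
        # Error statuses such as 416 are raised as HTTPError.
        return err.code


def tally_statuses(fetch: Callable[[], int], attempts: int = 20) -> Counter:
    """Call `fetch` repeatedly and count how often each status appears."""
    return Counter(fetch() for _ in range(attempts))


# Example usage (performs real network requests):
# url = (
#     "https://huggingface.co/api/datasets/HuggingFaceTB/smoltalk2/parquet/"
#     "Mid/Llama_Nemotron_Post_Training_Dataset_reasoning_r1/0.parquet"
# )
# print(tally_statuses(lambda: fetch_range_status(url, "bytes=217875070-218006142")))
```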

I'm trying to figure out where to open up an issue with huggingface on this, but I don't believe any action on our end is needed.

universalmind303 avatar Jul 16 '25 15:07 universalmind303

opened up an issue on the datasets repo

universalmind303 avatar Jul 16 '25 18:07 universalmind303

Very interesting, it seems to be specifically because of the Range request. From my understanding, HTTP-compatible software such as CDNs or caching systems is not required to support range requests, which is essentially what the 416 reflects. I'm not sure why this only fails sometimes; maybe they have a mix of software.

Do you know when such requests are submitted? Are they only made for Parquet footer requests? If so, changing the range from bytes=217875070-218006142 to bytes=-131072 seems to work consistently. But if it's requests for middle portions of the file, then I'm not sure.
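For reference, the two Range header forms being compared can be built as follows. This is just a sketch of the HTTP syntax with hypothetical helper names. The suffix form asks for the last N bytes, so the client does not need to know the file size.

```python
from typing import Tuple


def absolute_range(start: int, end: int) -> str:
    """Range header value for bytes start..end (both inclusive)."""
    if start < 0 or end < start:
        raise ValueError("need 0 <= start <= end")
    return f"bytes={start}-{end}"


def suffix_range(length: int) -> str:
    """Range header value for the last `length` bytes of the resource."""
    if length <= 0:
        raise ValueError("length must be positive")
    return f"bytes=-{length}"


def resolve_suffix(length: int, file_size: int) -> Tuple[int, int]:
    """Inclusive byte offsets that a suffix range covers for a known file size."""
    if length <= 0 or file_size <= 0:
        raise ValueError("length and file_size must be positive")
    start = max(file_size - length, 0)
    return start, file_size - 1


print(absolute_range(217875070, 218006142))  # bytes=217875070-218006142
print(suffix_range(131072))                  # bytes=-131072
```

The suffix form only covers reads anchored at the end of the file, so it would not help with ranges in the middle.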

srilman avatar Jul 16 '25 23:07 srilman

OK, it looks like that request was for the Parquet footer, but later requests might not be. I'm not sure.

If we're just worried about the footer, then I think this PR will help with it: https://github.com/Eventual-Inc/Daft/pull/4775

srilman avatar Jul 16 '25 23:07 srilman

> Do you know when such requests are submitted? Are they done only for Parquet footer requests?

For Parquet specifically, we do a lot of range requests. Any projection pushdown will issue range requests, and getting the metadata does range requests as well.
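To illustrate the metadata case: a Parquet file ends with the footer metadata, then a 4-byte little-endian footer length, then the magic bytes PAR1. A reader can fetch the tail, decode the length, and work out which byte range holds the footer. The sketch below is not Daft's reader; `footer_byte_range` is a hypothetical helper, and it assumes an unencrypted file.

```python
import struct
from typing import Tuple

PARQUET_MAGIC = b"PAR1"
TRAILER_SIZE = 8  # 4-byte footer length + 4-byte magic


def footer_byte_range(tail: bytes, file_size: int) -> Tuple[int, int]:
    """Return inclusive (start, end) offsets of the footer metadata.

    `tail` must hold at least the last 8 bytes of the file.
    """
    if len(tail) < TRAILER_SIZE:
        raise ValueError("need at least the last 8 bytes of the file")
    if tail[-4:] != PARQUET_MAGIC:
        raise ValueError("missing PAR1 magic; not an unencrypted Parquet file")
    # The footer length is stored little-endian just before the magic.
    (footer_len,) = struct.unpack("<I", tail[-8:-4])
    start = file_size - TRAILER_SIZE - footer_len
    end = file_size - TRAILER_SIZE - 1
    # The file also begins with a 4-byte magic, so the footer cannot start earlier.
    if footer_len == 0 or start < 4:
        raise ValueError("footer length is inconsistent with the file size")
    return start, end


# Example with a synthetic trailer: a 100-byte footer in a 1000-byte file.
trailer = struct.pack("<I", 100) + PARQUET_MAGIC
print(footer_byte_range(trailer, 1000))  # (892, 991)
```

Column chunk reads for projection pushdown use offsets recorded in that footer, so they can land anywhere in the file rather than only at the end.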

universalmind303 avatar Jul 17 '25 14:07 universalmind303

This is still failing on some datasets, as seen in #4907.

df = daft.read_parquet("hf://datasets/stanfordnlp/snli/")
df.write_parquet("./out")

universalmind303 avatar Aug 05 '25 18:08 universalmind303

@rchowell it looks like it's failing on collect/write because, when it's being collected, the URLs are resolved to concrete paths such as https://huggingface.co/api/datasets/stanfordnlp/snli/parquet/plain_text/test/0.parquet. So the root issue of Hugging Face producing invalid cache entries is still the same, but our workaround doesn't work on write because write_parquet uses the resolved URLs and, as a result, goes through the HttpSource instead of the HFSource.
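To make the distinction concrete, here is a sketch of how a resolved URL could be recognized as pointing at the Hugging Face datasets parquet API. This is not how Daft selects a source: `is_hf_dataset_parquet_url` is a hypothetical helper, and the `api/datasets/<org>/<repo>/parquet/...` path shape is an assumption taken from the URLs in this thread.

```python
from urllib.parse import urlparse


def is_hf_dataset_parquet_url(url: str) -> bool:
    """Guess whether `url` is a resolved Hugging Face dataset parquet URL.

    Illustrative only. Assumes the path shape seen in this thread:
    /api/datasets/<org>/<repo>/parquet/...
    """
    parsed = urlparse(url)
    if parsed.scheme != "https" or parsed.hostname != "huggingface.co":
        return False
    parts = [part for part in parsed.path.split("/") if part]
    return (
        len(parts) >= 5
        and parts[:2] == ["api", "datasets"]
        and parts[4] == "parquet"
    )


resolved = (
    "https://huggingface.co/api/datasets/stanfordnlp/snli/"
    "parquet/plain_text/test/0.parquet"
)
print(is_hf_dataset_parquet_url(resolved))  # True
print(is_hf_dataset_parquet_url("https://example.com/data/0.parquet"))  # False
```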

universalmind303 avatar Aug 05 '25 19:08 universalmind303

UPDATE: it looks like the huggingface team has temporarily disabled the caching rule that was causing this issue.

universalmind303 avatar Aug 11 '25 16:08 universalmind303