416 when reading from some Hugging Face datasets
Describe the bug
Getting a 416 Requested Range Not Satisfiable error when trying to read hf://datasets/<org>/<repo>
To Reproduce
import daft
df = daft.read_parquet("hf://datasets/laion/conceptual-captions-12m-webdataset")
df.show()
DaftCoreException: DaftError::External Unable to open file https://huggingface.co/api/datasets/laion/conceptual-captions-12m-webdataset/parquet/default/train/0.parquet: reqwest::Error { kind: Status(416, Some(ReasonPhrase(b"Requested Range Not Satisfiable"))), url: "https://huggingface.co/api/datasets/laion/conceptual-captions-12m-webdataset/parquet/default/train/0.parquet" }
Expected behavior
Able to read the dataset
Component(s)
Parquet
Additional context
No response
Note: this does work for some datasets. For example:
df = daft.read_parquet("hf://datasets/universalmind303/daft-docs")
df.show()
╭──────────────┬─────────────┬──────────────┬─────────────┬─────────╮
│ sepal_length ┆ sepal_width ┆ petal_length ┆ petal_width ┆ species │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ Float64 ┆ Float64 ┆ Float64 ┆ Float64 ┆ Utf8 │
╞══════════════╪═════════════╪══════════════╪═════════════╪═════════╡
│ 5.1 ┆ 3.5 ┆ 1.4 ┆ 0.2 ┆ setosa │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4.9 ┆ 3 ┆ 1.4 ┆ 0.2 ┆ setosa │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4.7 ┆ 3.2 ┆ 1.3 ┆ 0.2 ┆ setosa │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4.6 ┆ 3.1 ┆ 1.5 ┆ 0.2 ┆ setosa │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 3.6 ┆ 1.4 ┆ 0.2 ┆ setosa │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 5.4 ┆ 3.9 ┆ 1.7 ┆ 0.4 ┆ setosa │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 4.6 ┆ 3.4 ┆ 1.4 ┆ 0.3 ┆ setosa │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 5 ┆ 3.4 ┆ 1.5 ┆ 0.2 ┆ setosa │
╰──────────────┴─────────────┴──────────────┴─────────────┴─────────╯
(Showing first 8 rows)
When I originally introduced https://github.com/Eventual-Inc/Daft/pull/2701, I tested it on several other datasets, including some of the popular ones at the time, and it worked on all of them.
I wonder if Hugging Face has changed their APIs since then?
On further debugging, this is looking like a problem with the Hugging Face APIs.
If I run the same query multiple times, it will eventually go through and return the dataframe, but roughly 90% of the time it fails with a 416.
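For illustration, a minimal retry loop showing the intermittent behavior (a sketch; it just catches any exception, since the 416 surfaces as the DaftCoreException from the traceback above):

import daft

# The failure is intermittent: retrying the exact same read will
# eventually succeed, but most attempts fail with the 416.
for attempt in range(10):
    try:
        df = daft.read_parquet("hf://datasets/laion/conceptual-captions-12m-webdataset")
        df.show()
        break
    except Exception as e:  # DaftCoreException wrapping the 416
        print(f"attempt {attempt} failed: {e}")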
I have confirmed that it is an issue with the Hugging Face APIs. I suspect it's a CDN/caching issue, as the response often includes the header x-cache: Error from cloudfront and an inconsistent content range such as content-range: bytes */177.
You can reproduce this by performing the request below several times. Sometimes it's a 416, other times it's the proper 206:
curl -v -L -H "Range: bytes=217875070-218006142" "https://huggingface.co/api/datasets/HuggingFaceTB/smoltalk2/parquet/Mid/Llama_Nemotron_Post_Training_Dataset_reasoning_r1/0.parquet"
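For anyone who prefers Python over curl, a rough equivalent probe with the requests library (same URL and byte range as above):

import requests

url = ("https://huggingface.co/api/datasets/HuggingFaceTB/smoltalk2/"
       "parquet/Mid/Llama_Nemotron_Post_Training_Dataset_reasoning_r1/0.parquet")

# Repeat the ranged GET; the status flips between 206 and 416 depending
# on which cache node answers.
for attempt in range(10):
    resp = requests.get(url, headers={"Range": "bytes=217875070-218006142"})
    print(attempt, resp.status_code,
          resp.headers.get("x-cache"), resp.headers.get("content-range"))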
I'm trying to figure out where to open an issue with Hugging Face on this, but I don't believe any action is needed on our end.
Opened an issue on the datasets repo.
Very interesting, it seems to be specifically because of the Range request. From my understanding, HTTP-compatible software such as CDNs or caching systems is not required to support range requests, which is essentially what the 416 signals. I'm not sure why this only sometimes fails; maybe they have a mix of software.
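As a quick check of what the endpoint advertises (a sketch; a server that honors range requests typically sends Accept-Ranges: bytes):

import requests

url = ("https://huggingface.co/api/datasets/HuggingFaceTB/smoltalk2/"
       "parquet/Mid/Llama_Nemotron_Post_Training_Dataset_reasoning_r1/0.parquet")

# HEAD the file and inspect whether the responding node claims to
# support byte ranges at all.
resp = requests.head(url, allow_redirects=True)
print(resp.status_code, resp.headers.get("Accept-Ranges"))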
Do you know when such requests are submitted? Are they done only for Parquet footer requests? If so, changing the range from bytes=217875070-218006142 to bytes=-131072 seems to work consistently. But if there are requests for middle portions of the file, then I'm not sure.
OK, it looks like that request was for the Parquet footer, but later requests might not be. I'm not sure.
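For reference, the suffix-range variant mentioned above, as a sketch with the requests library (same file as the curl example):

import requests

url = ("https://huggingface.co/api/datasets/HuggingFaceTB/smoltalk2/"
       "parquet/Mid/Llama_Nemotron_Post_Training_Dataset_reasoning_r1/0.parquet")

# A suffix range asks for the last N bytes (here 131072, a footer-sized
# read) without needing to know the file length up front.
resp = requests.get(url, headers={"Range": "bytes=-131072"})
print(resp.status_code, resp.headers.get("content-range"))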
If we're just worried about the footer, then I think this PR will help with it: https://github.com/Eventual-Inc/Daft/pull/4775
Do you know when such requests are submitted? Are they done only for Parquet footer requests?
For Parquet specifically, we do a lot of range requests. Any projection pushdown will issue range requests, and reading the metadata does range requests as well.
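As an illustration with the small dataset that works above: selecting a single column triggers projection pushdown, so the scan fetches only that column's byte ranges (plus the footer) via ranged reads rather than whole files.

import daft

# Reading one column out of five: the scan issues range requests for
# the sepal_length column chunks and the footer only.
df = daft.read_parquet("hf://datasets/universalmind303/daft-docs")
df.select("sepal_length").show()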
This is still failing on some datasets, as seen in #4907:
df = daft.read_parquet("hf://datasets/stanfordnlp/snli/")
df.write_parquet("./out")
@rchowell it looks like it's failing on collect/write because when the dataframe is collected, the URLs are resolved to concrete paths such as https://huggingface.co/api/datasets/stanfordnlp/snli/parquet/plain_text/test/0.parquet. So the root issue of Hugging Face producing invalid cache entries is still the same, but our workaround doesn't work on write because write_parquet uses the resolved URLs and, as a result, goes through the HttpSource instead of the HFSource.
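A sketch of the distinction (the https:// URL is the resolved path from above; reading it directly goes through the generic HTTP path, which is effectively what the write does after resolution):

import daft

# Via the hf:// scheme: handled by the Hugging Face-aware source,
# where the footer workaround applies.
daft.read_parquet("hf://datasets/stanfordnlp/snli/").show()

# Via the resolved URL: a plain HTTP read, bypassing the hf:// handling.
daft.read_parquet(
    "https://huggingface.co/api/datasets/stanfordnlp/snli/parquet/plain_text/test/0.parquet"
).show()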
UPDATE: it looks like the Hugging Face team has temporarily disabled the caching rule that was causing this issue.