Panic when running `create table clickbench.hits as select * from 'https://datasets.clickhouse.com/hits_compatible/hits.parquet';`
Description
> create table clickbench.hits as select * from 'https://datasets.clickhouse.com/hits_compatible/hits.parquet';
Error: External error: Exec error: Remote session error: unable to dispatch table access: status: Cancelled, message: "h2 protocol error: http2 error: stream error received: stream no longer needed", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Thu, 14 Sep 2023 14:59:08 GMT", "content-length": "0"} }
Logged error:
{
  "textPayload": "thread 'server-thread-13' panicked at 'slice index starts at 18446744073707112300 but ends at 14779976438', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/datafusion-30.0.0/src/datasource/file_format/parquet.rs:467:14",
  "insertId": "9ixrs056pk678dvt",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "namespace_name": "qa",
      "location": "us-central1",
      "project_id": "glaredb-dev-playground",
      "cluster_name": "glaredb-gke-public",
      "container_name": "glaredb",
      "pod_name": "glaredb-8ea6a5f9-82a5-4584-bbee-3bc4f75bccdc-57974c46fd-xptzx"
    }
  },
  "timestamp": "2023-09-14T14:59:07.282871522Z",
  "severity": "ERROR",
  "labels": {
    "k8s-pod/type": "glaredb",
    "k8s-pod/id": "8ea6a5f9-82a5-4584-bbee-3bc4f75bccdc",
    "k8s-pod/pod-template-hash": "57974c46fd",
    "k8s-pod/organization": "14f8e619-c760-4edb-9021-e2f11b8f8150",
    "k8s-pod/name": "glaredb-8ea6a5f9-82a5-4584-bbee-3bc4f75bccdc",
    "compute.googleapis.com/resource_name": "gk3-glaredb-gke-public-nap-765xd42g-e66161ba-moe1"
  },
  "logName": "projects/glaredb-dev-playground/logs/stderr",
  "receiveTimestamp": "2023-09-14T14:59:08.688448716Z"
}
This could be an issue in DataFusion's fetch_parquet_metadata: https://github.com/apache/arrow-datafusion/blob/58fc80eddaf93d4c6399e4ac4fc1234649b26e43/datafusion/core/src/datasource/file_format/parquet.rs#L408-L470
Or an issue with how we construct object meta for http sources: https://github.com/glaredb/glaredb/blob/9e0e4b3dd8b196d151c5515cb58fe91273fa4f91/crates/datasources/src/object_store/http.rs#L64-L84
@scsmithr does this only fail when connected to cloud? I can run the query locally (albeit very slowly).
The size we set in the ObjectMeta from content_length seems to be wrong.
From the above error log ("slice index starts at 18446744073707112300 but ends at 14779976438"), 14779976438 is the size we are setting in the ObjectMeta for this file. content_length can differ from the actual object size in cases where compression is used.
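To make that caveat concrete, here is a hypothetical sketch (reqwest-based, not the actual http.rs code linked above) of deriving an object size from a HEAD response, refusing to trust Content-Length when Content-Encoding is present:

```rust
use reqwest::header::{CONTENT_ENCODING, CONTENT_LENGTH};

// Hypothetical helper: derive an object size from a HEAD response. Content-Length
// describes the encoded transfer, so if Content-Encoding is set it need not equal
// the byte length the parquet reader will actually see.
async fn object_size_from_head(client: &reqwest::Client, url: &str) -> Result<u64, String> {
    let resp = client.head(url).send().await.map_err(|e| e.to_string())?;

    if resp.headers().contains_key(CONTENT_ENCODING) {
        return Err("Content-Encoding present; Content-Length is not the object size".to_string());
    }

    resp.headers()
        .get(CONTENT_LENGTH)
        .and_then(|v| v.to_str().ok())
        .and_then(|v| v.parse::<u64>().ok())
        .ok_or_else(|| "missing or unparsable Content-Length".to_string())
}
```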
Were you able to reproduce the panic? I tried on cloud, and only get OOM. Running locally gets me:
Error: External error: Parquet error: Parquet error: Could not parse metadata: protocol error
> content_length seems to be wrong

What's the content length we're getting back? Does that match the size of the file on disk? There's no compression being used here, since it's just a binary stream.
If using content length was wrong (in general), queries like select * from 'https://github.com/GlareDB/glaredb/raw/main/testdata/parquet/userdata1.parquet'; wouldn't work.
Also 18446744073707112300 indicates an underflow.
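To spell out the arithmetic (nothing GlareDB-specific, just how the panic values decompose, assuming the ~14.78 GB content length reported below):

```rust
fn main() {
    // Slice start and end from the panic message.
    let start: u64 = 18_446_744_073_707_112_300;
    let end: u64 = 14_779_976_438;

    // Distance "below zero": the start is 2_439_316 short of wrapping all the
    // way around, i.e. an unsigned subtraction went negative by 2_439_316.
    assert_eq!((u64::MAX - start) + 1, 2_439_316);
    assert_eq!(0u64.wrapping_sub(2_439_316), start);

    // The slice end is the content length minus the 8-byte parquet footer
    // (content length 14_779_976_446 is reported further down the thread).
    assert_eq!(14_779_976_446u64 - 8, end);

    println!("underflow of 2_439_316 bytes; slice {start}..{end}");
}
```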
Content length (through wget): 14779976446
Size on disk: 14779976446
I get the same error locally. Also content_length is 14779976446.
Still trying to fully reproduce it; it takes around 20-25 mins to run once.
The file is approx 15 GB, so that all adds up with respect to size.
Looking at how that underflow is generated and used now.
> it takes around 20-25 mins to run once.

(Aside: Do we have a way to set a timeout for statements, something like https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-STATEMENT-TIMEOUT)
It seems to me, based on the message and looking at the implementation of dispatch_access (and the discussion here!), that "we have a big file that's slow to download, it hits a timeout, and we have [less than the actual size of the file, but we read the metadata before so we know how long it should be]," and rather than catch the error at that point ~~we attempt to read the parquet file, and end up with an underflow/location that doesn't exist.~~
The panic at datafusion/src/datasource/file_format/parquet.rs:467:14 is where DataFusion is trying to get the parquet metadata from the end of the file, which we haven't gotten yet (presumably because it times out). https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/file_format/parquet.rs#L455-L474 -- the panic is in the else statement, because [reason], presumably because we have some data (enough to get to the else branch) but not all the data (for this to be correct?). If this is the case, there's probably some place where we start reading the file before we have confirmed that we haven't hit a timeout. Not sure I know enough about the specific plumbing to find the case where this would be.
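For context, here is a simplified reconstruction (illustrative only, not the exact DataFusion code) of that footer-handling step, written with a checked subtraction so that a buffer that doesn't actually contain what the footer claims becomes an error instead of a slice-index panic:

```rust
const FOOTER_SIZE: usize = 8;

/// Parquet footer: 4-byte little-endian metadata length followed by the "PAR1" magic.
fn decode_footer_len(footer: &[u8; FOOTER_SIZE]) -> usize {
    u32::from_le_bytes(footer[0..4].try_into().unwrap()) as usize
}

/// Slice the metadata out of a fetched suffix of the file, refusing to slice if
/// the buffer cannot possibly contain it (e.g. the range request was not honored
/// and we are holding the wrong bytes).
fn metadata_bytes(suffix: &[u8]) -> Result<&[u8], String> {
    if suffix.len() < FOOTER_SIZE {
        return Err("buffer too small to contain a parquet footer".to_string());
    }
    let footer: [u8; FOOTER_SIZE] = suffix[suffix.len() - FOOTER_SIZE..].try_into().unwrap();
    let metadata_len = decode_footer_len(&footer);

    // Defensive check: only slice if the buffer really holds the metadata.
    let metadata_start = (suffix.len() - FOOTER_SIZE)
        .checked_sub(metadata_len)
        .ok_or_else(|| {
            format!(
                "metadata length {metadata_len} exceeds the {} bytes fetched",
                suffix.len()
            )
        })?;
    Ok(&suffix[metadata_start..suffix.len() - FOOTER_SIZE])
}
```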
I spent a little (but not very much) time poking through datafusion, and it seems to me that:
- the DF code could be a little more defensive and make sure that the metadata it was trying to read was valid before it attempted to slice things (maybe reasonable, maybe hard to do perfectly.)
- I could be wrong about the timeout (the error in the original gives us a hard-to-understand stack of errors, and I'm not sure if the timeout/cancel happens at the gRPC level and is propagated down to the objectstore/df components, or happens at the objectstore/df level and is propagated up to gRPC).
- https://github.com/apache/arrow-rs/blob/master/object_store/src/lib.rs#L361-L365 <- this makes it seem like we're using a non-atomic handler in our HTTP table provider (but this might be a distraction?)
I think we could probably reproduce this on a shorter loop by artificially lowering timeouts in a test environment, or (potentially) by constructing a file with a specific kind of corruption. Some theories could probably be validated/rejected that way.
~~bug is in datafusion, where it tries to infer the schema and panics when it fails to slice it with that underflow.~~
The bug is here: https://github.com/apache/arrow-rs/blob/f0455d12ddcb174f1f8d2bbfd5874f7b708c9a74/object_store/src/lib.rs#L780C5-L809
Instead of returning the correct range, it returns the whole object, so we will need to push the fix upstream for this one.
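Until a fix lands upstream, a caller-side sanity check is cheap. A sketch (hypothetical helper, not an existing object_store API) of validating that a ranged read returned exactly the bytes that were requested:

```rust
use std::ops::Range;

// Illustrative guard: compare the number of bytes a ranged read returned against
// the range that was requested, and refuse to continue on a mismatch.
fn validate_range(requested: &Range<u64>, got_len: u64) -> Result<(), String> {
    let expected = requested.end - requested.start;
    if got_len != expected {
        return Err(format!(
            "range {}..{} not honored: expected {expected} bytes, got {got_len} \
             (a 200 response carrying the whole object looks exactly like this)",
            requested.start, requested.end
        ));
    }
    Ok(())
}
```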
@adhish20 @scsmithr FYI, there were some changes upstream in object_store that makes this specific query flaky.
https://datasets.clickhouse.com/hits_compatible/hits.parquet sometimes fulfills a range request, other times it does not. It seems to be non-deterministic to when it will & won't. Object store now errors if the range request was not fulfilled.
Ah yes, I was able to reproduce and see the non-deterministic behaviour by pulling in your change that throws on a range query if the server can't fulfil it.
@universalmind303 Should we mark this as closed then?
> Ah yes, I was able to reproduce and see the non-deterministic behaviour by pulling in your change that throws on a range query if the server can't fulfil it.

Specifically, this block of code: if it returns a 200 instead of the expected 206, object store errors out.
https://github.com/apache/arrow-rs/blob/f0455d12ddcb174f1f8d2bbfd5874f7b708c9a74/object_store/src/http/client.rs#L320C1-L328C10
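For anyone who wants to watch the non-determinism directly, here is a small standalone probe (reqwest + tokio, not part of the codebase) that issues a ranged GET for the footer and reports whether the server honored it:

```rust
// Probe whether the server honors Range requests right now.
// Run it a few times: per the discussion above, the answer varies.
#[tokio::main]
async fn main() -> Result<(), reqwest::Error> {
    let url = "https://datasets.clickhouse.com/hits_compatible/hits.parquet";
    let resp = reqwest::Client::new()
        .get(url)
        // Ask for only the last 8 bytes (the parquet footer).
        .header(reqwest::header::RANGE, "bytes=-8")
        .send()
        .await?;

    match resp.status().as_u16() {
        206 => println!("range honored: got {} bytes", resp.bytes().await?.len()),
        200 => println!("range ignored: server is sending the whole object"),
        other => println!("unexpected status: {other}"),
    }
    Ok(())
}
```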
As discussed on the dev call today:
- [ ] turn this panic into an error message: "upstream http service returned unexpected results, contact your provider"
- [ ] retry the operation a couple of times [hypothesis is that caching/CDNs mean that sometimes things behave weirdly, and retries will fix this test and probably most other "200 not 206" cases] -- see the sketch below
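A rough sketch of what that retry wrapper could look like (illustrative names, tokio-based, not an existing GlareDB API): retry a fallible async operation a few times with a short backoff, then surface the suggested user-facing error message.

```rust
use std::time::Duration;

// Retry `op` up to `attempts` times; on repeated failure, return the error
// message proposed in the checklist above, with the last underlying error.
async fn with_retries<T, E, F, Fut>(attempts: u32, mut op: F) -> Result<T, String>
where
    F: FnMut() -> Fut,
    Fut: std::future::Future<Output = Result<T, E>>,
    E: std::fmt::Display,
{
    let mut last_err = String::new();
    for attempt in 1..=attempts {
        match op().await {
            Ok(v) => return Ok(v),
            Err(e) => {
                last_err = e.to_string();
                // Brief backoff; the hypothesis above is that a CDN/cache hiccup
                // may clear up on the next attempt.
                tokio::time::sleep(Duration::from_millis(250 * u64::from(attempt))).await;
            }
        }
    }
    Err(format!(
        "upstream http service returned unexpected results, contact your provider (last error: {last_err})"
    ))
}
```

The attempt count and backoff are placeholders; the ranged read (or the whole metadata fetch) would be the closure passed in.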