
Panic when running `create table clickbench.hits as select * from 'https://datasets.clickhouse.com/hits_compatible/hits.parquet';`

Open scsmithr opened this issue 2 years ago • 16 comments

Description

> create table clickbench.hits as select * from 'https://datasets.clickhouse.com/hits_compatible/hits.parquet';
Error: External error: Exec error: Remote session error: unable to dispatch table access: status: Cancelled, message: "h2 protocol error: http2 error: stream error received: stream no longer needed", details: [], metadata: MetadataMap { headers: {"content-type": "application/grpc", "date": "Thu, 14 Sep 2023 14:59:08 GMT", "content-length": "0"} }

Logged error:

{
  "textPayload": "thread 'server-thread-13' panicked at 'slice index starts at 18446744073707112300 but ends at 14779976438', /usr/local/cargo/registry/src/index.crates.io-6f17d22bba15001f/datafusion-30.0.0/src/datasource/file_format/parquet.rs:467:14",
  "insertId": "9ixrs056pk678dvt",
  "resource": {
    "type": "k8s_container",
    "labels": {
      "namespace_name": "qa",
      "location": "us-central1",
      "project_id": "glaredb-dev-playground",
      "cluster_name": "glaredb-gke-public",
      "container_name": "glaredb",
      "pod_name": "glaredb-8ea6a5f9-82a5-4584-bbee-3bc4f75bccdc-57974c46fd-xptzx"
    }
  },
  "timestamp": "2023-09-14T14:59:07.282871522Z",
  "severity": "ERROR",
  "labels": {
    "k8s-pod/type": "glaredb",
    "k8s-pod/id": "8ea6a5f9-82a5-4584-bbee-3bc4f75bccdc",
    "k8s-pod/pod-template-hash": "57974c46fd",
    "k8s-pod/organization": "14f8e619-c760-4edb-9021-e2f11b8f8150",
    "k8s-pod/name": "glaredb-8ea6a5f9-82a5-4584-bbee-3bc4f75bccdc",
    "compute.googleapis.com/resource_name": "gk3-glaredb-gke-public-nap-765xd42g-e66161ba-moe1"
  },
  "logName": "projects/glaredb-dev-playground/logs/stderr",
  "receiveTimestamp": "2023-09-14T14:59:08.688448716Z"
}

scsmithr avatar Sep 14 '23 15:09 scsmithr

This could be an issue in datafusion's fetch_parquet_metadata: https://github.com/apache/arrow-datafusion/blob/58fc80eddaf93d4c6399e4ac4fc1234649b26e43/datafusion/core/src/datasource/file_format/parquet.rs#L408-L470

Or an issue with how we construct object meta for http sources: https://github.com/glaredb/glaredb/blob/9e0e4b3dd8b196d151c5515cb58fe91273fa4f91/crates/datasources/src/object_store/http.rs#L64-L84

scsmithr avatar Sep 14 '23 15:09 scsmithr

@scsmithr does this only fail when connected to cloud? I can run the query locally (albeit very slowly).

universalmind303 avatar Sep 19 '23 16:09 universalmind303

The size we are setting in ObjectMeta from the content_length seems to be wrong.

From the above error log, "slice index starts at 18446744073707112300 but ends at 14779976438", the 14779976438 is the size we are setting in the ObjectMeta for this file; content_length can be different in cases where compression is used.
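For reference, here's a minimal sketch of how a size for an HTTP source can be derived from a HEAD request's Content-Length (the function name and error handling are illustrative, not GlareDB's actual code). If the server reports a length that differs from the bytes it actually serves, every range computed from that size is off:

```rust
// Hypothetical sketch: the object size for an HTTP source taken straight from
// the Content-Length header of a HEAD response.
use reqwest::blocking::Client;

fn remote_size(url: &str) -> anyhow::Result<u64> {
    let resp = Client::new().head(url).send()?;
    resp.headers()
        .get(reqwest::header::CONTENT_LENGTH)
        .and_then(|v| v.to_str().ok())
        .and_then(|v| v.parse::<u64>().ok())
        .ok_or_else(|| anyhow::anyhow!("server did not report a usable Content-Length"))
}
```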

adhish20 avatar Oct 03 '23 14:10 adhish20

Were you able to reproduce the panic? I tried on cloud, and only get OOM. Running locally gets me:

Error: External error: Parquet error: Parquet error: Could not parse metadata: protocol error

> content_length seems to be wrong

What's the content length we're getting back? Does that match the size of the file on disk? There's no compression being used here, since it's just a binary stream.

If using content length was wrong (in general), queries like select * from 'https://github.com/GlareDB/glaredb/raw/main/testdata/parquet/userdata1.parquet'; wouldn't work.

Also 18446744073707112300 indicates an underflow.
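For what it's worth, the number lines up with a wrapped 64-bit subtraction: 2^64 - 18446744073707112300 = 2,439,316, i.e. the slice start was computed by subtracting a length roughly 2.4 MB larger than the offset it was subtracted from. A small sketch (the operand values are illustrative, picked only to reproduce the number from the panic):

```rust
// On a 64-bit target an unchecked `a - b` with b > a wraps around in release
// builds, producing a huge "slice index starts at ..." value like the one above.
fn main() {
    let available: u64 = 14_779_976_438; // end of the slice from the panic message
    let claimed_len: u64 = 14_782_415_754; // hypothetical bogus length, 2,439,316 larger
    let wrapped_start = available.wrapping_sub(claimed_len);
    println!("{wrapped_start}"); // 18446744073707112300
    assert_eq!(u64::MAX - wrapped_start + 1, claimed_len - available);
}
```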

scsmithr avatar Oct 03 '23 14:10 scsmithr

Content length (through wget): 14779976446
Size on disk: 14779976446

scsmithr avatar Oct 03 '23 14:10 scsmithr

I get the same error locally. Also content_length is 14779976446.

Still trying to fully reproduce it; it takes around 20-25 mins to run once.

adhish20 avatar Oct 03 '23 14:10 adhish20

The size of the file is approximately 15 GB, so that all adds up with respect to size.

Looking at how that underflow is generated and how it's being used now.

adhish20 avatar Oct 03 '23 14:10 adhish20

> it takes around 20-25 mins to run once.

(Aside: Do we have a way to set a timeout for statements, something like https://www.postgresql.org/docs/current/runtime-config-client.html#GUC-STATEMENT-TIMEOUT)

greyscaled avatar Oct 03 '23 15:10 greyscaled

It seems to me, based on the message and looking at the implementation of dispatch_access (and the discussion here!), that we have a big file that's slow to download, it hits a timeout, and we end up with less than the actual size of the file (though we read the metadata before, so we know how long it should be), and rather than catch the error at that point ~~we attempt to read the parquet file and end up with an underflow/location that doesn't exist.~~

The panic at datafusion/src/datasource/file_format/parquet.rs:467:14 is where datafusion is trying to get the parquet metadata from the end of the file, which we haven't gotten yet (presumably because it times out): https://github.com/apache/arrow-datafusion/blob/main/datafusion/core/src/datasource/file_format/parquet.rs#L455-L474. The panic is in the else branch, because [reason], presumably because we have some data (enough to get into the else) but not all of the data (for this to be correct?). If that's the case, there's probably some place where we start reading the file before we have confirmed that we haven't hit a timeout. I'm not sure I know enough about the specific plumbing to find where this would be.
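For context, a rough sketch of the shape of that footer read (simplified, not datafusion's actual code): the last 8 bytes of a parquet file are a 4-byte little-endian metadata length plus the "PAR1" magic, and the metadata is expected to sit immediately before them, so a buffer holding less data than the footer claims turns directly into an out-of-range slice:

```rust
// Simplified sketch of parquet footer handling over an in-memory suffix of the file.
fn metadata_slice(suffix: &[u8]) -> Result<&[u8], String> {
    if suffix.len() < 8 || !suffix.ends_with(b"PAR1") {
        return Err("buffer too short or missing the PAR1 magic".to_string());
    }
    let footer = &suffix[suffix.len() - 8..];
    let metadata_len = u32::from_le_bytes(footer[..4].try_into().unwrap()) as usize;
    let metadata_end = suffix.len() - 8;
    // Without this check, `metadata_end - metadata_len` underflows when we hold
    // less data than the footer claims, and slicing panics with a
    // "slice index starts at ... but ends at ..." message like the one logged above.
    let metadata_start = metadata_end
        .checked_sub(metadata_len)
        .ok_or_else(|| "footer claims more metadata than the buffer holds".to_string())?;
    Ok(&suffix[metadata_start..metadata_end])
}
```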

I spent a little (but not very much) time poking through datafusion, and it seems to me that:

  • the DF code could be a little more defensive and make sure that the metadata it was trying to read was valid before it attempted to slice things (maybe reasonable, maybe hard to do perfectly.)
  • I could be wrong about the timeout (the error in the original gives us a hard-to-understand stack of errors, and I'm not sure if the timeout/cancel is in the gRPC level (and propagated down to the objectstore/df components) or something that happens at the objectstore/df level and is propagated up to grpc.)
  • https://github.com/apache/arrow-rs/blob/master/object_store/src/lib.rs#L361-L365 <- this makes it seem like we're using a non-atomic handler in our HTTP table provider (but this might be a distraction?)

I think we could probably reproduce this on a shorter loop by artificially lowering timeouts in a test environment, or (potentially) by constructing a file with a specific kind of corruption. Some theories could probably be validated or rejected that way.

tychoish avatar Oct 03 '23 17:10 tychoish

~~The bug is in datafusion, where it tries to infer the schema and panics when it fails to slice with that underflowed index.~~

adhish20 avatar Oct 04 '23 12:10 adhish20

The bug is here: https://github.com/apache/arrow-rs/blob/f0455d12ddcb174f1f8d2bbfd5874f7b708c9a74/object_store/src/lib.rs#L780C5-L809

Instead of returning the correct range, it returns the whole object, so we will need to push a fix upstream for this one.

adhish20 avatar Oct 04 '23 17:10 adhish20

@adhish20 @scsmithr FYI, there were some changes upstream in object_store that make this specific query flaky.

https://datasets.clickhouse.com/hits_compatible/hits.parquet sometimes fulfills a range request, other times it does not. It seems to be non-deterministic as to when it will and won't. object_store now errors if the range request was not fulfilled.

universalmind303 avatar Oct 05 '23 15:10 universalmind303

> @adhish20 @scsmithr FYI, there were some changes upstream in object_store that make this specific query flaky.
>
> https://datasets.clickhouse.com/hits_compatible/hits.parquet sometimes fulfills a range request, other times it does not. It seems to be non-deterministic as to when it will and won't. object_store now errors if the range request was not fulfilled.

Ah yes, I was able to reproduce and see the non-deterministic behaviour by pulling in your change that throws on a range query if the server can't fulfil it.

adhish20 avatar Oct 05 '23 15:10 adhish20

@universalmind303 Should we mark this as closed then?

adhish20 avatar Oct 05 '23 15:10 adhish20

> Ah yes, I was able to reproduce and see the non-deterministic behaviour by pulling in your change that throws on a range query if the server can't fulfil it.

Specifically, this block of code: if it returns a 200 instead of the expected 206, object_store errors out.

https://github.com/apache/arrow-rs/blob/f0455d12ddcb174f1f8d2bbfd5874f7b708c9a74/object_store/src/http/client.rs#L320C1-L328C10
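Roughly what that check amounts to, sketched outside of object_store with placeholder names: a range GET should come back as 206 Partial Content, and a 200 means the server ignored the Range header and sent the whole object:

```rust
use reqwest::blocking::Client;
use reqwest::StatusCode;

// Hypothetical helper: fetch an inclusive byte range and refuse to treat a
// whole-object (200) response as if it were the requested range.
fn get_range(url: &str, start: u64, end_inclusive: u64) -> anyhow::Result<Vec<u8>> {
    let resp = Client::new()
        .get(url)
        .header(reqwest::header::RANGE, format!("bytes={start}-{end_inclusive}"))
        .send()?;
    match resp.status() {
        StatusCode::PARTIAL_CONTENT => Ok(resp.bytes()?.to_vec()),
        StatusCode::OK => Err(anyhow::anyhow!("server ignored the Range header")),
        other => Err(anyhow::anyhow!("unexpected status {other}")),
    }
}
```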

universalmind303 avatar Oct 05 '23 15:10 universalmind303

As discussed on the dev call today:

  • [ ] turn this panic into an error message: "upstream http service returned unexpected results, contact your provider"
  • [ ] retry the operation a couple of times [the hypothesis is that caching/CDNs mean that sometimes things behave weirdly, and retries will fix this test and probably most other "200 not 206" failures] (a rough retry sketch follows below)
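A minimal sketch of the retry idea, assuming the flaky failure surfaces as an Err from the read; the helper name and the attempt count handling are illustrative, not the agreed implementation:

```rust
// Retry a fallible operation a few times, returning the last error if all
// attempts fail. Intended for transient "200 not 206" style failures.
fn with_retries<T, E: std::fmt::Display>(
    attempts: usize,
    mut op: impl FnMut() -> Result<T, E>,
) -> Result<T, E> {
    let mut last_err = None;
    for attempt in 1..=attempts {
        match op() {
            Ok(v) => return Ok(v),
            Err(e) => {
                eprintln!("attempt {attempt}/{attempts} failed: {e}");
                last_err = Some(e);
            }
        }
    }
    Err(last_err.expect("attempts must be at least 1"))
}
```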

tychoish avatar Oct 09 '23 15:10 tychoish