bug: SQL Error [1046]: Query failed (#): Invalid Parquet file. Size is smaller than footer.

Open zhicwu opened this issue 1 year ago • 4 comments

Search before asking

  • [X] I had searched in the issues and found no similar issues.

Version

nightly

What's Wrong?

select * from 'https://domain.name/test.parquet' fails with the error below. The same query works on both DuckDB and ClickHouse.

SQL Error [1046]: Query failed (#): Invalid Parquet file. Size is smaller than footer.

How to Reproduce?

Issue the query select * from 'https://domain.name/test.parquet' using the latest JDBC driver against a nightly build. Make sure the web server responds to HEAD requests with only a 200 status (no headers such as Content-Length):

curl -I -v 'https://domain.name/test.parquet'
...
> HEAD /test.parquet HTTP/1.1
> User-Agent: curl/7.29.0
> Host: domain.name
> Accept: */*
> 
< HTTP/1.1 200 OK
HTTP/1.1 200 OK
< Date: Wed, 06 Mar 2024 07:48:06 GMT
Date: Wed, 06 Mar 2024 07:48:06 GMT

FYI, https://domain.name/test.parquet is NOT a static file. The content is generated for each GET request, backed by a short-lived cache. Would be great if Databend could still query a Parquet file without knowing its size in advance.

Are you willing to submit PR?

  • [ ] Yes I am willing to submit a PR!

zhicwu avatar Mar 06 '24 08:03 zhicwu

Would be great if Databend could still query a Parquet file without knowing its size in advance.

Currently, selecting from a URI depends on the Content-Length in the response.

sundy-li avatar Mar 10 '24 14:03 sundy-li

2 choices:

  1. we treat HTTP specially and read it as a stream
  2. report an error when there is no Content-Length header, but we are not sure about it with the opendal interface @Xuanwo
    pub fn content_length(&self) -> u64 {
        debug_assert!(
            self.metakey.contains(Metakey::ContentLength)
                || self.metakey.contains(Metakey::Complete),
            "visiting not set metadata: content_length, maybe a bug"
        );

        self.content_length.unwrap_or_default()
    }

youngsofun avatar May 31 '24 04:05 youngsofun

We can't support reading Parquet without knowing its length, since we have to read from the end to get its metadata.

Xuanwo avatar May 31 '24 07:05 Xuanwo

We can't support reading Parquet without knowing its length, since we have to read from the end to get its metadata.

yes, although we can read the whole file into memory first

but maybe we do not want that in copy, and it requires some changes.

and for querying a stage, we currently need to read the schema for binding

youngsofun avatar May 31 '24 08:05 youngsofun
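
For readers following the thread, here is a minimal Rust sketch of why the total size has to be known. A Parquet file ends with an 8-byte footer: a 4-byte little-endian metadata length followed by the magic bytes PAR1, so a reader finds the metadata by counting back from the end of the file. The function name metadata_range and the second and third error strings are assumptions made for illustration; this is not Databend or opendal code. Since content_length() quoted above falls back to 0 when the value is absent, a size of 0 would fail the first check here, which may be how the reported error arises.

```rust
/// Size of the Parquet footer: 4-byte metadata length + 4-byte magic.
const FOOTER_SIZE: u64 = 8;
/// Magic bytes that close every Parquet file.
const MAGIC: &[u8; 4] = b"PAR1";

/// Hypothetical helper (not a Databend or opendal API).
/// Given the total file size and the last 8 bytes of the file, returns the
/// half-open byte range [start, end) that holds the file metadata.
fn metadata_range(file_size: u64, tail: &[u8; 8]) -> Result<(u64, u64), String> {
    // Without a usable size (for example 0 from a missing Content-Length),
    // there is no way to locate the footer at all.
    if file_size < FOOTER_SIZE {
        return Err("Invalid Parquet file. Size is smaller than footer.".to_string());
    }
    // The last four bytes must be the magic marker.
    if tail[4..] != MAGIC[..] {
        return Err("Invalid Parquet file. Corrupt footer magic.".to_string());
    }
    // The four bytes before the magic give the metadata length (little-endian).
    let meta_len = u32::from_le_bytes([tail[0], tail[1], tail[2], tail[3]]) as u64;
    if meta_len + FOOTER_SIZE > file_size {
        return Err("Invalid Parquet file. Metadata is larger than the file.".to_string());
    }
    // The metadata sits immediately before the footer.
    let end = file_size - FOOTER_SIZE;
    Ok((end - meta_len, end))
}
```

Both the offset of the tail and the start of the metadata are computed from file_size, so a server that omits Content-Length leaves the reader with nothing to seek against unless the whole body is buffered first, as suggested in the last comment.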