bug: SQL Error [1046]: Query failed (#): Invalid Parquet file. Size is smaller than footer.
Search before asking
- [X] I had searched in the issues and found no similar issues.
Version
nightly
What's Wrong?
select * from 'https://domain.name/test.parquet' ends up with the error below. The same query works fine on both DuckDB and ClickHouse.
SQL Error [1046]: Query failed (#): Invalid Parquet file. Size is smaller than footer.
How to Reproduce?
Issue the query select * from 'https://domain.name/test.parquet' using the latest JDBC driver against a nightly build. Make sure the web server responds to HEAD requests with only a 200 status (without headers such as Content-Length):
curl -I -v 'https://domain.name/test.parquet'
...
> HEAD /test.parquet HTTP/1.1
> User-Agent: curl/7.29.0
> Host: domain.name
> Accept: */*
>
< HTTP/1.1 200 OK
< Date: Wed, 06 Mar 2024 07:48:06 GMT
FYI, https://domain.name/test.parquet is NOT a static file: the content is generated for each GET request and backed by a short-lived cache. It would be great if Databend could still query a Parquet file without knowing its size in advance.
Are you willing to submit PR?
- [ ] Yes I am willing to submit a PR!
> Would be great if Databend can still query parquet file without knowing its size in advance.
Currently, selecting from a URI depends on the Content-Length response header.
There are 2 choices:
- treat HTTP specially and read it as a stream
- report an error when there is no Content-Length header, but we are not sure whether that can be detected through the OpenDAL interface @Xuanwo
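A minimal sketch of the first option (function name and shape are hypothetical, not Databend's actual code): when the source reports no known length, drain the whole stream into memory so that footer-relative reads become possible.

```rust
use std::io::Read;

/// Hypothetical fallback: when the remote source has no known length,
/// buffer the entire body into memory. This makes end-of-file reads
/// (as Parquet's footer requires) possible at the cost of memory.
fn buffer_if_unsized<R: Read>(mut reader: R, known_len: Option<u64>) -> std::io::Result<Vec<u8>> {
    let mut buf = match known_len {
        // Pre-allocate when the size is known, e.g. from Content-Length.
        Some(n) => Vec::with_capacity(n as usize),
        None => Vec::new(),
    };
    reader.read_to_end(&mut buf)?;
    Ok(buf)
}

fn main() -> std::io::Result<()> {
    // Any `Read` works; a byte slice stands in for an HTTP body stream.
    let body: &[u8] = b"PAR1...fake parquet bytes...PAR1";
    let buf = buffer_if_unsized(body, None)?;
    assert_eq!(buf.len(), body.len());
    println!("buffered {} bytes", buf.len());
    Ok(())
}
```

The trade-off is that an unbounded response must fit in memory, which is why the comments below hesitate to do this for COPY.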
pub fn content_length(&self) -> u64 {
    debug_assert!(
        self.metakey.contains(Metakey::ContentLength)
            || self.metakey.contains(Metakey::Complete),
        "visiting not set metadata: content_length, maybe a bug"
    );
    self.content_length.unwrap_or_default()
}
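Note that in release builds the debug_assert! above is compiled out, and unwrap_or_default() silently turns a missing length into 0, which would plausibly trip a "smaller than footer" size check downstream. A minimal illustration of that fallback behavior:

```rust
fn main() {
    // Mirrors the unwrap_or_default() fallback in the accessor above:
    // an unpopulated Content-Length surfaces as a length of 0.
    let content_length: Option<u64> = None;
    let reported = content_length.unwrap_or_default();
    assert_eq!(reported, 0);
    // Any check like "file must be at least 8 bytes (the Parquet trailer)"
    // then fails, even though the real file may be perfectly valid.
    assert!(reported < 8);
    println!("reported length: {}", reported);
}
```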
We can't support reading Parquet without knowing its length, since we must read from the end of the file to get its metadata.
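Why the length is required: a Parquet file ends with its metadata, followed by a 4-byte little-endian metadata length and the magic bytes "PAR1". Locating the metadata therefore means seeking relative to the end of the file, which is impossible without the total size. A self-contained sketch (the helper name is made up for illustration):

```rust
/// Given the total file length and the last 8 bytes of the file,
/// compute where the footer metadata starts and how long it is.
/// Parquet trailer layout: [metadata][4-byte LE metadata length]["PAR1"].
fn footer_metadata_range(file_len: u64, trailer: &[u8; 8]) -> Result<(u64, u64), String> {
    if &trailer[4..] != b"PAR1" {
        return Err("not a parquet file: bad magic".into());
    }
    let meta_len = u32::from_le_bytes(trailer[..4].try_into().unwrap()) as u64;
    // The 8-byte trailer plus the metadata must fit inside the file.
    if file_len < meta_len + 8 {
        return Err("Invalid Parquet file. Size is smaller than footer.".into());
    }
    // Metadata occupies [file_len - 8 - meta_len, file_len - 8).
    Ok((file_len - 8 - meta_len, meta_len))
}

fn main() {
    // A trailer claiming 100 bytes of metadata, with a valid magic.
    let mut trailer = [0u8; 8];
    trailer[..4].copy_from_slice(&100u32.to_le_bytes());
    trailer[4..].copy_from_slice(b"PAR1");
    assert_eq!(footer_metadata_range(1_000, &trailer), Ok((892, 100)));
    // A file "shorter" than its own footer reproduces the reported error,
    // which is also what happens if the size is wrongly reported as 0.
    assert!(footer_metadata_range(50, &trailer).is_err());
    println!("ok");
}
```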
> We can't support reading parquet without knowing its length since we should read from the end to get its metadata.
Yes, although we could read the whole file into memory first. We may not want that in COPY, though, and it would require some changes.
And for querying a stage, we currently need to read the schema for binding.