[Java] IO Error: Connection error for HTTP HEAD
What happens?
COPY (
    SELECT * FROM read_json_auto(${s3_src_file}, format='newline_delimited')
) TO ${s3_dst_file} (FORMAT PARQUET);
The query runs fine in the DuckDB CLI and in a demo that uses the same API as the program that has issues. In our own program, we use the AWS Java API to upload files, and DuckDB is used to perform the row-to-column conversion (newline-delimited JSON to Parquet). It reports the following exception.
Caused by: java.sql.SQLException: Invalid Error: IO Error: Connection error for HTTP HEAD to '<src>.json'
at org.duckdb.DuckDBNative.duckdb_jdbc_prepare(Native Method)
at org.duckdb.DuckDBPreparedStatement.prepare(DuckDBPreparedStatement.java:106)
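For context, a minimal sketch of how such a COPY is typically issued through the DuckDB JDBC driver; the S3 settings, bucket names, and paths below are placeholders and assumptions, not the reporter's actual code:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class JsonToParquet {
    public static void main(String[] args) throws Exception {
        // Older driver versions may need explicit registration
        Class.forName("org.duckdb.DuckDBDriver");
        // In-memory DuckDB database via the JDBC driver
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            // httpfs provides the s3:// filesystem used by read_json_auto and COPY
            stmt.execute("INSTALL httpfs");
            stmt.execute("LOAD httpfs");
            // Placeholder region and credentials -- substitute your own
            stmt.execute("SET s3_region='us-east-1'");
            stmt.execute("SET s3_access_key_id='<key>'");
            stmt.execute("SET s3_secret_access_key='<secret>'");
            // Row-to-column conversion: newline-delimited JSON in, Parquet out
            stmt.execute(
                "COPY (SELECT * FROM read_json_auto('s3://<bucket>/<src>.json', " +
                "format='newline_delimited')) " +
                "TO 's3://<bucket>/<dst>.parquet' (FORMAT PARQUET)");
        }
    }
}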
To Reproduce
I'm sorry, it's quite hard to reproduce. The same command runs well in both the DuckDB CLI and our unit tests. The only similar issues I found are #9232 and #9647. A reasonable guess is that our S3 client occupies some connection resources and DuckDB reaches its limit.
OS:
macOS, Linux
DuckDB Version:
0.9.2
DuckDB Client:
java
Full Name:
yanhui chen
Affiliation:
ApeCloud
Have you tried this on the latest main branch?
I have tested with a main build
Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?
- [X] Yes, I have
Are there any options for me to control DuckDB's connection configuration?
@NewtonVan There's the keep-alive option which might work as a workaround (https://github.com/duckdb/duckdb/pull/9648). To use it, run SET http_keep_alive=false in DuckDB.
edit: I just realized this feature was not included in 0.9.2, so it is currently only available in the dev builds of DuckDB.
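For the Java client specifically, a minimal sketch of applying this workaround over JDBC, assuming a build that includes the setting:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class KeepAliveWorkaround {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            // Tell httpfs not to keep HTTP connections open between requests
            stmt.execute("SET http_keep_alive=false");
            // ... run the COPY / SELECT statements on this same connection ...
        }
    }
}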
@NewtonVan quick question: does the ${s3_src_file} contain a glob pattern? If so, how many files are there and how many rows per JSON file are there?
Facing the same issue when using Python
import duckdb
con = duckdb.connect()
con.install_extension("httpfs")
con.load_extension("httpfs")
con.execute("SELECT * FROM 'https://raw.githubusercontent.com/duckdb/duckdb-web/main/data/weather.csv';")
Failed after about 300 seconds of execution
IO Error: SSLConnection error for HTTP HEAD to 'https://raw.githubusercontent.com/duckdb/duckdb-web/main/data/weather.csv'
Python 3.10
duckdb 0.8.1
duckdb-engine 0.9.2
@samansmink http_keep_alive isn't available in the latest dev version either (in Python).
@samansmink It seems that the problem is bigger than that: the connections are not closed even when the connection object is garbage-collected in Python. This error persists even if a new DuckDB connection is created on every usage.
Another observation: this error has to do with the number of threads. On a thread that had previously created a (single) connection, a single file read (in my case using read_parquet) will not raise an exception, but reading multiple files will raise it.
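Restated as a hedged JDBC sketch (the paths are placeholders; the original observation was made with the Python client against GCS):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class SingleVsGlobRead {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            stmt.execute("INSTALL httpfs");
            stmt.execute("LOAD httpfs");
            // A single-file read on the reused connection reportedly succeeds...
            stmt.execute("SELECT count(*) FROM read_parquet('s3://<bucket>/one-file.parquet')");
            // ...while a multi-file (glob) read on the same connection triggers the HEAD error.
            stmt.execute("SELECT count(*) FROM read_parquet('s3://<bucket>/prefix/*.parquet')");
        }
    }
}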
@samansmink http_keep_alive isn't available in the latest dev version either (in Python).
could you try:
python3 -m pip install duckdb==0.9.3.dev2934
Python 3.11.6 (main, Oct 2 2023, 20:46:14) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
>>> import duckdb
>>> con = duckdb.connect()
>>> con.query('set http_keep_alive=false;')
>>>
sometimes the extensions for pre-releases aren't uploaded yet, but this should work
Could you provide us with a reproducible script? Right now I'm not really sure how to reproduce this. I tried running the Python query you provided for an extended time, but I could not get it to fail.
@samansmink I might try to reproduce it later. It happened inside a very complex system: inside a thread pool's operation (~30 threads), after reading ~100 parquet files from a GCS backend using httpfs.
We see this when trying to read a parquet glob from S3 too, so I don't think it's just JSON.
Facing the same issue when doing a read_parquet glob on a GCP bucket.
Folks, I was able to reproduce the issue when there are 1200+ parquet files in a hive-partitioned prefix, but not when there are 600 files. Note: I just copied the same file repeatedly, and even then it raises the same error. The file has 10K records and is 20 MB in size.
I have done a little more investigating and I can definitely reproduce this, but only when I increase the thread count to pretty high numbers (~300). I guess this is expected behaviour though, and not really a bug. Also, disabling http_keep_alive completely solves the issue, afaict.
If people have a reproduction of this with a relatively low number of files and threads, I'm very interested. Ideally, I would get:
- a data set which can easily trigger this
- the thread count required to trigger this
Otherwise some detailed info on the dataset is also helpful:
- in case of parquet especially: row_group_size, tuples per file, total files, number of columns
oh and please test on v0.10.0 :)
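For anyone assembling such a reproduction from the JDBC side, a hedged sketch along the lines described above; the ~300 thread count mirrors the comment, while the bucket, prefix, and file layout are assumptions:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class HttpHeadRepro {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            stmt.execute("INSTALL httpfs");
            stmt.execute("LOAD httpfs");
            stmt.execute("SET threads=300");                 // high internal thread count, as described above
            // stmt.execute("SET http_keep_alive=false");    // the workaround under discussion
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT count(*) FROM read_parquet('s3://<bucket>/<prefix>/*.parquet')")) {
                rs.next();
                System.out.println("rows: " + rs.getLong(1));
            }
        }
    }
}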
@mustafahasankhan do you know at what thread count this occurs, and whether set http_keep_alive=false; (which is now available) resolves the issue?
@samansmink I was able to reproduce it with a thread count of even 1 with 1200+ files. Yes, set http_keep_alive=false; seems to resolve the issue with v0.10.0.
@mustafahasankhan thanks, that's interesting, that definitely sounds like a bug.
Could you give me access to this dataset? If not, could you give me some more details with which I can reproduce this? You can ping me at [email protected]
For those using MinIO in Docker, use the port mapping 9000:9000 instead of 9000:9001.
docker run -p 9000:9000 -p 9090:9090 --name minio_name -e "MINIO_ROOT_USER=minio" -e "MINIO_ROOT_PASSWORD=minio123" -v ${HOME}/minio/data:/data quay.io/minio/minio server /data --console-address ":9090"
In my case it worked.
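For reference, a minimal sketch of pointing DuckDB's httpfs at a local MinIO endpoint like the one above; the credentials mirror the docker command, while the bucket and prefix are placeholders:

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class MinioHttpfsConfig {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            stmt.execute("INSTALL httpfs");
            stmt.execute("LOAD httpfs");
            // MinIO's S3 API listens on 9000 (the console is on 9090 in the command above)
            stmt.execute("SET s3_endpoint='localhost:9000'");
            stmt.execute("SET s3_url_style='path'");
            stmt.execute("SET s3_use_ssl=false");
            stmt.execute("SET s3_access_key_id='minio'");
            stmt.execute("SET s3_secret_access_key='minio123'");
            stmt.execute("SELECT count(*) FROM read_parquet('s3://<bucket>/<prefix>/*.parquet')");
        }
    }
}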