
[Java] IO Error: Connection error for HTTP HEAD

Open NewtonVan opened this issue 1 year ago • 17 comments

What happens?

COPY (
    SELECT * FROM read_json_auto(${s3_src_file}, format='newline_delimited')
) TO ${s3_dst_file} (FORMAT PARQUET);

The query runs fine in the DuckDB CLI and in a demo (which uses the same API as the program that has the issue). In our own program, we use the AWS Java API to upload files, and DuckDB is used to perform a row-to-column conversion. It reports the following exception.

Caused by: java.sql.SQLException: Invalid Error: IO Error: Connection error for HTTP HEAD to '<src>.json'
	at org.duckdb.DuckDBNative.duckdb_jdbc_prepare(Native Method)
	at org.duckdb.DuckDBPreparedStatement.prepare(DuckDBPreparedStatement.java:106)
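
For illustration only, a minimal JDBC sketch of the kind of call involved (the bucket paths, region, and credentials below are placeholders, not the actual program):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class JsonToParquet {
    public static void main(String[] args) throws Exception {
        // In-memory DuckDB instance over JDBC (requires the duckdb_jdbc driver on the classpath)
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            // httpfs is needed for s3:// paths
            stmt.execute("INSTALL httpfs");
            stmt.execute("LOAD httpfs");

            // Placeholder region/credentials -- substitute real values
            stmt.execute("SET s3_region='us-east-1'");
            stmt.execute("SET s3_access_key_id='...'");
            stmt.execute("SET s3_secret_access_key='...'");

            // Row-to-column conversion: newline-delimited JSON -> Parquet
            String src = "s3://my-bucket/input.json";     // placeholder for ${s3_src_file}
            String dst = "s3://my-bucket/output.parquet"; // placeholder for ${s3_dst_file}
            stmt.execute(
                "COPY (SELECT * FROM read_json_auto('" + src + "', format='newline_delimited')) " +
                "TO '" + dst + "' (FORMAT PARQUET)");
        }
    }
}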

To Reproduce

I'm sorry, it's quite hard to reproduce. The same command runs well in both the DuckDB CLI and our unit tests. The only similar issues I found are #9232 and #9647. A reasonable guess is that our S3 client occupies some connection resources and DuckDB hits its limit.

OS:

macOS, Linux

DuckDB Version:

0.9.2

DuckDB Client:

java

Full Name:

yanhui chen

Affiliation:

ApeCloud

Have you tried this on the latest main branch?

I have tested with a main build

Have you tried the steps to reproduce? Do they include all relevant data and configuration? Does the issue you report still appear there?

  • [X] Yes, I have

NewtonVan avatar Dec 21 '23 01:12 NewtonVan

Are there any options for me to control DuckDB's connection configuration?

NewtonVan avatar Dec 21 '23 03:12 NewtonVan

@NewtonVan There's the keep-alive option, which might work as a workaround (https://github.com/duckdb/duckdb/pull/9648)

To use it, run SET http_keep_alive=false in duckdb

edit: I just realized this feature was not included in 0.9.2, so it is currently only available in the dev builds of duckdb
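
For the JDBC client, a minimal sketch of applying that workaround (hypothetical; assumes a DuckDB build newer than 0.9.2 where the setting exists and the duckdb_jdbc driver on the classpath):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class KeepAliveWorkaround {
    public static void main(String[] args) throws Exception {
        // In-memory DuckDB instance over JDBC
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            // Disable HTTP keep-alive so remote connections are not reused
            stmt.execute("SET http_keep_alive=false");
            // ... run the COPY / read_json_auto query on this connection ...
        }
    }
}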

samansmink avatar Jan 03 '24 10:01 samansmink

@NewtonVan quick question: does the ${s3_src_file} contain a glob pattern? If so, how many files are there and how many rows per json file are there?

samansmink avatar Jan 05 '24 10:01 samansmink

Facing the same issue when using Python

import duckdb

con = duckdb.connect()
con.install_extension("httpfs")
con.load_extension("httpfs")
con.execute("SELECT * FROM 'https://raw.githubusercontent.com/duckdb/duckdb-web/main/data/weather.csv';")

Failed after about 300 seconds of execution

IO Error: SSLConnection error for HTTP HEAD to 'https://raw.githubusercontent.com/duckdb/duckdb-web/main/data/weather.csv'

Python 3.10
duckdb                    0.8.1
duckdb-engine             0.9.2

daviddwlee84 avatar Jan 12 '24 03:01 daviddwlee84

@samansmink http_keep_alive isn't available in the latest dev version either (in python).

erezinman avatar Jan 18 '24 21:01 erezinman

@samansmink It seems the problem is bigger than that: the connections are apparently not closed even when the connection object is garbage-collected in Python. The error persists even if a new duckdb connection is created on every use.

erezinman avatar Jan 18 '24 22:01 erezinman

Another observation: this error has to do with the number of threads. On a thread that had previously created a (single) connection, a single file read (in my case using read_parquet) will not raise an exception, but reading multiple files will.

erezinman avatar Jan 18 '24 22:01 erezinman

@samansmink http_keep_alive isn't available in the latest dev version either (in python).

could you try:

python3 -m pip install duckdb==0.9.3.dev2934
Python 3.11.6 (main, Oct  2 2023, 20:46:14) [Clang 14.0.3 (clang-1403.0.22.14.1)] on darwin
>>> import duckdb
>>> con = duckdb.connect()
>>> con.query('set http_keep_alive=false;')
>>>

sometimes the extensions for pre-releases aren't uploaded yet, but this should work

Could you provide us with a reproducible script? Right now I'm not really sure how to reproduce this. I tried running the Python query you provided for an extended time, but I could not get it to fail.

samansmink avatar Jan 22 '24 09:01 samansmink

@samansmink

I might try to reproduce it later. It happened inside a very complex system, within a thread pool's operation (~30 threads), after reading ~100 parquet files from a GCS backend using httpfs.

erezinman avatar Jan 22 '24 11:01 erezinman

We see this when trying to read a parquet glob from S3 too, so I don't think it's just JSON.

cddr avatar Jan 24 '24 14:01 cddr

Facing the same issue when doing a read_parquet glob on a GCP bucket.

mustafahasankhan avatar Feb 06 '24 11:02 mustafahasankhan

Folks, I was able to reproduce the issue when there are 1200+ parquet files in a hive-partitioned prefix, but not when there are 600 files. Note: I just copied the same file repeatedly, and even then it raises the same error. The file has 10K records and is 20 MB in size.

mustafahasankhan avatar Feb 08 '24 13:02 mustafahasankhan

I have done a little more investigating and I can definitely reproduce this, but only when I increase the thread count to pretty high numbers (~300). I guess this is expected behaviour though and not really a bug. Also, disabling http_keep_alive completely solves the issue, afaict.

If people have a reproduction of this with a relatively low number of files and threads, I'm very interested. Ideally I would get:

  • a data set which can easily trigger this
  • the thread count required to trigger this

Otherwise some detailed info on the dataset is also helpful:

  • in case of parquet especially: row_group_size, tuples per file, total files, number of columns

oh and please test on v0.10.0 :)

@mustafahasankhan do you know at what thread count this occurs and whether set http_keep_alive=false;, which is now available, resolves the issue?
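
To probe this, a small sketch combining the settings discussed above (the thread count and the glob path are hypothetical placeholders, not a confirmed reproduction):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class ConnectionErrorProbe {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection("jdbc:duckdb:");
             Statement stmt = conn.createStatement()) {
            stmt.execute("INSTALL httpfs");
            stmt.execute("LOAD httpfs");

            stmt.execute("SET threads=300");           // roughly the thread count mentioned above
            stmt.execute("SET http_keep_alive=false"); // the workaround being tested

            // Placeholder glob -- point this at a prefix with many parquet files
            try (ResultSet rs = stmt.executeQuery(
                    "SELECT count(*) FROM read_parquet('s3://my-bucket/prefix/*.parquet')")) {
                while (rs.next()) {
                    System.out.println(rs.getLong(1));
                }
            }
        }
    }
}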

samansmink avatar Mar 01 '24 16:03 samansmink

@samansmink I was able to reproduce it with a thread count of even 1 with 1200+ files. Yes, set http_keep_alive=false; seems to resolve the issue with v0.10.0.

mustafahasankhan avatar Mar 04 '24 09:03 mustafahasankhan

@mustafahasankhan thanks, that's interesting, that definitely sounds like a bug.

Could you give me access to this dataset? If not, could you give me some more details with which I can reproduce this? You can ping me at [email protected]

samansmink avatar Mar 04 '24 10:03 samansmink

For those using MinIO in Docker, use port 9000:9000 instead of 9000:9001.

docker run -p 9000:9000 -p 9090:9090 --name minio_name -e "MINIO_ROOT_USER=minio" -e "MINIO_ROOT_PASSWORD=minio123" -v ${HOME}/minio/data:/data quay.io/minio/minio server /data --console-address ":9090"

In my case it worked.

chlimaferreira avatar May 22 '24 13:05 chlimaferreira