duckdb_spatial icon indicating copy to clipboard operation
duckdb_spatial copied to clipboard

`st_read` continually fails if first attempt was unable to connect to target host

Open galloviolet opened this issue 1 year ago • 11 comments

Below is a minimal repro of the issue I am seeing with the Python SQLAlchemy DBAPI when attempting to use st_read to load Excel files. In this case I'm using in-memory DuckDB, with a local SeaweedFS S3 store which runs as a webservice in a Docker container.

import sqlalchemy

connect_args = {'config': {'s3_region': 'myregion', 's3_access_key_id': 'mykey', 's3_secret_access_key': 'mysecret', 's3_endpoint': 'hostname:port', 's3_url_style': 'path', 's3_use_ssl': False}}
engine = sqlalchemy.create_engine('duckdb:///:memory:', connect_args=connect_args)
# manually bring down the s3 host
try:
    with engine.connect() as conn:
        conn.execute("INSTALL spatial; LOAD spatial;")
        result = conn.execute("SELECT * FROM st_read('http://hostname:port/bucket/file.xlsx', layer='sheet1')") # IO Error: GDAL Error (1): Could not resolve host: hostname
        # subsequent identical conn.execute here also gives the GDALOpen() called recursively error
finally:
    engine.dispose()
# manually start the s3 service at host:port where bucket/file.xlsx exists
try:
    with engine.connect() as conn:
        conn.execute("INSTALL spatial; LOAD spatial;")
        result = conn.execute("SELECT * FROM st_read('http://hostname:port/bucket/file.xlsx', layer='sheet1')") # IO Error: GDAL Error (1): GDALOpen() called on http://hostname:port/bucket/file.xlsx recursively
finally:
    engine.dispose()

What I am trying to illustrate here is that, if st_read fails to connect to the target host on the first attempt, all subsequent attempts will also fail even if the host becomes responsive and the file exists in the location specified. In other words, without simulating a network outage by manually bringing down my S3 service and bringing it back up after the first failed attempt, all of these st_read calls work fine.

To me it's particularly problematic that this still occurs if I call engine.dispose() because in that case I expect a "fresh" connection without knowledge of the past failure. But even without engine.dispose() I would prefer st_read to try again and not throw the "called recursively" error in order to handle (common) real-world cases where connectivity was temporarily lost and we want to try again.

I'm seeing this on duckdb version 0.9.1, but have observed the same behavior on 0.9.3-dev1996.

galloviolet avatar Dec 28 '23 22:12 galloviolet

Hi! Thanks for reporting this issue! The filesystem abstractions have gone through a lot of changes since the last release and latest development release was issued and I suspect this has already been fixed. See eg #182 #218 for related excel issues.

You can get the latest spatial version (built every time a PR is merged to main in this repo) for DuckDB 0.9.1 by executing:

FORCE INSTALL spatial FROM 'http://nightly-extensions.duckdb.org';

Note that this is a newer version than the one you get if you use DuckDB 0.9.3.

Also, do you have the httpfs extension installed?

Maxxen avatar Dec 29 '23 09:12 Maxxen

Also, do you have the httpfs extension installed?

Based on the docs I didn't think there was a need to install it explicitly. Is that incorrect? Besides this issue, I am in general able to connect to S3 using st_read similar to my example above.

I went ahead and explicitly installed httpfs as well as the latest version of spatial as you suggested, and I'm now seeing a different issue. The first attempt at st_read returns an "invalid file type" error, and then all subsequent attempts give a "GDALOpen() called on [file] recursively" error. This is all with the S3 host live and able to receive connections, I never brought it down. Here is the new code example:

import sqlalchemy

connect_args = {'config': {'s3_region': 'myregion', 's3_access_key_id': 'mykey', 's3_secret_access_key': 'mysecret', 's3_endpoint': 'hostname:port', 's3_url_style': 'path', 's3_use_ssl': False}}
engine = sqlalchemy.create_engine('duckdb:///:memory:', connect_args=connect_args)
try:
    with engine.connect() as conn:
        conn.execute("INSTALL httpfs; LOAD httpfs;")
        conn.execute("FORCE INSTALL spatial FROM 'http://nightly-extensions.duckdb.org'; LOAD spatial;")
        conn.execute("SELECT * FROM st_read('http://hostname:port/bucket/file.xlsx', layer='sheet1')") # IO Error: Unknown file type
        conn.execute("SELECT * FROM st_read('http://hostname:port/bucket/file.xlsx', layer='sheet1')") # IO Error: GDAL Error (1): GDALOpen() called on http://hostname:port/bucket/file.xlsx recursively
finally:
    engine.dispose()

galloviolet avatar Dec 29 '23 17:12 galloviolet

Alright, thanks for the thorough investigation! In general you do/will need the httpfs extension afaik it doesnt autoload properly when spatial tries to issue a remote call, but that might also have changed. Ill look into this properly when im back from vacation in a couple of weeks!

Maxxen avatar Dec 29 '23 18:12 Maxxen

So im investigating this more now. My working theory is that the "called recursively" errors is due to us just being a little too aggressive when erroring and throwing exceptions when we can't determine the file type, which doesn't allow GDAL to do the proper cleaning up, which causes the "called recursively" errors. I think I have a fix for this as its pretty simple.

What Im still trying to figure out is why you get the unknown file type error, I feel like we got all the cases covered there, but there must be something rewriting the path so that its no longer recognizable as a remote file. Are you on Windows perhaps?

Maxxen avatar Jan 10 '24 14:01 Maxxen

No, I am running Arch Linux with kernel version "Linux 6.6.10-arch1-1".

galloviolet avatar Jan 10 '24 17:01 galloviolet

Also, looking into this again just now, I still see the "unknown file type" error on duckdb 0.9.1 together with the latest version of spatial from nightly. But now, if I use pip to upgrade duckdb to 0.9.3.dev2452 I instead see a failure when trying to run FORCE INSTALL spatial FROM 'http://nightly-extensions.duckdb.org'; LOAD spatial;.

This may be expected as you work on a resolution, but I just thought I'd point it out.

E               sqlalchemy.exc.OperationalError: (duckdb.duckdb.HTTPException) HTTP Error: Failed to download extension "spatial" at URL "http://nightly-extensions.duckdb.org/8793ff3a2e/linux_amd64_gcc4/spatial.duckdb_extension.gz"
E               Extension "spatial" is an existing extension.
E               
E               Are you using a development build? In this case, extensions might not (yet) be uploaded.
E               [SQL: FORCE INSTALL spatial FROM 'http://nightly-extensions.duckdb.org'; LOAD spatial;]
E               (Background on this error at: https://sqlalche.me/e/14/e3q8)

galloviolet avatar Jan 10 '24 22:01 galloviolet

One more happier observation. The same set of steps actually worked without issue on duckdb 0.9.2 together with the latest spatial from nightly.

Edit: it even works with the default version of spatial I get from just "INSTALL spatial". If I try to load a nonexistent file like "file.abcdef" then the second failure is the "recursive" error on 0.9.1, but not on 0.9.2, regardless of whether I use the default or nightly version of spatial.

galloviolet avatar Jan 10 '24 22:01 galloviolet

Hmm, I think maybe you needed to reload to have the new extension apply. Nevertheless I've fixed the issue with getting stuck on the recursive errors in #227, so if this doesn't occur again I think we can close this as fixed.

Maxxen avatar Jan 11 '24 10:01 Maxxen

Not sure if I totally understand what you mean by needing to reload, but I still see the same error about Extension "spatial" is an existing extension if I try the test again today on duckdb-0.9.3.dev2490. As long as that is, as the error suggests, just an issue with the extension not being uploaded quite yet to that newest build, I agree we should be okay.

galloviolet avatar Jan 11 '24 17:01 galloviolet

Hello!

As of DuckDB 1.2.0 the duckdb excel extension provides support for reading and writing xlsx files with greater efficiency and many more options. Therefore we're going to deprecate this feature in the spatial extension in the future. I would recommend that you try out the excel extension, and if you run into any problems please open an issue over at the duckdb-excel repository.

Maxxen avatar Feb 05 '25 21:02 Maxxen

Oh shit, Im closing a bunch of old xlsx related issues but didn't realize that this was about something more, my bad!

Maxxen avatar Feb 05 '25 21:02 Maxxen