Intermittent connection failure "CURL_INFO_TEXT: error setting certificate file" while reprojecting GTI

Open underchemist opened this issue 2 weeks ago • 2 comments

What is the bug?

Possibly related to #12933, at least with respect to retry logic for ReadMultiRange not being available in a multithreaded context.

I have several GTI files, each referencing anywhere from ~100 to ~9k COG assets in S3. While reprojecting a GTI to a large mosaic GTiff, I occasionally observe a failure that causes the process to exit, so it has to be restarted.

Example command (single-threaded):

gdal raster reproject \
        --config CPL_DEBUG=ON \
        --config CPL_LOG=/dev/stderr \
        --config CPL_CURL_VERBOSE=YES \
        --config GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES \
        --config GDAL_HTTP_MAX_TOTAL_CONNECTIONS=24 \
        --config GDAL_HTTP_MAX_CACHED_CONNECTIONS=5 \
        --config GDAL_HTTP_MULTIPLEX=YES \
        --config GDAL_HTTP_MULTIRANGE=YES \
        --config GDAL_HTTP_VERSION=1.1 \
        --config GDAL_HTTP_MAX_RETRY=3 \
        --config GDAL_HTTP_RETRY_DELAY=3 \
        --config GDAL_HTTP_RETRY_CODES=429,500,502,503,504,0 \
        --config GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR \
        --config GDAL_MAX_DATASET_POOL_SIZE=512 \
        --config GDAL_MAX_DATASET_POOL_RAM_USAGE=50% \
        --config VSI_CACHE=TRUE \
        --config VSI_CACHE_SIZE=5MB \
        --config GDAL_CACHEMAX=25% \
        --config GDAL_NUM_THREADS=1 \
        --config CPL_VSIL_CURL_CACHE_SIZE=200MB \
        --config CPL_MAX_ERROR_REPORTS=10 \
        --resolution 1,1 \
        --dst-crs "EPSG:3857" \
        --progress \
        --bbox $bbox \
        --output-format GTiff \
        --co SPARSE_OK=YES \
        --co COMPRESS=DEFLATE \
        --co PREDICTOR=2 \
        --co INTERLEAVE=pixel \
        --co BLOCKXSIZE=512 \
        --co BLOCKYSIZE=512 \
        --co TILED=YES \
        --co BIGTIFF=YES \
        --wo NUM_THREADS=1 \
        --oo LAYER=index \
        "GTI:${inputfile}" \
        $tiff_raw_name

Snippets of successful vsicurl requests (this example is from a run where GDAL_NUM_THREADS=ALL_CPUS):

CURL_INFO_TEXT: Couldn't find host s3.us-west-2.amazonaws.com in the .netrc file; using defaults
CURL_INFO_TEXT: Connection 59 seems to be dead
CURL_INFO_TEXT: Closing connection
CURL_INFO_TEXT: TLSv1.3 (IN), TLS alert, close notify (256):
CURL_INFO_TEXT: TLSv1.3 (OUT), TLS alert, close notify (256):
CURL_INFO_TEXT: Host s3.us-west-2.amazonaws.com:443 was resolved.
CURL_INFO_TEXT: IPv6: (none)
CURL_INFO_TEXT: IPv4: 52.218.236.80, 52.92.133.144, 16.12.89.88, 52.218.242.192, 52.92.203.136, 52.92.136.200, 52.92.144.216, 52.218.253.144
CURL_INFO_TEXT:   Trying 52.218.236.80:443...
CURL_INFO_TEXT: Connected to s3.us-west-2.amazonaws.com (52.218.236.80) port 443
CURL_INFO_TEXT: ALPN: curl offers h2,http/1.1
CURL_INFO_TEXT: TLSv1.3 (OUT), TLS handshake, Client hello (1):
CURL_INFO_TEXT:  CAfile: /etc/ssl/certs/ca-certificates.crt
CURL_INFO_TEXT:  CApath: /etc/ssl/certs
CURL_INFO_TEXT: TLSv1.3 (IN), TLS handshake, Server hello (2):
CURL_INFO_TEXT: TLSv1.3 (IN), TLS handshake, Encrypted Extensions (8):
CURL_INFO_TEXT: TLSv1.3 (IN), TLS handshake, Certificate (11):
CURL_INFO_TEXT: TLSv1.3 (IN), TLS handshake, CERT verify (15):
CURL_INFO_TEXT: TLSv1.3 (IN), TLS handshake, Finished (20):
CURL_INFO_TEXT: TLSv1.3 (OUT), TLS change cipher, Change cipher spec (1):
CURL_INFO_TEXT: TLSv1.3 (OUT), TLS handshake, Finished (20):
CURL_INFO_TEXT: SSL connection using TLSv1.3 / TLS_AES_128_GCM_SHA256 / X25519 / RSASSA-PSS
CURL_INFO_TEXT: ALPN: server accepted http/1.1
CURL_INFO_TEXT: Server certificate:
CURL_INFO_TEXT:  subject: CN=*.s3-us-west-2.amazonaws.com
CURL_INFO_TEXT:  start date: Jul 16 00:00:00 2025 GMT
CURL_INFO_TEXT:  expire date: Jun 27 23:59:59 2026 GMT
CURL_INFO_TEXT:  subjectAltName: host "s3.us-west-2.amazonaws.com" matched cert's "s3.us-west-2.amazonaws.com"
CURL_INFO_TEXT:  issuer: C=US; O=Amazon; CN=Amazon RSA 2048 M01
CURL_INFO_TEXT:  SSL certificate verify ok.
CURL_INFO_TEXT:   Certificate level 0: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
CURL_INFO_TEXT:   Certificate level 1: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
CURL_INFO_TEXT:   Certificate level 2: Public key type RSA (2048/112 Bits/secBits), signed using sha256WithRSAEncryption
CURL_INFO_TEXT: using HTTP/1.x
CURL_INFO_HEADER_OUT: GET /bucket/to/cog.tif HTTP/1.1
Host: s3.us-west-2.amazonaws.com
User-Agent: GDAL/3.13.0
Accept: */*
Range: bytes=0-16383
X-Amz-Security-Token: 123
x-amz-date: 20251209T033137Z
x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Authorization: AWS4-HMAC-SHA256 Credential=ASIA25LBZOTSPRMH5SNB/20251209/us-west-2/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-security-token,Signature=ade706f90225f326f9e78db9d3d84d2476da03887e5830679242729a710b0acd

CURL_INFO_HEADER_IN: HTTP/1.1 206 Partial Content
CURL_INFO_HEADER_IN: x-amz-id-2: jT+HTCxPMRLdoefuFbSuY9gHnxJXHapftE8VIsroqoT1NUJUgn/SoMC+n5Ml/s+1RhodUJypQNQ=
CURL_INFO_HEADER_IN: x-amz-request-id: A2CH4299NZH89ENW
CURL_INFO_HEADER_IN: Date: Tue, 09 Dec 2025 03:31:38 GMT
CURL_INFO_HEADER_IN: Last-Modified: Fri, 30 Aug 2024 17:03:45 GMT
CURL_INFO_HEADER_IN: ETag: "e8f94f14e0dfe97ac16d370435e54770"
CURL_INFO_HEADER_IN: x-amz-server-side-encryption: AES256
CURL_INFO_HEADER_IN: Accept-Ranges: bytes
CURL_INFO_HEADER_IN: Content-Range: bytes 0-16383/118492
CURL_INFO_HEADER_IN: Content-Type: binary/octet-stream
CURL_INFO_HEADER_IN: Content-Length: 16384
CURL_INFO_HEADER_IN: Server: AmazonS3
CURL_INFO_HEADER_IN: 
CURL_INFO_TEXT: Connection #60 to host s3.us-west-2.amazonaws.com left intact
S3: GetFileSize(https://s3.us-west-2.amazonaws.com/bucket/to/cog.tif)=118492  response_code=206
S3: Downloading 16384-32767 (https://s3.us-west-2.amazonaws.com/bucket/to/cog.tif)...
CURL_INFO_TEXT: Couldn't find host s3.us-west-2.amazonaws.com in the .netrc file; using defaults
CURL_INFO_TEXT: Found bundle for host: 0x58a979846180 [serially]
CURL_INFO_TEXT: Can not multiplex, even if we wanted to
CURL_INFO_TEXT: Re-using existing connection with host s3.us-west-2.amazonaws.com
CURL_INFO_HEADER_OUT: GET /bucket/to/cog.tif HTTP/1.1
Host: s3.us-west-2.amazonaws.com
User-Agent: GDAL/3.13.0
Accept: */*
Range: bytes=16384-32767
X-Amz-Security-Token: 123
x-amz-date: 20251209T033137Z
x-amz-content-sha256: e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855
Authorization: AWS4-HMAC-SHA256 Credential=ASIA25LBZOTSPRMH5SNB/20251209/us-west-2/s3/aws4_request,SignedHeaders=host;x-amz-content-sha256;x-amz-date;x-amz-security-token,Signature=ade706f90225f326f9e78db9d3d84d2476da03887e5830679242729a710b0acd

CURL_INFO_HEADER_IN: HTTP/1.1 206 Partial Content
CURL_INFO_HEADER_IN: x-amz-id-2: jMHUd/+67X2UVFMV9EF+U8PR3QHO4PT9e/F8JCXdSX3zZPaYB97TPsBZJ88ch4l3v3j3XbXeZyQ=
CURL_INFO_HEADER_IN: x-amz-request-id: A2CGMP1BPCJWQ9WX
CURL_INFO_HEADER_IN: Date: Tue, 09 Dec 2025 03:31:38 GMT
CURL_INFO_HEADER_IN: Last-Modified: Fri, 30 Aug 2024 17:03:45 GMT
CURL_INFO_HEADER_IN: ETag: "e8f94f14e0dfe97ac16d370435e54770"
CURL_INFO_HEADER_IN: x-amz-server-side-encryption: AES256
CURL_INFO_HEADER_IN: Accept-Ranges: bytes
CURL_INFO_HEADER_IN: Content-Range: bytes 16384-32767/118492
CURL_INFO_HEADER_IN: Content-Type: binary/octet-stream
CURL_INFO_HEADER_IN: Content-Length: 16384
CURL_INFO_HEADER_IN: Server: AmazonS3
CURL_INFO_HEADER_IN: 
CURL_INFO_TEXT: Connection #60 to host s3.us-west-2.amazonaws.com left intact
S3: Got response_code=206
GTiff: Using up to 32 threads for compression/decompression
GDAL: GDALOpen(/vsis3/bucket/to/cog.tif, this=0x58aa2d4e5900) succeeds as GTiff.
VRT: Tile /vsis3/bucket/to/cog.tif has not the same SRS as the VRT. Proceed to on-the-fly warping
GDAL: Computing area of interest: -95.7427, 47.8008, -95.7377, 47.8027
GDAL: GDALDriver::Create(VRT,,743,417,1,Byte,0x58aa64ac3380)
WARP: Copying metadata from first source to destination dataset
GTiff: ScanDirectories()
GTiff: Opened 365x195 overview.
WARP: srcNoData=0.000000 dstNoData=0.000000
WARP: calling GDALSetRasterNoDataValue() for band#0
GDAL: Computing area of interest: -95.7427, 47.8008, -95.7377, 47.8027
GDAL: GDALDriver::Create(VRT,,774,435,2,Byte,0x58aa64ac3200)
WARP: Copying metadata from first source to destination dataset
WARP: SetAlphaMax: AlphaMax not set.
GDAL: GDALWarpKernel()::GWKNearestByte() Src=0,1155,451x141 Dst=0,1280,512x128
WARP: Using 1 threads
GDAL: GDALWarpKernel()::GWKNearestNoMasksOrDstDensityOnlyByte() Src=446,1172,497x141 Dst=512,1280,512x128
WARP: Using 1 threads
GDAL: GDALWarpKernel()::GWKNearestNoMasksOrDstDensityOnlyByte() Src=938,1190,497x141 Dst=1024,1280,512x128
WARP: Using 1 threads
GDAL: GDALWarpKernel()::GWKNearestByte() Src=0,1277,447x141 Dst=0,1408,512x128
WARP: Using 1 threads
GDAL: GDALWarpKernel()::GWKNearestNoMasksOrDstDensityOnlyByte() Src=441,1295,498x141 Dst=512,1408,512x128
WARP: Using 1 threads
GDAL: GDALWarpKernel()::GWKNearestNoMasksOrDstDensityOnlyByte() Src=933,1312,498x141 Dst=1024,1408,512x128
WARP: Using 1 threads

Snippets from where the connection fails later on in the same process:

S3: Downloading 298977-403163 (https://s3.us-west-2.amazonaws.com/bucket/to/cog.tif)...
CURL_INFO_TEXT: Couldn't find host s3.us-west-2.amazonaws.com in the .netrc file; using defaults
CURL_INFO_TEXT: Connection 0 seems to be dead
CURL_INFO_TEXT: Closing connection
CURL_INFO_TEXT: TLSv1.3 (IN), TLS alert, close notify (256):
CURL_INFO_TEXT: TLSv1.3 (OUT), TLS alert, close notify (256):
CURL_INFO_TEXT: Hostname s3.us-west-2.amazonaws.com was found in DNS cache
CURL_INFO_TEXT:   Trying 52.92.164.128:443...
CURL_INFO_TEXT: Connected to s3.us-west-2.amazonaws.com (52.92.164.128) port 443
CURL_INFO_TEXT: ALPN: curl offers h2,http/1.1
CURL_INFO_TEXT: TLSv1.3 (OUT), TLS handshake, Client hello (1):
CURL_INFO_TEXT: error setting certificate file: /etc/ssl/certs/ca-certificates.crt
CURL_INFO_TEXT: error setting certificate file: /etc/ssl/certs/ca-certificates.crt
CURL_INFO_TEXT: Closing connection
CURL_INFO_TEXT: error setting certificate file: /etc/ssl/certs/ca-certificates.crt
CURL_INFO_TEXT: error setting certificate file: /etc/ssl/certs/ca-certificates.crt
S3: ReadMultiRange(https://s3.us-west-2.amazonaws.com/bucket/to/cog.tif), 298977-403163: response_code=0, msg=error setting certificate file: /etc/ssl/certs/ca-certificates.crt
S3: Downloading 298977-403163 (https://s3.us-west-2.amazonaws.com/bucket/to/cog.tif)...
CURL_INFO_TEXT: Couldn't find host s3.us-west-2.amazonaws.com in the .netrc file; using defaults
CURL_INFO_TEXT: Hostname in DNS cache was stale, zapped
CURL_INFO_TEXT: getaddrinfo() thread failed to start
CURL_INFO_TEXT: Could not resolve host: s3.us-west-2.amazonaws.com
CURL_INFO_TEXT: Closing connection
S3: ReadMultiRange(https://s3.us-west-2.amazonaws.com/bucket/to/cog.tif), 298977-403163: response_code=0, msg=getaddrinfo() thread failed to start
S3: Downloading 298977-403163 (https://s3.us-west-2.amazonaws.com/bucket/to/cog.tif)...
CURL_INFO_TEXT: Couldn't find host s3.us-west-2.amazonaws.com in the .netrc file; using defaults
CURL_INFO_TEXT: getaddrinfo() thread failed to start
CURL_INFO_TEXT: Could not resolve host: s3.us-west-2.amazonaws.com
CURL_INFO_TEXT: Closing connection
S3: ReadMultiRange(https://s3.us-west-2.amazonaws.com/bucket/to/cog.tif), 298977-403163: response_code=0, msg=getaddrinfo() thread failed to start
S3: Downloading 298977-403163 (https://s3.us-west-2.amazonaws.com/bucket/to/cog.tif)...
CURL_INFO_TEXT: Couldn't find host s3.us-west-2.amazonaws.com in the .netrc file; using defaults
CURL_INFO_TEXT: getaddrinfo() thread failed to start
CURL_INFO_TEXT: Could not resolve host: s3.us-west-2.amazonaws.com
CURL_INFO_TEXT: Closing connection
S3: ReadMultiRange(https://s3.us-west-2.amazonaws.com/bucket/to/cog.tif), 298977-403163: response_code=0, msg=getaddrinfo() thread failed to start
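To see which errors dominate a long debug capture, the response_code=0 lines can be tallied by message. This is only a minimal sketch: it assumes the stderr output was saved to a file (the name gdal.log and the function name summarize_failures are hypothetical), and it relies solely on the "response_code=0, msg=..." format visible in the log lines above.

```shell
#!/usr/bin/env bash
# Tally failed requests (response_code=0) by error message.
# Usage: summarize_failures gdal.log
#   gdal.log is a hypothetical capture of the CPL_DEBUG output (e.g. via 2> gdal.log).
summarize_failures() {
    local logfile=$1
    # Keep only the part starting at "response_code=0, msg=", drop the prefix,
    # then count identical messages, most frequent first.
    grep -o 'response_code=0, msg=.*' "$logfile" \
        | sed 's/^response_code=0, msg=//' \
        | sort | uniq -c | sort -rn
}
```

A tally dominated by "getaddrinfo() thread failed to start" would suggest the process could no longer create threads, which may point at resource exhaustion rather than a network problem; that reading is an assumption, not something the log proves.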

I've played around with

  • GDAL_MAX_DATASET_POOL_SIZE
  • GDAL_HTTP_MAX_TOTAL_CONNECTIONS
  • GDAL_HTTP_MAX_CACHED_CONNECTIONS
  • *NUM_THREADS

but I haven't been able to isolate the exact cause of the failure.

With NUM_THREADS=1 and GDAL_HTTP_RETRY_CODES=429,500,502,503,504,0 I do observe retries (unlike in #12933, where I noted that I didn't), but these retries still fail with "error setting certificate file" and "getaddrinfo() thread failed to start".

Steps to reproduce the issue

Unfortunately I don't have a sample GTI I can share at the moment

Versions and provenance

Docker container based on ubuntu-full-latest

FROM ghcr.io/osgeo/gdal:ubuntu-full-latest

RUN apt update && apt upgrade -y \
  && apt install curl -y \
  && rm -rf /var/lib/apt/lists/*

RUN mkdir /usr/local/share/ca-certificates/cacert.org
RUN cd /usr/local/share/ca-certificates/cacert.org && curl -k -O https://www.cacert.org/certs/root.crt
RUN cd /usr/local/share/ca-certificates/cacert.org && curl -k -O https://www.cacert.org/certs/class3.crt
RUN update-ca-certificates
ENV CURL_CA_BUNDLE=/etc/ssl/certs/ca-certificates.crt

RUN apt update && apt install -y python3-pip && python3 -m pip install coiled dask distributed rio-cogeo --break-system-packages
RUN curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip" && unzip awscliv2.zip && ./aws/install

gdal --version
GDAL 3.13.0dev-c959ec2724ef45815b6484387c1894a56e75aacd, released 2025/12/07

Additional context

  • I'm able to reproject most of the GTI files without issue (including the one with 9k assets)
  • The same GTI files are consistently affected, but the underlying asset being requested when the failure occurs differs from run to run. I'm able to read the underlying assets without issue in a separate process.

underchemist avatar Dec 09 '25 20:12 underchemist

Did you try GDAL_HTTP_MULTIRANGE=SERIAL? I wouldn't expect it to make a difference, as normally the VSICurlHandle::ReadMultiRange() code path is only taken if GeoTIFF multithreaded decoding is enabled, which you don't have. The other explanation is that the process is running out of file descriptors. Normally GDAL_MAX_DATASET_POOL_SIZE=512 should be fine, as you use GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR and thus only one file per dataset should be kept open. Hard to debug with just the logs...

rouault avatar Dec 09 '25 23:12 rouault

Thanks for the suggestions, @rouault. I have tried both of them previously and still experience issues. I will update if I can recreate a minimal working example.

underchemist avatar Dec 09 '25 23:12 underchemist

Some more details:

I modified my workflow to simply copy all the source files locally and then create a GTI file, skipping vsicurl entirely. However, I also observed a failure in this case. Most recently I set

  • GDAL_MAX_DATASET_POOL_SIZE=2

and monitored the number of open file descriptors for the PID, and observed a few things. Initially, ls /proc/$PID/fd | wc -l returned values between 5 and 15 during gdal raster reproject, which I expected, as this is not a strict count of open files via GDALProxyPool. After a few hours, with the process still running, I noticed that the number of file descriptors for the process had jumped to more than 500. I'm not sure whether that indicates that at some point GDAL_MAX_DATASET_POOL_SIZE is no longer being respected, or that there is some other mechanism, which GDAL_MAX_DATASET_POOL_SIZE doesn't apply to, that could cause the number of file descriptors to increase rapidly.
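A bare count does not say what kind of descriptor is accumulating. A rough sketch of a per-kind breakdown follows, assuming a Linux /proc filesystem; the function names classify_fd_target, fd_summary and watch_fds are hypothetical helpers, not part of GDAL.

```shell
#!/usr/bin/env bash
# Classify one /proc/<pid>/fd link target into a coarse kind.
classify_fd_target() {
    case "$1" in
        socket:*)     echo socket ;;
        pipe:*)       echo pipe ;;
        anon_inode:*) echo anon_inode ;;
        *)            echo file ;;
    esac
}

# Print "<count> <kind>" lines for the descriptors a process holds (Linux only).
fd_summary() {
    local pid=$1 fd target
    for fd in /proc/"$pid"/fd/*; do
        # A descriptor may be closed while we iterate; skip it in that case.
        target=$(readlink "$fd" 2>/dev/null) || continue
        classify_fd_target "$target"
    done | sort | uniq -c
}

# Print a timestamped summary every <interval> seconds while the process is alive.
watch_fds() {
    local pid=$1 interval=${2:-60}
    while kill -0 "$pid" 2>/dev/null; do
        date -u +%FT%TZ
        fd_summary "$pid"
        sleep "$interval"
    done
}
```

Running watch_fds "$PID" 60 >> fd.log next to the reprojection would show whether the jump comes from regular files (which would be consistent with datasets staying open) or from sockets and pipes; the log file name is again just an example.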

As a workaround, I've attempted to raise the 1024 fd limit (confirmed on my system by ulimit -n) via prlimit --pid $PID --nofile=10000:10000 and will report back on whether that is successful.
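Since ulimit -n reports the limit of the current shell rather than that of an already running process, the new value can be read back from /proc/$PID/limits. A small sketch, assuming the usual Linux layout of that file, where the soft limit is the fourth whitespace-separated field of the "Max open files" row (soft_nofile_limit is a hypothetical helper name):

```shell
#!/usr/bin/env bash
# Print the soft "Max open files" limit from a file in /proc/<pid>/limits format.
# Usage on a live process: soft_nofile_limit "/proc/$PID/limits"
soft_nofile_limit() {
    # Row layout: "Max open files   <soft>   <hard>   files"
    awk '/^Max open files/ { print $4 }' "$1"
}
```

If this prints 10000 after the prlimit call, the change has been applied to the running process.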

underchemist avatar Dec 12 '25 01:12 underchemist

and monitored number of open file descriptors for the PID and observed a few things. Initially, ls /proc/$PID/fd | wc -l would return values between 5-15 during gdal raster reproject which I expected as this is not a strict count of open files via GDALProxyPool

I've tried to replicate this with a GTI indexing 2000 dummy local rasters. The maximum number of file descriptors open during the process was capped at 6 + GDAL_MAX_DATASET_POOL_SIZE. There must be something specific to your exact sources and command-line invocation that was missed by my test.

rouault avatar Dec 12 '25 13:12 rouault