vsicurl cannot handle streamed responses
Expected behavior and actual behavior.
Unable to open a remote resource that has been dynamically generated, using vsicurl.
Steps to reproduce the problem.
ogrinfo --debug on --config CPL_CURL_VERBOSE YES -oo X_POSSIBLE_NAMES=decimalLongitude -oo Y_POSSIBLE_NAMES=decimalLatitude 'CSV:/vsizip/{/vsicurl/https://ipt.nina.no/archive.do?r=arko_gel&v=1.12}/occurrence.txt'
Output:
HTTP: libcurl/8.0.1 OpenSSL/3.0.8 zlib/1.2.13 brotli/1.0.9 libidn2/2.3.4 libpsl/0.21.2 (+libidn2/2.3.4) libssh/0.10.4/openssl/zlib nghttp2/1.52.0
HTTP: GDAL was built against curl 7.87.0, but is running against 8.0.1.
CURL_INFO_TEXT: Couldn't find host ipt.nina.no in the (nil) file; using defaults
CURL_INFO_TEXT: Trying 158.38.174.15:443...
CURL_INFO_TEXT: Connected to ipt.nina.no (158.38.174.15) port 443 (#0)
CURL_INFO_TEXT: ALPN: offers h2,http/1.1
CURL_INFO_TEXT: TLSv1.3 (OUT), TLS handshake, Client hello (1):
CURL_INFO_TEXT: CAfile: /etc/pki/tls/certs/ca-bundle.crt
CURL_INFO_TEXT: CApath: none
CURL_INFO_TEXT: TLSv1.3 (IN), TLS handshake, Server hello (2):
CURL_INFO_TEXT: TLSv1.2 (IN), TLS handshake, Certificate (11):
CURL_INFO_TEXT: TLSv1.2 (IN), TLS handshake, Server key exchange (12):
CURL_INFO_TEXT: TLSv1.2 (IN), TLS handshake, Server finished (14):
CURL_INFO_TEXT: TLSv1.2 (OUT), TLS handshake, Client key exchange (16):
CURL_INFO_TEXT: TLSv1.2 (OUT), TLS change cipher, Change cipher spec (1):
CURL_INFO_TEXT: TLSv1.2 (OUT), TLS handshake, Finished (20):
CURL_INFO_TEXT: TLSv1.2 (IN), TLS handshake, Finished (20):
CURL_INFO_TEXT: SSL connection using TLSv1.2 / ECDHE-RSA-AES256-GCM-SHA384
CURL_INFO_TEXT: ALPN: server accepted h2
CURL_INFO_TEXT: Server certificate:
CURL_INFO_TEXT: subject: C=NO; ST=Trøndelag; O=Stiftelsen norsk institutt for naturforskning NINA; CN=ipt.nina.no
CURL_INFO_TEXT: start date: Mar 30 00:00:00 2023 GMT
CURL_INFO_TEXT: expire date: Mar 29 23:59:59 2024 GMT
CURL_INFO_TEXT: subjectAltName: host "ipt.nina.no" matched cert's "ipt.nina.no"
CURL_INFO_TEXT: issuer: C=GB; ST=Greater Manchester; L=Salford; O=Sectigo Limited; CN=Sectigo RSA Organization Validation Secure Server CA
CURL_INFO_TEXT: SSL certificate verify ok.
CURL_INFO_TEXT: using HTTP/2
CURL_INFO_TEXT: h2h3 [:method: HEAD]
CURL_INFO_TEXT: h2h3 [:path: /archive.do?r=arko_gel&v=1.12]
CURL_INFO_TEXT: h2h3 [:scheme: https]
CURL_INFO_TEXT: h2h3 [:authority: ipt.nina.no]
CURL_INFO_TEXT: h2h3 [accept: */*]
CURL_INFO_TEXT: Using Stream ID: 1 (easy handle 0x5625c3053740)
CURL_INFO_HEADER_OUT: HEAD /archive.do?r=arko_gel&v=1.12 HTTP/2
Host: ipt.nina.no
accept: */*
CURL_INFO_HEADER_IN: HTTP/2 200
CURL_INFO_HEADER_IN: server: nginx
CURL_INFO_HEADER_IN: date: Thu, 16 Nov 2023 15:47:43 GMT
CURL_INFO_HEADER_IN: content-type: application/zip;charset=ISO-8859-1
CURL_INFO_HEADER_IN: access-control-allow-origin: *
CURL_INFO_HEADER_IN: access-control-allow-methods: GET, OPTIONS, HEAD
CURL_INFO_HEADER_IN: set-cookie: JSESSIONID=4B6828221A77DC32CF98E10E226CC45D; Path=/; Secure; HttpOnly
CURL_INFO_HEADER_IN: set-cookie: CSRFtoken=3izphOH65BQcxRFLjOS7t5GUZLPtDCVQ; Max-Age=900; Expires=Thu, 16-Nov-2023 16:02:43 GMT; Domain=ipt.nina.no; Secure; HttpOnly
CURL_INFO_HEADER_IN: content-disposition: filename="dwca-arko_gel-v1.12.zip"
CURL_INFO_HEADER_IN: content-language: en-GB
CURL_INFO_HEADER_IN:
CURL_INFO_TEXT: Connection #0 to host ipt.nina.no left intact
VSICURL: HEAD did not provide file size. Retrying with GET
CURL_INFO_TEXT: Couldn't find host ipt.nina.no in the (nil) file; using defaults
CURL_INFO_TEXT: Found bundle for host: 0x5625c304fba0 [can multiplex]
CURL_INFO_TEXT: Re-using existing connection #0 with host ipt.nina.no
CURL_INFO_TEXT: h2h3 [:method: GET]
CURL_INFO_TEXT: h2h3 [:path: /archive.do?r=arko_gel&v=1.12]
CURL_INFO_TEXT: h2h3 [:scheme: https]
CURL_INFO_TEXT: h2h3 [:authority: ipt.nina.no]
CURL_INFO_TEXT: h2h3 [accept: */*]
CURL_INFO_TEXT: Using Stream ID: 3 (easy handle 0x5625c3053740)
CURL_INFO_HEADER_OUT: GET /archive.do?r=arko_gel&v=1.12 HTTP/2
Host: ipt.nina.no
accept: */*
CURL_INFO_HEADER_IN: HTTP/2 200
CURL_INFO_HEADER_IN: server: nginx
CURL_INFO_HEADER_IN: date: Thu, 16 Nov 2023 15:47:43 GMT
CURL_INFO_HEADER_IN: content-type: application/zip;charset=ISO-8859-1
CURL_INFO_HEADER_IN: access-control-allow-origin: *
CURL_INFO_HEADER_IN: access-control-allow-methods: GET, OPTIONS, HEAD
CURL_INFO_HEADER_IN: set-cookie: JSESSIONID=B3C461DAF4F8F7557793AE4A89671C5B; Path=/; Secure; HttpOnly
CURL_INFO_HEADER_IN: set-cookie: CSRFtoken=aNzUY4o5T8c3WodGxujjzJYYiiqHKu1K; Max-Age=900; Expires=Thu, 16-Nov-2023 16:02:43 GMT; Domain=ipt.nina.no; Secure; HttpOnly
CURL_INFO_HEADER_IN: content-disposition: filename="dwca-arko_gel-v1.12.zip"
CURL_INFO_HEADER_IN: content-language: en-GB
CURL_INFO_HEADER_IN:
CURL_INFO_TEXT: Failure writing output to destination
CURL_INFO_TEXT: Connection #0 to host ipt.nina.no left intact
VSICURL: GetFileSize(https://ipt.nina.no/archive.do?r=arko_gel&v=1.12): response_code=200, curl error msg=Failure writing output to destination
VSICURL: Request at offset 0, after end of file
VSICURL: Request at offset 0, after end of file
VSICURL: Request at offset 0, after end of file
[...]
FAILURE:
Unable to open datasource `CSV:/vsizip/{/vsicurl/https://ipt.nina.no/archive.do?r=arko_gel&v=1.12}/occurrence.txt' with the following drivers.
[...]
Setting use_head=no does not fix the issue, nor does using vsicurl_streaming.
It works if I save the ZIP locally and serve it with a simple file server.
Using mitmproxy, I can see that the file is retrieved correctly and in full, so I wonder why GDAL cannot open it.
I used varnish to disable the streamed response, so that it returns Content-Length and Accept-Ranges headers. This is a valid workaround, and it provides a mechanism to cache the file as well: https://gist.github.com/frafra/cdfc98cdbbe93bbdb73ed6363c5c613f
Operating system
Fedora 38 x86_64.
GDAL version and provenance
GDAL 3.6.4 from Fedora official repositories.
/vsicurl/ doesn't work with all HTTP servers, and in particular generally not with ones that generate content dynamically. The server needs to support arbitrary Range requests and report the file size in Content-Length.
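To make the two requirements concrete, here is a minimal Python sketch that inspects a set of response headers and reports whether they advertise a size and byte-range support. It is only a heuristic and not GDAL's actual detection logic: the function name `vsicurl_random_access_ok` is made up for illustration, the "static" headers are invented values, and some servers honour Range requests without sending Accept-Ranges.

```python
from typing import Mapping


def vsicurl_random_access_ok(headers: Mapping[str, str]) -> bool:
    """Return True if the headers advertise a file size and byte ranges."""
    # HTTP header names are case-insensitive, so normalise them first.
    lowered = {name.lower(): value.strip() for name, value in headers.items()}
    has_size = lowered.get("content-length", "").isdigit()
    has_ranges = lowered.get("accept-ranges", "").lower() == "bytes"
    return has_size and has_ranges


# Headers taken from the HEAD response in the log above: no size, no ranges.
dynamic = {
    "content-type": "application/zip;charset=ISO-8859-1",
    "content-disposition": 'filename="dwca-arko_gel-v1.12.zip"',
}
# Hypothetical headers from a static file server (values are made up).
static = {"Content-Length": "123456", "Accept-Ranges": "bytes"}

print(vsicurl_random_access_ok(dynamic))  # False
print(vsicurl_random_access_ok(static))   # True
```

The first case matches what the debug output shows: the server answers 200 without Content-Length, so the file size cannot be determined from the headers.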
/vsicurl_streaming/ is only usable for formats that can be read from start to end without seeking, and ZIP does not allow that, since reading a ZIP file requires reading the central directory located at the end of the file.
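The point about the directory sitting at the end of the file can be checked with a small Python sketch that uses only the standard library. It builds a tiny archive in memory (the member content is invented), locates the End of Central Directory record, and follows its pointer back to the central directory. This illustrates the ZIP layout only; it says nothing about how GDAL's /vsizip/ is implemented.

```python
import io
import struct
import zipfile

# Build a small archive in memory; the member name mirrors the one in this issue.
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("occurrence.txt", "id\tdecimalLatitude\tdecimalLongitude\n1\t63.4\t10.4\n")
data = buf.getvalue()

# The End of Central Directory (EOCD) record starts with the signature PK\x05\x06.
eocd_offset = data.rfind(b"PK\x05\x06")

# Bytes 16..19 of the EOCD hold the offset of the central directory itself.
(cd_offset,) = struct.unpack("<I", data[eocd_offset + 16 : eocd_offset + 20])

print("archive size:", len(data))
print("EOCD record at:", eocd_offset)
print("central directory at:", cd_offset)
```

A reader has to reach the last bytes of the archive before it knows where the member list starts, which is why a forward-only stream is not enough unless the whole body is buffered first.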
So I don't believe there's anything that can be fixed on the GDAL side.
Wouldn't it be possible to mimic what vsistdin is doing, by adding a buffer_limit option to vsicurl_streaming? This works great:
curl 'https://ipt.nina.no/archive.do?r=arko_gel&v=1.12' | ogrinfo -oo X_POSSIBLE_NAMES=decimalLongitude -oo Y_POSSIBLE_NAMES=decimalLatitude 'CSV:/vsizip/{/vsistdin?buffer_limit=-1}/occurrence.txt'
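The idea behind the pipe above is to buffer the whole response before any seeking happens. A rough Python equivalent, outside GDAL, might look like the sketch below. The helper name `read_member_from_stream` is hypothetical, and the demo uses a tiny in-memory archive in place of the real HTTP response body.

```python
import io
import zipfile
from typing import BinaryIO


def read_member_from_stream(stream: BinaryIO, member: str) -> bytes:
    """Buffer a forward-only stream fully in memory, then read one ZIP member."""
    # No size limit here, comparable in spirit to buffer_limit=-1.
    buffered = io.BytesIO(stream.read())
    with zipfile.ZipFile(buffered) as archive:
        return archive.read(member)


# Self-contained demo: a tiny archive stands in for the HTTP response body.
demo = io.BytesIO()
with zipfile.ZipFile(demo, "w") as zf:
    zf.writestr("occurrence.txt", "id\n1\n")
demo.seek(0)

print(read_member_from_stream(demo, "occurrence.txt"))  # b'id\n1\n'
```

With a real server, the object returned by `urllib.request.urlopen(url)` could presumably be passed as `stream`, since it exposes `read()`. The cost is that the full archive is held in memory, which is the trade-off an unlimited buffer implies.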
closing as I don't foresee any further action on this