Feature Request: Retry failed page requests in ESRIJSON driver
What is the bug?
Esri FeatureServers tend to be quite finicky, and it's common for a single request to fail when paging through features. Retrying failed requests would likely alleviate these blips, but setting GDAL_HTTP_MAX_RETRY to a positive integer appears to have no effect.
Steps to reproduce the issue
Since this affects unreliable services, it's not very easy to reproduce. However, here's an attempt: iterate over all features of a finicky server with CPL_DEBUG enabled so that remote calls are printed.
You'll notice that, when the FeatureServer fails to return a page, GDAL doesn't perform any retries:
import osgeo.ogr
import osgeo.gdal

url = "ESRIJSON:https://services.arcgis.com/v01gqwM5QqNysAAi/arcgis/rest/services/Protection_Mechanism_Category_PADUS/FeatureServer/0/query?where=1%3D1&outfields=%2A&f=json&geometryPrecision=6&orderByFields=OBJECTID+ASC&resultRecordCount"

with osgeo.gdal.config_options(
    {
        "GDAL_HTTP_MAX_RETRY": "5",
        "CPL_DEBUG": "ON",
    }
):
    ds = osgeo.ogr.Open(url)
    lyr = ds.GetLayer()
    for idx, _ in enumerate(range(lyr.GetFeatureCount())):
        lyr.GetNextFeature()
        if idx % 2000 == 0:  # Max page size for this service
            print(idx)
Versions and provenance
GDAL version: GDAL 3.10.2, released 2025/02/11
Operating system: macOS Sonoma 14.5 (23F79)
Installed via Homebrew
Additional context
As a side note, many Esri FeatureServers out there seem to be under-provisioned and error on larger page sizes (even when within the advertised maximum). Reducing the page size by adding e.g. resultRecordCount=100 seems to alleviate the issue in some cases (but not all). Of course, this isn't optimal -- it makes paginating through a Feature Service noticeably slower and, in many cases, one of these pages still fails anyway (taking the entire process down with it).
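For what it's worth, the workaround above can be applied programmatically. This is just a sketch: with_page_size and the example URL are hypothetical, though resultRecordCount is the real Esri REST API parameter for capping the page size.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

def with_page_size(query_url: str, page_size: int) -> str:
    """Return query_url with resultRecordCount forced to page_size."""
    parts = urlsplit(query_url)
    # keep_blank_values preserves valueless parameters like a bare f=
    params = dict(parse_qsl(parts.query, keep_blank_values=True))
    params["resultRecordCount"] = str(page_size)  # add or overwrite
    return urlunsplit(parts._replace(query=urlencode(params)))

# Hypothetical FeatureServer query URL:
url = "https://example.com/FeatureServer/0/query?where=1%3D1&f=json"
print(with_page_size(url, 100))
# → https://example.com/FeatureServer/0/query?where=1%3D1&f=json&resultRecordCount=100
```

The resulting URL can then be passed to the driver with the usual ESRIJSON: prefix.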
When the request fails, what is the HTTP or curl error?
Oh of course, I forgot to mention that. Here's the output on a failed request, with both CPL_DEBUG=ON and CPL_CURL_VERBOSE=YES (as well as GDAL_HTTP_MAX_RETRY=5, which is apparently not used):
CURL_INFO_HEADER_IN: HTTP/2 200
CURL_INFO_HEADER_IN: content-type: text/plain
CURL_INFO_HEADER_IN: content-length: 75
CURL_INFO_HEADER_IN: date: Thu, 05 Sep 2024 09:40:30 GMT
CURL_INFO_HEADER_IN: last-modified: Wed, 09 Nov 2022 19:03:24 GMT
CURL_INFO_HEADER_IN: etag: "e2651e71c06f4a6d095cb118ebfc79e2"
CURL_INFO_HEADER_IN: x-amz-server-side-encryption: AES256
CURL_INFO_HEADER_IN: cache-control: max-age=0, must-revalidate
CURL_INFO_HEADER_IN: accept-ranges: bytes
CURL_INFO_HEADER_IN: server: AmazonS3
CURL_INFO_HEADER_IN: x-cache: Error from cloudfront
CURL_INFO_HEADER_IN: via: 1.1 1cbf6d6ef405e8e3fa256f628b03d41a.cloudfront.net (CloudFront)
CURL_INFO_HEADER_IN: x-amz-cf-pop: MAD56-P2
CURL_INFO_HEADER_IN: x-amz-cf-id: jV4Sv-3CR1cJAzVOMUW2vQsyVxEtTKtCnWcC3OzH5-_j8GpOF2p-uQ==
CURL_INFO_HEADER_IN: age: 18236651
CURL_INFO_HEADER_IN:
CURL_INFO_TEXT: Connection #0 to host services.arcgis.com left intact
---------------------------------------------------------------------------
RuntimeError Traceback (most recent call last)
Cell In[14], line 15
13 lyr = ds.GetLayer()
14 for idx, _ in enumerate(range(lyr.GetFeatureCount())):
---> 15 lyr.GetNextFeature()
16 if idx % page_size == 0:
17 print(idx)
File /opt/homebrew/lib/python3.11/site-packages/osgeo/ogr.py:1036, in Layer.GetNextFeature(self, *args)
1022 def GetNextFeature(self, *args):
1023 r"""
1024 GetNextFeature(Layer self) -> Feature
1025
(...)
1034
1035 """
-> 1036 return _ogr.Layer_GetNextFeature(self, *args)
RuntimeError: Failed to read ESRIJSON data
May be caused by: Invalid FeatureCollection object. Missing 'features' member.
HTTP 200 means that the ESRI server thinks the response was OK, so GDAL_HTTP_MAX_RETRY does not trigger a retry. https://developer.mozilla.org/en-US/docs/Web/HTTP/Reference/Status/200
You probably have a support contract with your ESRI dealer; contact them and try to get a bug report accepted. It is hard, but I have succeeded a couple of times. They may say that because their server is being used with a non-ESRI client, they will not study the case.
Right, it makes sense that GDAL doesn't retry because it received an HTTP 200 status code.
However, it's well known that Esri servers return 200s even on errors, and include an error message inside the JSON response. Here's one that I just got from that server:
{'error': {'code': 504,
'message': 'Your request has timed out.',
'details': []}}
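As an illustration of what the driver could check, here is a minimal sketch (is_esri_error is a hypothetical helper, not part of GDAL): a page whose JSON body carries an "error" member is a failure, even though the HTTP status is 200.

```python
import json

def is_esri_error(page_text: str) -> bool:
    """Return True if the page is Esri's JSON-wrapped error envelope."""
    try:
        body = json.loads(page_text)
    except ValueError:
        return True  # not JSON at all: also worth a retry
    return isinstance(body, dict) and "error" in body

page = '{"error": {"code": 504, "message": "Your request has timed out.", "details": []}}'
print(is_esri_error(page))  # → True
```

A check like this, combined with the existing GDAL_HTTP_MAX_RETRY machinery, would be enough to paper over these transient failures.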
While I understand this behavior is not ideal (and could even be described as erroneous), the reality is that Esri FeatureServers are ubiquitous, and I believe the ESRIJSON GDAL driver should know how to handle their error messages. In this case, that'd mean understanding the returned error in the JSON response (or at least retrying when the returned response isn't valid ESRIJSON/GeoJSON).
Also, for context, I work building a web mapping product that uses GDAL to read user-provided data, including Esri FeatureServers. I'm not affiliated with this specific server in any way, nor do I have a contract with Esri.
Do you know if those error messages which are embedded inside the JSON are documented somewhere?
I wrote about making a bug report in the wrong belief that you were using an OGC API Features service. With their own REST service, ESRI is free to use whatever codes they like, as long as the usage matches their own specifications.
I recommend editing the title to describe your wish more accurately; I think there is nothing wrong with GDAL_HTTP_MAX_RETRY. What you want is some kind of page-by-page JSON validation and special handling of error situations.
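Until something like that lands in the driver, an application can approximate the wished-for behaviour itself by fetching pages and retrying on invalid bodies. A minimal sketch follows; get_page, fetch_page, and MAX_RETRIES are hypothetical names, and the linear back-off is just one reasonable choice -- none of this is a GDAL API.

```python
import json
import time

MAX_RETRIES = 5

def get_page(fetch_page, *, delay=1.0):
    """Call fetch_page() until it returns a JSON page with a 'features' member."""
    for attempt in range(MAX_RETRIES):
        text = fetch_page()
        try:
            body = json.loads(text)
        except ValueError:
            body = None  # not JSON: treat like an error page
        if isinstance(body, dict) and "features" in body:
            return body
        time.sleep(delay * (attempt + 1))  # simple linear back-off
    raise RuntimeError("page still failing after %d retries" % MAX_RETRIES)

# Simulated flaky server: an embedded error, then garbage, then a valid page.
responses = iter([
    '{"error": {"code": 504, "message": "Your request has timed out."}}',
    'not json',
    '{"features": []}',
])
print(get_page(lambda: next(responses), delay=0))  # → {'features': []}
```

The same validation could live inside the driver, keyed off the existing GDAL_HTTP_MAX_RETRY and GDAL_HTTP_RETRY_DELAY settings.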