Possible regression of issue 1244
What is the bug?
I believe I am running into a possible regression of https://github.com/OSGeo/gdal/issues/1244 or something similar.
I have ~300 VRT files that point to ~2,000,000 Cloud Optimized GeoTIFFs, all stored on S3 and accessed via /vsis3/. The VRT files are created via gdalbuildvrt and organize the COGs into smaller geographic regions. All VRTs and COGs are in EPSG:4326.
In my application I am using MapScript via Python to render batches of tiles from these VRT files in EPSG:3857 (Spherical Mercator). The mapObj is set to output EPSG:3857 and I create layerObj objects with the VRT files that intersect the requested tiles.
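For reference, the step of picking the VRTs that intersect a requested tile can be sketched as below. This is a minimal sketch, not the application's actual code: it assumes XYZ (not TMS) tile numbering, and `vrt_extents` is a hypothetical index mapping each VRT path to its EPSG:4326 extent.

```python
import math

def tile_bounds_lonlat(z, x, y):
    """Return (west, south, east, north) in degrees for an XYZ Web Mercator tile."""
    n = 2 ** z
    west = x / n * 360.0 - 180.0
    east = (x + 1) / n * 360.0 - 180.0
    # Inverse Mercator projection for the top and bottom tile edges.
    north = math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * y / n))))
    south = math.degrees(math.atan(math.sinh(math.pi * (1 - 2 * (y + 1) / n))))
    return west, south, east, north

def intersects(a, b):
    """True when two (west, south, east, north) boxes overlap with non-zero area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def vrts_for_tile(vrt_extents, z, x, y):
    """Select VRT paths whose extent overlaps the tile.

    vrt_extents is a hypothetical dict: VRT path -> (west, south, east, north).
    """
    tile = tile_bounds_lonlat(z, x, y)
    return [path for path, extent in vrt_extents.items() if intersects(extent, tile)]
```

Each selected path would then back one layerObj on the map.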
This works fine for the most part, but occasionally I will have a tile fail with an error like:
drawGDAL(): Unable to access file. GDALDatasetRasterIO() failed: /vsis3/mybucket/fileXXX.tif, band 1: IReadBlock failed at X offset 38, Y offset 5: TIFFReadEncodedTile() failed.
Most of the time I can just rerun the tile generation process for that tile and it will work on the second run through.
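The manual rerun described above could be automated with a small retry wrapper. This is only a sketch of a workaround that hides the failure rather than fixing it; `render_tile` is a hypothetical callable standing in for whatever function draws a single tile, and the broad `except Exception` is an assumption about how the failure surfaces in Python.

```python
import time

def render_with_retry(render_tile, tile, attempts=3, delay=1.0):
    """Call render_tile(tile), retrying on failure up to `attempts` times.

    render_tile is a hypothetical callable that raises when drawing fails.
    """
    last_error = None
    for attempt in range(1, attempts + 1):
        try:
            return render_tile(tile)
        except Exception as exc:  # assumption: failures surface as exceptions
            last_error = exc
            if attempt < attempts:
                # Back off a little longer before each further attempt.
                time.sleep(delay * attempt)
    raise last_error
```

Logging each failed attempt alongside the tile coordinates would also help show whether the same tiles fail repeatedly.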
I did manage to find one tile that failed on every single run, but there is nothing wrong with the data, and I found that raising GDAL_CACHEMAX to 2000 seemed to fix it for that particular tile (although it's not using anywhere near that much memory to do the rendering). This does not fix it all the time, and I still get occasional tile generation failures.
If I set CPL_CURL_VERBOSE to YES I see no errors coming from CURL; every request succeeds, so this doesn't appear to be a connectivity issue.
As a last-ditch effort I tried creating Spherical Mercator warped VRTs using a command like this:
gdalwarp -of VRT -t_srs EPSG:3857 source.vrt source_mercator.vrt
If I use THOSE VRTs as inputs to MapServer, everything seems to work just fine; I haven't had a tile failure since. This appears to bypass MapServer's warping path and makes GDAL do the warping itself.
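Applying that gdalwarp command to all of the VRTs could be scripted roughly as follows. This is a sketch that assumes the VRTs are local files and that gdalwarp is on the PATH; the `_mercator` suffix mirrors the example command, and the function names are made up for illustration.

```python
import subprocess
from pathlib import Path

def warp_command(src_vrt, suffix="_mercator"):
    """Build the gdalwarp argument list for one VRT (source.vrt -> source_mercator.vrt)."""
    src = Path(src_vrt)
    dst = src.with_name(src.stem + suffix + src.suffix)
    return ["gdalwarp", "-of", "VRT", "-t_srs", "EPSG:3857", str(src), str(dst)]

def warp_all(vrt_paths, dry_run=True):
    """Return the commands for every VRT, and run them unless dry_run is set."""
    commands = [warp_command(path) for path in vrt_paths]
    if not dry_run:
        for cmd in commands:
            # check=True raises CalledProcessError if gdalwarp exits non-zero.
            subprocess.run(cmd, check=True)
    return commands
```

Calling `warp_all(paths)` with the default `dry_run=True` only returns the commands, which makes it easy to inspect them before running anything.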
I do notice that if I set CPL_DEBUG=ON, then when I use the MapServer warping path I see messages like this:
GDAL: Potential thrashing of band 1 of .
I do NOT see that message if I use the Spherical Mercator VRTs I create with gdalwarp.
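To compare the two code paths, the debug output from a run could be captured to a file and scanned for that message. The sketch below assumes the CPL_DEBUG output has already been saved as text lines; the marker string is taken from the message quoted above, and the function name is hypothetical.

```python
from collections import Counter

def count_thrashing(log_lines, marker="Potential thrashing"):
    """Count occurrences of each distinct debug line that contains the marker."""
    counts = Counter()
    for line in log_lines:
        if marker in line:
            counts[line.strip()] += 1
    return counts
```

A non-empty result for the MapServer warping path and an empty one for the pre-warped VRTs would match the behaviour described here.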
I am not explicitly using threading anywhere in my Python code and issue https://github.com/OSGeo/gdal/issues/1244 seems to be related to threading, so I'm not entirely sure if it's the same issue or just something related. However, the description of the problem in 1244 where it randomly fails with those error messages is exactly what I'm seeing. Even if MapServer is using threaded warping under the hood, 1244 should be fixed in 3.8.4 so I am at a loss as to what the real underlying issue would be. I feel it has to be something to do with a cache getting overrun because I saw a notable reduction in the number of failures when I increased the GDAL_CACHEMAX to 2000.
Sorry for the firehose of information, I am hoping that some of this rings a bell to someone and might be able to figure out what the actual issue is that I'm seeing.
Steps to reproduce the issue
The dataset I am having this issue with is a commercial dataset and I cannot provide an easily reproducible example.
Versions and provenance
GDAL 3.8.4 and MapServer 8.0.1 on Ubuntu 24.04, with the following configuration options set:
VSI_CACHE=TRUE
GDAL_HTTP_MERGE_CONSECUTIVE_RANGES=YES
GDAL_HTTP_MULTIPLEX=YES
GDAL_HTTP_VERSION=2
GDAL_DISABLE_READDIR_ON_OPEN=EMPTY_DIR
GDAL_HTTP_MAX_RETRY=2
GDAL_CACHEMAX=2000
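For anyone trying to mirror this setup from Python, the options can be placed in the process environment before MapScript or GDAL is imported. This is a sketch under that assumption (the report does not say how the options are actually set), and `apply_gdal_settings` is a made-up helper name.

```python
import os

# Values copied from the configuration listed in this report.
GDAL_SETTINGS = {
    "VSI_CACHE": "TRUE",
    "GDAL_HTTP_MERGE_CONSECUTIVE_RANGES": "YES",
    "GDAL_HTTP_MULTIPLEX": "YES",
    "GDAL_HTTP_VERSION": "2",
    "GDAL_DISABLE_READDIR_ON_OPEN": "EMPTY_DIR",
    "GDAL_HTTP_MAX_RETRY": "2",
    "GDAL_CACHEMAX": "2000",
}

def apply_gdal_settings(settings=GDAL_SETTINGS, env=os.environ):
    """Copy the settings into env (the process environment by default)."""
    for key, value in settings.items():
        env[key] = value
    return env

# Assumption: call apply_gdal_settings() before importing mapscript so the
# options are in place when GDAL first reads its configuration.
```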
Additional context
No response
An update on this: I am still occasionally seeing the same "IReadBlock failed at X offset XXX, Y offset XXX: TIFFReadEncodedTile() failed" error when I am doing big tiling jobs. Any recommendations on where to look or how to debug this error would be appreciated.
Any recommendations on where to look or how to debug this error would be appreciated.
This could be in a lot of places. I'm afraid I have no suggestions, apart from providing a reproducer for local debugging.
Have you tried a newer version of GDAL @jasonbeverage ? It reminds me of #11552, but we were seeing somewhat different error messages there.
Thanks for the tip @pedros007. I looked at our configuration more closely and we are specifying GDAL_NUM_THREADS=1. I would think that GDAL_NUM_THREADS=1 means don't use any threads, though, so I am not sure if that is related or not.