Extended hang in GDALDataset destructor after (read-only) RasterIO call on GTiff (Windows specific)
What is the bug?
On Windows only (i.e. not on Linux), I am seeing an extended hang of 15-20 seconds when freeing a GDALDataset object that was created by opening a large (~2 GB) GeoTIFF file with the GA_ReadOnly flag.
Curiously, I see this for some files on both of the machines where I am testing, but for other files I see it on only one of the two machines (same GDAL version, same bit-for-bit file). Hence I suspect that something like an uninitialized variable may be influencing the behavior.
We have observed this issue both with the most recently released GDAL 3.8.4 and with the GDAL version we were using prior to upgrading (2.2.1).
Steps to reproduce the issue
I've written a minimal test program, pasted below, that uses the GDAL API to replicate the hang. When I do not call GDALRasterBand::RasterIO(), the hang does not happen, so perhaps it has something to do with a "dirty" flag.
When run against a file that doesn't exhibit the hanging behavior at close time, the output looks something like this:
register: dt = 2 ms
open: dt = 0 ms
get band: dt = 0 ms
alloc: dt = 0 ms
read: dt = 589 ms
process: dt = 24 ms
dealloc: dt = 16 ms
destroy: dt = 34 ms
When run against a file that does exhibit the hang at close, the output looks something like this:
register: dt = 2 ms
open: dt = 0 ms
get band: dt = 0 ms
alloc: dt = 0 ms
read: dt = 912 ms
process: dt = 21 ms
dealloc: dt = 14 ms
destroy: dt = 16951 ms
Test program "testme.cpp" --
// Compile with flags /EHsc /std:c++20 /W4 /permissive-
#include "gdal_priv.h"
#include <cassert>
#include <chrono>
#include <cstdio>
static auto now()
{
    return std::chrono::high_resolution_clock::now();
}

static void report_time(const char * label, auto & t1)
{
    auto t2 = now();
    auto delta = std::chrono::duration_cast<std::chrono::milliseconds>(t2 - t1);
    fprintf(stderr, "%s: dt = %lld ms\n", label, (long long)delta.count());
    t1 = now();
}

int main(int argc, char ** argv)
{
    unsigned char c = 255;
    auto t1 = now();

    GDALAllRegister();
    // Forbid generation of ".aux.xml" files
    CPLSetConfigOption("GDAL_PAM_ENABLED", "NO");
    report_time("register", t1);

    std::unique_ptr<GDALDataset>
        img{static_cast<GDALDataset*>(GDALOpen(argv[1], GA_ReadOnly))};
    report_time("open", t1);

    assert(img && img->GetRasterCount() > 0);
    auto band = img->GetRasterBand(1);
    assert(band->GetRasterDataType() == GDT_Byte);
    int x = band->GetXSize(), y = band->GetYSize();
    assert(x > 0 && y > 0);
    report_time("get band", t1);

    size_t n = size_t(x) * size_t(y);
    auto p = new unsigned char [n];
    report_time("alloc", t1);

#ifndef NO_RASTER_IO
    auto stat = band->RasterIO(GF_Read, 0, 0, x, y, p, x, y, GDT_Byte, 1, x);
    assert(stat == CE_None);
    report_time("read", t1);

    // Do something with the pixels so the optimizer doesn't skip anything
    for (size_t i = 0; i < n; ++i)
        c ^= p[i];
    report_time("process", t1);
#endif

    delete [] p;
    report_time("dealloc", t1);

    img.reset();
    report_time("destroy", t1);

    return n % (int(c) + 1);
}
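For completeness, a compile command along the following lines should work; the include and library paths are placeholders for wherever GDAL is installed, and the import library may be named gdal_i.lib or gdal.lib depending on how GDAL was built:
cl /EHsc /std:c++20 /W4 /permissive- testme.cpp /I C:\path\to\gdal\include C:\path\to\gdal\lib\gdal_i.lib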
To run the test program, pass it one argument: the path to a large GeoTIFF file with 8-bit bands. E.g.:
testme.exe 2G.tif
Here is one possibility for obtaining such a file. Take note that I've been able to reproduce the hang with this file on only one of two machines I tested, so you might not immediately observe it.
- Unzip the above to obtain an XML file for a WMS
- Run: gdal_translate -projwin 100000 480147 100147 480000 -outsize 22400 22400 dutch_wmts.xml 2G.tif
Versions and provenance
- GDAL version: GDAL 3.8.4, released 2024/02/08 (Windows 10)
- Self-built binaries using Visual C++ (cl.exe 19.29.30151)
- Dependent library versions:
- CURL: 8.4.0
- ZLIB: 1.3.1
- PROJ: 9.3.1
- libgeotiff: 1.7.1
- libtiff: 4.6.0
Additional context
Thank you in advance for any help you can provide! If there is a workaround we could use in our own code that doesn't involve editing GDAL code, that would also be most welcome information.
After some quality time in the debugger, it appears to me that what is happening in the hang-on-exit case is that, for each GDALRasterBand of the GTiffDataset, the code enters GDALArrayBandBlockCache::FlushCache(void), which then iterates over all the blocks per column and blocks per row, calling FlushBlock() for each one. This is the call stack at that point:
gdald.dll!GDALArrayBandBlockCache::FlushCache() Line 275 C++
gdald.dll!GDALRasterBand::FlushCache(bool bAtClosing) Line 1123 C++
gdald.dll!GDALDataset::FlushCache(bool bAtClosing) Line 569 C++
gdald.dll!GDALPamDataset::FlushCache(bool bAtClosing) Line 188 C++
gdald.dll!GTiffDataset::FlushCacheInternal(bool bAtClosing, bool bFlushDirectory) Line 2104 C++
gdald.dll!GTiffDataset::Finalize() Line 233 C++
gdald.dll!GTiffDataset::Close() Line 153 C++
gdald.dll!GTiffDataset::~GTiffDataset() Line 141 C++
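In other words, as far as I can tell, the top frame spends its time in a loop shaped roughly like the following. This is a simplified, self-contained paraphrase for illustration only; the names and numbers only approximate the real GDALArrayBandBlockCache code:
// Simplified paraphrase of the per-band block cache flush (illustration only;
// not actual GDAL source).
static void FlushBlockStub(int /* nXBlockOff */, int /* nYBlockOff */)
{
    // In real GDAL this drops the cached GDALRasterBlock and frees its pixel
    // buffer (ultimately via VSIFreeAligned()). For a read-only dataset nothing
    // is written to disk, but every cached block still gets freed one by one.
}

static void FlushCacheParaphrase(int nBlocksPerColumn, int nBlocksPerRow)
{
    for (int iY = 0; iY < nBlocksPerColumn; ++iY)
        for (int iX = 0; iX < nBlocksPerRow; ++iX)
            FlushBlockStub(iX, iY);
}

int main()
{
    // Example numbers only: a hypothetical strip-organized band with 22400
    // block rows and 1 block column means 22400 FlushBlock() calls per band.
    FlushCacheParaphrase(22400, 1);
    return 0;
}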
I would have thought that this cache flush shouldn't need to occur from within the destructor, since the dataset was opened read-only (and the code hasn't tried to open any on-disk file for writing). Please let me know if there's any more info I can provide.
I don't know if this helps any further, but whenever I randomly break during the apparent hang, the point of execution as reported by Visual Studio is inside VSIFreeAligned() (which on Windows does nothing more than call _aligned_free()).
> Take note that I've been able to reproduce the hang with this file on only one of two machines I tested, so you might not immediately observe it.
This might depend on the amount of RAM the machines have, which influences the default size of the GDAL block cache; that size can be set with GDAL_CACHEMAX (https://gdal.org/user/configoptions.html#performance-and-caching). Do you observe this on release builds of GDAL? I wouldn't expect freeing just 2 GB of RAM to take 16 seconds...
Yes, I'm seeing the same slow-down on close (for input TIFF files and machines where it happens) in both debug and release builds of GDAL.
> I wouldn't expect freeing just 2 GB of RAM to take 16 seconds...
I agree, which is why I found this so odd! But I haven't been able to pin down what is different between input file/machine combinations where I see this problem, versus those where I don't. What would you recommend I try setting GDAL_CACHEMAX to, to influence this behavior?
> What would you recommend I try setting GDAL_CACHEMAX to, to influence this behavior?
First, run your code with CPL_DEBUG=ON and look at the "GDAL: GDAL_CACHEMAX = xxxx MB" trace in the slow and fast cases, and start experimenting from that information.
If your GeoTIFF is uncompressed, you may also try setting the GTIFF_DIRECT_IO=YES configuration option (see https://gdal.org/drivers/raster/gtiff.html#configuration-options).
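For reference, all of these can also be set from code rather than through the environment; a minimal sketch of lines one could add near the top of main() in the test program above, before the dataset is opened (the values are just examples to experiment with):
// Must run before the block cache is first used (i.e. before the RasterIO() call).
CPLSetConfigOption("CPL_DEBUG", "ON");         // prints the "GDAL_CACHEMAX = xxx MB" trace
CPLSetConfigOption("GDAL_CACHEMAX", "256");    // block cache limit; small plain numbers mean MB, "2%"-style values also work
CPLSetConfigOption("GTIFF_DIRECT_IO", "YES");  // bypasses the block cache, uncompressed GeoTIFFs only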
If I set GDAL_CACHEMAX=2%, I do not see the problem with either of my large test TIFF files, whereas if I set it to 10%, I do see the issue with both of them. (This is on a laptop with 32 GB of RAM.) So this gives me some hope that you might be able to reproduce locally by tuning that setting, either with the TIFF file I described how to obtain above or perhaps with any large GeoTIFF.
On the other machine, a desktop with 128 GB RAM, I have to reduce it further and set GDAL_CACHEMAX=1% in order not to observe the slowdown-on-close.
With CPL_DEBUG=ON, it looks to me like the slowdown-on-exit happens exactly in the cases when GDAL_CACHEMAX is large enough that this message does NOT appear: "GTiff: Disable aggressive band caching. Cache not big enough. At least 1702774400 bytes necessary"
With GTIFF_DIRECT_IO=YES I do not see the slowdown-on-exit at all, regardless of the GDAL_CACHEMAX setting. But we'd like to support compressed as well as uncompressed TIFFs, so I don't see this setting as something we can feasibly use ourselves.
So to sum up, to observe the slowdown:
- The TIFF file has to be large enough (500 MB+ uncompressed?) that the slowdown-on-exit is clearly a problem
- The GDAL_CACHEMAX setting (or default) has to be large enough that the uncompressed TIFF fits in cache
- The GTIFF_DIRECT_IO setting should not be ON
- (As mentioned in the original report) A call to GDALRasterBand::RasterIO() has been made
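Given the above, the only application-side mitigation I can think of so far would be to cap the block cache before reading large read-only GeoTIFFs. This is just a sketch of what we might do in our own code (the 256 MB cap is an arbitrary example, and I realize it trades away some re-read performance):
// Cap the GDAL block cache well below the size of the rasters we read, so the
// dataset destructor only has a bounded number of cached blocks to free.
// GDALSetCacheMax64() takes bytes and is declared in gdal.h (pulled in by gdal_priv.h).
GDALSetCacheMax64(static_cast<GIntBig>(256) * 1024 * 1024);  // 256 MB (example value)

// Alternatively, the equivalent via the configuration option, which also
// accepts a percentage of total RAM:
// CPLSetConfigOption("GDAL_CACHEMAX", "2%");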
Thanks again, and please let me know if you'd like any more info!
I've investigated that, and it is indeed a Windows-specific issue that I could reproduce with the following toy code:
#include <windows.h>
#include <stdlib.h>
#include <stdio.h>
#include <assert.h>
HANDLE heap = NULL;
int flag = 0;
int SINGLE_ALLOC_SIZE = 21200;
int NUMBER_OF_ALLOCS = 21200 * 4;
class CMyClass
{
public:
    CMyClass()
    {
        // lpData = malloc(SINGLE_ALLOC_SIZE);
        // lpData = _aligned_malloc(SINGLE_ALLOC_SIZE, 64);
        lpData = HeapAlloc(heap, flag, SINGLE_ALLOC_SIZE);
        // lpData = new char[SINGLE_ALLOC_SIZE];
        assert(lpData);
    }

    ~CMyClass()
    {
        // free(lpData);
        // _aligned_free(lpData);
        HeapFree(heap, flag, lpData);
        // delete[] lpData;
    }

public:
    void* lpData;
};

int main()
{
    do
    {
#ifdef FAST
        size_t max_heap_size = size_t(2) * 1000 * 1000 * 1000;
#else
        size_t max_heap_size = 0;
#endif
        heap = HeapCreate(flag, flag, max_heap_size);
        printf("start\n");
        CMyClass *lpList = new CMyClass[NUMBER_OF_ALLOCS];
        printf("after alloc. starting freeing\n");
        delete [] lpList;
        printf("call HeapDestroy\n");
        HeapDestroy(heap);
        printf("end\n");
    }
    while(1);
    return 0;
}
So I've tried regular malloc()/free(), _aligned_malloc()/_aligned_free(), and new[]/delete[], and all show the same issue: freeing a large number of allocations is extremely slow. Using a private heap, I've managed to get decent performance, but only when setting a non-zero maximum heap size at HeapCreate() time. However, while we might know that maximum from GDAL_CACHEMAX, this still consumes the corresponding virtual memory, which might not be desirable.
Another finding: the issue only appears if SINGLE_ALLOC_SIZE is strictly greater than 4 * 4096 bytes (i.e. 16 KB, which happens to be the documented upper size limit for allocations served by the Windows low-fragmentation heap, so larger allocations presumably bypass the LFH).
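For anyone wanting to try this locally, the toy program needs nothing beyond MSVC. Assuming it is saved as heaptest.cpp, the two variants can be built for example with:
cl /EHsc /O2 heaptest.cpp /Fe:heaptest_slow.exe
cl /EHsc /O2 /DFAST heaptest.cpp /Fe:heaptest_fast.exe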