gdal icon indicating copy to clipboard operation
gdal copied to clipboard

[Windows] using stat (_wstat64) to check for file existence is very slow.

Open VirtualTim opened this issue 5 years ago • 7 comments

Profiling our usage of gdal on Windows revealed that the vast majority of time spent when opening a dataset using gdal is spent in checking for the file's existence. This is done using _wstat64. For some reason this is very slow. Basically GDALFindAssociatedFile calls VSIStatExL (potentially twice) to check for the existence of a file. Because of the structure of the GDALMDReaderManager::GetReader function multiple readers are initialised, when can result in dozens of _wstat64 calls.

As a suggestion, it may be good to implement a VSIFilesystemHandler::PathFileExists function, which on Windows calls PathFileExistsW (or GetFileAttributesW), and other platforms keeps the existing stat behaviour. This would also mean that it would be easy to swap that call to std::filesystem::exists, once gdal moves to c++17.

Alternatively GDALMDReaderManager::GetReader could check for the existence of the file before initialising the readers, removing the need for GDALFindAssociatedFile to check for existence. But I don't know if other things call that function.

Expected behavior and actual behavior.

This should not be as slow as it is. I know this may sound petty, but ~80% of the time spent opening a file was due to these _wstat64 calls.

I've attached a profiling flame char below, generated from opening ~2k JP2 files in a loop. I do realise that the dataset is being opened/closed twice, but that's an issue on my side. I realise the quality isn't great, but if this isn't enough to go on let me know.

Profiling Flame Chart

flame

Steps to reproduce the problem.

Open many files in a loop.

Operating system

Windows

GDAL version and provenance

Profiled on 2.3.0, but the code in master looks unchanged.

VirtualTim avatar Nov 03 '20 05:11 VirtualTim

As a suggestion, it may be good to implement a VSIFilesystemHandler::PathFileExists function, which on Windows calls PathFileExistsW, and other platforms keeps the existing stat behaviour

I don't think we'd need a new virtual method. GDALFindAssociatedFile() calls VSIStatExL() with just the VSI_STAT_EXISTS_FLAG flag set. The Win32 implementation could use that hint to just call PathFileExistsW(). And normally GDALMDReaderManager::GetReader() should be provided with the list of sibling files, so perhaps you have disabled this on your end with GDAL_DISABLE_READDIR_ON_OPEN ?

rouault avatar Nov 03 '20 09:11 rouault

The Win32 implementation could use that hint to just call PathFileExistsW().

That sounds like a good idea.

I am using GDAL_DISABLE_READDIR_ON_OPEN = "YES", as that had a pretty big performance impact when adding a single file in a folder containing many files.

Edit: Just did a test with GDAL_DISABLE_READDIR_ON_OPEN = "NO", and the time to bulk open files increased by ~5-30% (depending on filetype).

VirtualTim avatar Nov 04 '20 01:11 VirtualTim

I did some quick profiling between _wstat64, GetFileAttributesW, and PathFileExistsW. GetFileAttributesW, and PathFileExistsW performed the same (within margin of error), which was about 2x-3x faster than _wstat64. PathFileExistsW requires linking against shlwapi.lib, where GetFileAttributesW just requires including fileapi.h.

Source code (excuse the verbosity, I was trying to avoid unnecessary branching)
#include <Windows.h>
#include <fileapi.h>	//GetFileAttributesW
#include <Shlwapi.h>	//PathFileExistsW (required linking to shlwapi.lib)

#include <iostream>
#include <filesystem>

bool testStat(const wchar_t* pwszFilename) {
	struct __stat64 pStatBuf;
	int nResult = _wstat64(pwszFilename, &pStatBuf);
	return nResult == 0;
}

bool testAttr(const wchar_t* pwszFilename) {
	DWORD attributes = GetFileAttributesW(pwszFilename);
	return attributes != INVALID_FILE_ATTRIBUTES && !(attributes & FILE_ATTRIBUTE_DIRECTORY);
}

bool testExists(const wchar_t* pwszFilename) {
	return TRUE == PathFileExistsW(pwszFilename);
}


int wmain(int argc, wchar_t* argv[], wchar_t* envp[]) {
	if (argc != 3) {
		std::cout << R"(Usage: exe stat|attr|exists "path\to\folder")" << std::endl;
		exit(0);
	}

	std::wstring testCase = argv[1];
	int filesCounted = 0;

	if (testCase == L"stat") {
		for (const auto& dirEntry : std::filesystem::recursive_directory_iterator(argv[2])) {
			std::wstring fFilename = dirEntry.path().wstring();
			if (testStat(fFilename.c_str()))
				filesCounted++;
		}
		std::cout << "counted " << filesCounted << " files." << std::endl;
		return 0;
	}
	
	if (testCase == L"attr") {
		for (const auto& dirEntry : std::filesystem::recursive_directory_iterator(argv[2])) {
			std::wstring fFilename = dirEntry.path().wstring();
			if (testAttr(fFilename.c_str()))
				filesCounted++;
		}
		std::cout << "counted " << filesCounted << " files." << std::endl;
		return 0;
	}
	
	if (testCase == L"exists") {
		for (const auto& dirEntry : std::filesystem::recursive_directory_iterator(argv[2])) {
			std::wstring fFilename = dirEntry.path().wstring();
			if (testExists(fFilename.c_str()))
				filesCounted++;
		}
		std::cout << "counted " << filesCounted << " files." << std::endl;
		return 0;
	}
	
	return -1;
}

I haven't tested this change inside gdal, as that would be a lot more effort to get that building.

VirtualTim avatar Nov 05 '20 05:11 VirtualTim

Just noticed there's already a CPLCheckForFile (which calls _stat internally on windows), so maybe that should be used instead of VSIStatExL to check for existence. And CPLCheckForFile could be modified to use platform-specific APIs.

VirtualTim avatar Nov 10 '20 08:11 VirtualTim

Just noticed there's already a CPLCheckForFile (which calls _stat internally on windows), so maybe that should be used instead of VSIStatExL to check for existence. And CPLCheckForFile could be modified to use platform-specific APIs.

CPLCheckForFile is a higher-level API (supporting virtual file systems and siblings lists). It uses VSIStatL when it should actually use VSIStatExL with just the VSI_STAT_EXISTS_FLAG flag set, in order to check for file existence cheaply. So @rouault's point stands: The Win32 implementation of VSIStatExL() needs to be tuned.

dg0yt avatar Dec 17 '20 06:12 dg0yt

@VirtualTim Can you check https://github.com/OSGeo/gdal/pull/9256 works as expected?

rouault avatar Feb 20 '24 15:02 rouault

@rouault I am currently on gdal 3.3, so it would not be easy for me to test right now. I'll make a note of this to test when I next work on our GDAL code.

VirtualTim avatar Feb 21 '24 04:02 VirtualTim