How one can cache Dataset

Open REASY opened this issue 2 years ago • 10 comments

Hello, team,

I have a tile server, implemented as an HTTP server using gdal-rs, that serves slippy map tiles. The actual rasters are partitioned into many Cloud Optimized GeoTIFF (COG) files with overviews. At a high level, I extract the tile information from a request path that looks like /:prefix/:layer/:z/:x/:y and map it to an overview and an offset to read from the COG. My COG files are stored in S3 and I access them through vsis3. At the beginning of each request I open a Dataset; at the end it is implicitly closed by drop. Interestingly, if I query the same slippy tile twice, only the first request has high latency; the second one is much faster (is it because of the VSI cache?):

2023-07-21T02:13:12.185123Z  INFO tokio-runtime-worker ThreadId(03) qartez_slippy_server::routes: src/routes.rs:169: Read and prepared a tile for .../20/179207/418903.png from /vsis3/.../color/geotiff/5600_13090.tif in 475 ms
2023-07-21T02:13:42.265197Z  INFO tokio-runtime-worker ThreadId(02) qartez_slippy_server::routes: src/routes.rs:169: Read and prepared a tile for .../20/179207/418903.png from /vsis3/.../color/geotiff/5600_13090.tif in 3 ms
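
For readers unfamiliar with the /:z/:x/:y scheme, here is a minimal Rust sketch of the tile math. It assumes standard XYZ tiles in Web Mercator (EPSG:3857); the names `tile_bounds`, `tile_resolution` and `pick_level`, and the overview-selection rule, are illustrative assumptions and are not taken from the server described in this post.

```rust
use std::f64::consts::PI;

/// Half the side of the Web Mercator (EPSG:3857) square, in metres.
const ORIGIN_SHIFT: f64 = PI * 6378137.0;

/// Bounds of an XYZ tile in EPSG:3857 as (min_x, min_y, max_x, max_y).
/// Tile row 0 is the northernmost row.
fn tile_bounds(z: u32, x: u32, y: u32) -> (f64, f64, f64, f64) {
    let tiles = (1u64 << z) as f64; // tiles per axis at this zoom
    let size = 2.0 * ORIGIN_SHIFT / tiles; // tile side in metres
    let min_x = -ORIGIN_SHIFT + x as f64 * size;
    let max_y = ORIGIN_SHIFT - y as f64 * size;
    (min_x, max_y - size, min_x + size, max_y)
}

/// Ground resolution (metres per pixel, at the equator) of a tile
/// rendered at `tile_px` pixels per side.
fn tile_resolution(z: u32, tile_px: u32) -> f64 {
    2.0 * ORIGIN_SHIFT / ((1u64 << z) as f64 * tile_px as f64)
}

/// Pick the coarsest level that is still at least as fine as `wanted`.
/// `resolutions` lists metres per pixel, index 0 being full resolution,
/// sorted from fine to coarse. Falls back to level 0 if none qualifies.
fn pick_level(resolutions: &[f64], wanted: f64) -> usize {
    resolutions.iter().rposition(|&r| r <= wanted).unwrap_or(0)
}
```

With 256-pixel tiles, `tile_resolution(20, 256)` is roughly 0.15 m per pixel. The bounds from `tile_bounds` would then be converted to a pixel window through the raster's geotransform, which is omitted here.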

Does it make sense in such a scenario to cache the C descriptor of the Dataset and reuse it? Or should VSI_CACHE_SIZE together with GDAL_CACHEMAX be enough?

Thank you.

REASY, Jul 21 '23 02:07

Yeah, it's a bit unfortunate. GDAL doesn't allow you to read from a Dataset from multiple threads at once, even though cURL could probably support it just fine.

So I think your options are to either:

  • open and close the dataset on each read, which will incur a good bit of overhead (the TLS handshake and reading the IFDs, I guess)
  • have a thread or pool of threads where each opens the file, gets a read request from a channel, does the actual read, sends the results back, then loops; this should work pretty well, but you'll be storing duplicate data in the GDAL cache
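
The second option can be sketched with the standard library alone. This is only an illustration: `FakeDataset` is a hypothetical stand-in for `gdal::Dataset`, and `TileRequest`, `spawn_worker` and `request_tile` are made-up names, not part of gdal-rs.

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical stand-in for `gdal::Dataset`, so the sketch has no dependencies.
struct FakeDataset {
    path: String,
}

impl FakeDataset {
    fn open(path: &str) -> Self {
        FakeDataset { path: path.to_owned() }
    }

    /// Pretends to read one tile; a real worker would read a raster window here.
    fn read_tile(&self, z: u32, x: u32, y: u32) -> Vec<u8> {
        format!("{}:{}/{}/{}", self.path, z, x, y).into_bytes()
    }
}

/// A read request plus the channel on which the worker sends the result back.
struct TileRequest {
    z: u32,
    x: u32,
    y: u32,
    reply: mpsc::Sender<Vec<u8>>,
}

/// Spawns a worker that opens the dataset once and keeps it for its lifetime.
/// The worker loop ends when every `Sender<TileRequest>` has been dropped.
fn spawn_worker(path: String) -> (mpsc::Sender<TileRequest>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<TileRequest>();
    let handle = thread::spawn(move || {
        let ds = FakeDataset::open(&path); // opened on the worker thread
        for req in rx {
            let tile = ds.read_tile(req.z, req.x, req.y);
            // The requester may have gone away; ignore a closed reply channel.
            let _ = req.reply.send(tile);
        }
    });
    (tx, handle)
}

/// Sends one request and blocks until the worker answers.
/// Returns `None` if the worker has shut down.
fn request_tile(tx: &mpsc::Sender<TileRequest>, z: u32, x: u32, y: u32) -> Option<Vec<u8>> {
    let (reply, answer) = mpsc::channel();
    tx.send(TileRequest { z, x, y, reply }).ok()?;
    answer.recv().ok()
}
```

A pool would call `spawn_worker` several times for the same path and distribute requests among the senders; each worker then holds its own open dataset, which is where the duplicated cache data comes from.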

I should probably ask on the mailing list for clarification, though.

lnicola, Jul 21 '23 06:07

Starting with GDAL 3.6.0, if the GDAL_NUM_THREADS config option is set, reading from a TIFF/COG file a window of interest that intersects multiple tiles at once will use multithreaded decompression (cf. https://github.com/OSGeo/gdal/blob/v3.6.0/NEWS.md), and in GDAL 3.7.0 this was further improved to trigger parallel network requests.

rouault, Aug 19 '23 20:08

I don't think multi-threaded decoding helps in this case (a tile server), since each request will read a single block if everything is set up properly. But we can't have everything just yet :⁠-⁠).

lnicola, Aug 20 '23 07:08

@REASY

Not sure if this could be considered canonical or even acceptable (YMMV), but we have a production tile server written in Axum + georust/gdal and have been caching without problems using this (GdalPath is an internal type which basically combines a GDAL VSI path + band specifiers):

use crate::raster::GdalPath;
use crate::Error;
use gdal::Dataset;
use moka::sync::Cache;
use once_cell::sync::Lazy;
use std::ops::Deref;
use std::sync::{Arc, Mutex};
use std::time::Duration;

pub(crate) struct DatasetCache(Cache<GdalPath, Arc<Mutex<Dataset>>>);

static INSTANCE: Lazy<DatasetCache> = Lazy::new(DatasetCache::new);

impl DatasetCache {
    fn new() -> Self {
        Self(
            Cache::builder()
                .time_to_idle(Duration::from_secs(3600))
                .max_capacity(5)
                .build(),
        )
    }
    pub(crate) fn dataset_for(path: &GdalPath) -> crate::Result<Arc<Mutex<Dataset>>> {
        let ds = INSTANCE.0.try_get_with(path.clone(), || {
            // `crate::Result` is the crate-local alias; it is not imported above.
            let ds: crate::Result<Dataset> = path.open();
            ds.map(|d| Arc::new(Mutex::new(d)))
                .map_err(|e| e.to_string())
        });
        ds.map_err(|e| Error::Unexpected(e.deref().clone()))
    }
}

metasim, Aug 21 '23 17:08

Isn't the problem that Datasets are not Send? You can add Mutexes around them so that they are Sync, but you cannot enforce Send.

There are shared datasets in GDAL, but we haven't implemented them since they cannot simply be used with all the stuff currently implemented for a dataset.

We have done the thread + channel thing that @lnicola mentioned :laughing: .

EDIT: Was wrong, they are Send but subtypes like bands aren't. So for datasets, you are ready to go.

ChristianBeilschmidt, Jan 30 '24 07:01

Yeah, IIRC shared datasets are actually the opposite of the "open the file multiple times" trick. Instead, you (probably) get a mutex around each access, but end up with better cache utilization.

> In the beginning of request I open Dataset, in the end it is implicitly closed because of drop.

You can stick them in an Arc<Mutex<HashMap>> or something, of course. They don't have to disappear at the end of the scope.
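
A minimal sketch of that idea using only the standard library. `FakeDataset` is a hypothetical stand-in for `gdal::Dataset`, and `DatasetCache` and `get_or_open` are illustrative names, not gdal-rs API.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Hypothetical stand-in for `gdal::Dataset`.
struct FakeDataset {
    path: String,
}

impl FakeDataset {
    fn open(path: &str) -> Self {
        FakeDataset { path: path.to_owned() }
    }
}

/// A dataset handle that can be shared; the inner mutex serialises reads.
type SharedDataset = Arc<Mutex<FakeDataset>>;

/// Keeps datasets alive across requests instead of dropping them at end of scope.
#[derive(Default)]
struct DatasetCache {
    inner: Mutex<HashMap<String, SharedDataset>>,
}

impl DatasetCache {
    /// Returns the cached handle for `path`, opening the dataset on first use.
    fn get_or_open(&self, path: &str) -> SharedDataset {
        let mut map = self.inner.lock().unwrap();
        map.entry(path.to_owned())
            .or_insert_with(|| Arc::new(Mutex::new(FakeDataset::open(path))))
            .clone()
    }
}
```

Two caveats of this sketch: the map lock is held while a dataset is being opened, so a slow remote open blocks other lookups, and nothing is ever evicted. A bounded cache with expiry avoids the second problem.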

lnicola, Jan 30 '24 08:01

> Yeah, IIRC shared datasets are actually the opposite of the "open the file multiple times" trick. Instead, you (probably) get a mutex around each access, but end up with better cache utilization.

No, you don't. You just get the same dataset if you call GDALOpenShared() from the same thread that opened the initial one; otherwise you'll get a different instance.

rouault, Jan 30 '24 11:01

Oh, right. Well that's an argument for Dataset not being Send, because otherwise you can open a shared one twice and pass it to a different thread, which is bad.

lnicola, Jan 30 '24 11:01

You can't call GDALOpenShared with this library at the moment. This is why we can say that Dataset: Send.

There would need to be a second type of Dataset, e.g. SharedDataset, which would call GDALOpenShared under the hood but would then not be Send.
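
As a rough illustration of how such a type could opt out of Send, here is a sketch that uses a raw-pointer marker. Both types are hypothetical stand-ins; nothing here is actual gdal-rs code, and `open_shared` does not call GDAL.

```rust
use std::marker::PhantomData;

/// Hypothetical stand-in for the regular dataset type, which is `Send`.
struct FakeDataset {
    path: String,
}

/// Hypothetical shared variant. The `PhantomData<*mut ()>` field makes the
/// type neither `Send` nor `Sync`, so it cannot leave the thread that opened it.
struct SharedDataset {
    path: String,
    _not_send: PhantomData<*mut ()>,
}

impl SharedDataset {
    /// Illustrative constructor; a real one would call GDALOpenShared.
    fn open_shared(path: &str) -> Self {
        SharedDataset {
            path: path.to_owned(),
            _not_send: PhantomData,
        }
    }

    fn path(&self) -> &str {
        &self.path
    }
}

/// Compiles only when `T` is `Send`.
fn assert_send<T: Send>() {}

fn check_markers() {
    assert_send::<FakeDataset>();
    // assert_send::<SharedDataset>(); // rejected: `*mut ()` is not `Send`
}
```

Uncommenting the second call in `check_markers` produces a compile error, which is the property a shared variant would need.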

ChristianBeilschmidt, Feb 04 '24 08:02

You're right, there's even a note in the docs:

> Note that the GDAL_OF_SHARED option is removed from the set of allowed option because it subverts the Send implementation that allow passing the dataset the another thread. See https://github.com/georust/gdal/issues/154.

lnicola, Feb 04 '24 08:02