
GDAL Virtual Rasters

TomNicholas opened this issue 1 year ago • 31 comments

From https://docs.csc.fi/support/tutorials/gis/virtual-rasters/ (emphasis mine):

Virtual rasters are a useful GDAL concept for managing large raster datasets that are split into non-overlapping map sheets. Virtual rasters are not useful for managing time series or overlapping rasters, for example remote sensing tiles.

Technically a virtual raster is just a small XML file that tells GDAL where the actual data files are, but from the user's point of view virtual rasters can be treated much like any other raster format. Virtual rasters can include raster data in any file format GDAL supports. Virtual rasters are useful because they allow handling of large datasets as if they were a single file, eliminating the need to locate the correct files.

It is possible to use virtual rasters so that only the small XML file is stored locally and the big raster files are in Allas, Amazon S3, publicly on a server, or any other place supported by GDAL virtual drivers. Data is moved locally only for the area and zoom level requested when the virtual raster is opened. The best-performing format for saving raster data in a remote service is Cloud Optimized GeoTIFF, but other formats are also possible.

That sounds a lot like a set of reference files, doesn't it? Maybe we could ingest those virtual raster files and turn them into chunk manifests, like we're doing with DMR++ in #113?

Also, we can definitely open Cloud Optimized GeoTIFFs now (since #162).
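
For illustration, a minimal sketch of doing that (the local COG path is hypothetical, and the exact reader selection may differ by virtualizarr version):

from virtualizarr import open_virtual_dataset

# Hypothetical local COG; produces an xarray.Dataset of ManifestArrays
# that reference byte ranges instead of loading pixel data
vds = open_virtual_dataset("example_cog.tif")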

Thanks to @scottyhq for mentioning this idea. Maybe he, @abarciauskas-bgse, or someone else who knows more about GDAL can say whether they think this idea might actually work.

TomNicholas avatar Jun 29 '24 19:06 TomNicholas

I'm not an expert on VRTs, but I think it could work. It could potentially be useful if you want to create a dataset from rasters which overlap, and the VRT represents an already-deduplicated version of the data (assuming the deduplication logic is appropriate). Mostly, I'm not sure how useful this functionality would be, because I'm not familiar with VRTs that are made publicly available or published for general use. I have heard of VRTs being used for on-the-fly definition of mosaics.

I am also going to tag my colleagues @wildintellect and @vincentsarago who have more experience with VRTs than I do and may be able to think of reasons this may or may not work.

abarciauskas-bgse avatar Jun 29 '24 23:06 abarciauskas-bgse

@abarciauskas-bgse converting a VRT to a reference file for Zarr seems fine. I'm not sure the VRT would contain all the chunk information you need, so the source files may also need to be scanned. At that point it's not super different from just being given a list of files to include in a manifest.

Example:

<VRTDataset rasterXSize="512" rasterYSize="512">
    <GeoTransform>440720.0, 60.0, 0.0, 3751320.0, 0.0, -60.0</GeoTransform>
    <VRTRasterBand dataType="Byte" band="1">
        <ColorInterp>Gray</ColorInterp>
        <SimpleSource>
            <SourceFilename relativeToVRT="1">utm.tif</SourceFilename>
            <SourceBand>1</SourceBand>
            <SrcRect xOff="0" yOff="0" xSize="512" ySize="512"/>
            <DstRect xOff="0" yOff="0" xSize="512" ySize="512"/>
        </SimpleSource>
    </VRTRasterBand>
</VRTDataset>
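
A minimal sketch of scraping those fields with Python's standard library (the VRT path is hypothetical; note the VRT gives you filenames and pixel rectangles, not byte ranges, so the sources would still need scanning):

import xml.etree.ElementTree as ET

tree = ET.parse("example.vrt")  # hypothetical path to a VRT like the one above
for band in tree.getroot().iter("VRTRasterBand"):
    for src in band.iter("SimpleSource"):
        filename = src.findtext("SourceFilename")
        rect = src.find("SrcRect").attrib  # pixel window only, no byte offsets
        print(filename, rect)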

Fun, I didn't know about https://gdal.org/drivers/raster/vrt_multidimensional.html; not sure I've ever seen one of these.

To be clear, a VRT does not de-duplicate anything. When using a VRT with GDAL:

If there is some amount of spatial overlapping between files, the order of files appearing in the list of sources matters: files that are listed at the end are the ones from which the content will be fetched.

https://gdal.org/programs/gdalbuildvrt.html
https://gdal.org/drivers/raster/vrt.html

So it's up to you whether you'd want a VRT, which takes effort, or would rather just be passed a list of files to include in a mosaicked reference file.
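
For reference, building such a mosaic VRT from a file list is a one-liner with the GDAL Python bindings; a minimal sketch with hypothetical tile paths:

from osgeo import gdal

# Hypothetical tile paths; where sources overlap, later files win
vrt = gdal.BuildVRT("mosaic.vrt", ["tile_a.tif", "tile_b.tif"])
vrt = None  # close the dataset so the .vrt is flushed to disk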

Here's a great one you can experiment with: https://github.com/scottstanie/sardem/blob/master/sardem/data/cop_global.vrt This shows nested VRTs and points to a public dataset on AWS that is a global DEM with no overlaps, one projection, only one band, and one time point. So in some ways it's the simplest possible scenario.

wildintellect avatar Jul 02 '24 21:07 wildintellect

Interesting, thanks @wildintellect.

Thanks for clearing that up about de-duplication. I was under the impression that VRTs could represent a mosaic after deduplication of source files (e.g. spatial overlap resolved through logic while building the VRT). But I suppose that use case would be choosing overlapping-data preference at the block level, not the pixel level.

abarciauskas-bgse avatar Jul 03 '24 00:07 abarciauskas-bgse

Thanks for the ping @TomNicholas! Some good points have already been mentioned. I think I just brought up VRTs because they are another example of lightweight sidecar metadata that simplifies the user experience of data management :) ... I haven't thought too much about integrations with virtualizarr, but some ideas below:

I suppose in the same way you create a reference file for NetCDF/DMR++ to bypass HDF and use Zarr instead, you could do the same for TIFF/VRT to bypass GDAL. Would probably want to do some benchmarking there, because unlike HDF, GDAL is pretty good at using overviews and efficiently figuring out range requests during reads (for the common case of a VRT pointing at cloud-optimized GeoTIFFs).

I think another connection here is: what is the serialization format for virtualizarr, and what is its scope? My understanding is that the eventual goal is to save directly to Zarr v3 format, and I'm sure there are lots of existing discussions that I'm not up to speed on. But my mental model is that VRT, STAC, Zarr, and Kerchunk JSON are all lightweight metadata mappings that can encode many things (file and byte locations, arbitrary metadata, "on read" computations like scale and offset, subset, reprojection).

It seems these lightweight mappings work well up to a limit, and then you encounter the need for some sort of spatial index or database system :) So again, my mapping becomes (Kerchunk JSON -> Parquet, VRT -> GTI, STAC -> pgSTAC, Zarr -> Earthmover?).

scottyhq avatar Jul 03 '24 17:07 scottyhq

Thanks @scottyhq !

lightweight metadata mappings that can encode many things (file and byte locations, arbitrary metadata, "on read" computations like scale and offset, subset, reprojection).

I see the chunk manifest as exclusively dealing with file and byte locations; everything else in that list should live elsewhere in Zarr (e.g. codecs, or metadata following a certain convention).
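
For concreteness, a chunk manifest is just a mapping from chunk key to path/offset/length; a sketch with made-up values:

# Conceptual shape of chunk manifest entries (paths and byte values made up)
manifest = {
    "0.0": {"path": "s3://bucket/utm.tif", "offset": 8192, "length": 262144},
    "0.1": {"path": "s3://bucket/utm.tif", "offset": 270336, "length": 262144},
}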

I would be very curious to hear @mdsumner's thoughts on all the above.

TomNicholas avatar Aug 08 '24 18:08 TomNicholas

I think @scottyhq captured my stance well, and I'm glad to see GTI mentioned here - that's really important, and new.

I actually see this completely in the opposite direction, and I wish there was more use of GDAL and VRT itself; it's amazing. But there are these heavy lenses in R and Python over the actual API (though we have very good support in {gdalraster} and in osgeo.gdal already); that's a story for elsewhere.

VRT is already an extremely lightweight virtualization, and I went looking for a similar serialization/description for xarray and ended up here. kerchunk/virtualizarr is perfect for HDF/GRIB IMO, but not for the existing GDAL suite. Apart from harvesting filepaths, URLs, and connections (database strings, vrt:// strings, /vsi* protocols), I don't see what the point would be. There certainly could be a Zarr description of a mosaic, but I'd add that as a feature to GDAL, as the software to convert it from VRT or from a WMTS connection, etc., not try to bypass it. VRT can mix formats too; it's a very general way to craft a virtual dataset from disparate and even depauperate sources.

If you want to bypass GDAL for TIFF I think you've already got what's needed, but to support VRT you would need to recreate GDAL in large part. How would it handle a subset/decimation/rescaling/set-missing-metadata description for a file? I don't think you can sensibly write reference byte ranges for parts of native tiles.

All that said, I'm extremely interested in the relationship between image-tile-servers/GTI/VRT and the various vrt:// and /vsi* abstractions, and how Zarr and its virtualizations work. There are gaps in both, but together they cover a huge gamut of capability and I'm exploring as fast as I can to be able to talk more sensibly about all that.

mdsumner avatar Aug 09 '24 00:08 mdsumner

oh, one technical point on the mention of "byte locations", which I missed in my first read

lightweight metadata mappings that can encode many things (file and byte locations, arbitrary metadata, "on read" computations like scale and offset, subset, reprojection).

That is not a general VRT thing (I think that also wasn't being suggested, but still I think it's worth adding more here)

The exception is being able to describe "raw" sources: you can craft a VRT that wraps a blob in a file or in memory, described by shape, bbox, crs, dtype, address, and size, for example, but it's not something that's used for formats with an official driver (a sketch follows the links below).

The documentation is here:

Virtual raster:

https://gdal.org/drivers/raster/vrt.html

Virtual file systems:

https://gdal.org/user/virtual_file_systems.html

VRT for raw binary files:

https://gdal.org/drivers/raster/vrt.html#vrt-descriptions-for-raw-files

MEM or in-memory raster:

https://gdal.org/drivers/raster/mem.html
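
A hedged sketch of that raw flavour (geometry and filename are made up, per the raw-files VRT docs above; the VRT driver also accepts the XML content itself in place of a filename):

from osgeo import gdal

# Made-up geometry: a bare 512x512 byte blob with no header, addressed
# purely by offsets into the file
vrt_xml = """<VRTDataset rasterXSize="512" rasterYSize="512">
  <VRTRasterBand dataType="Byte" band="1" subClass="VRTRawRasterBand">
    <SourceFilename relativeToVRT="1">blob.bin</SourceFilename>
    <ImageOffset>0</ImageOffset>
    <PixelOffset>1</PixelOffset>
    <LineOffset>512</LineOffset>
  </VRTRasterBand>
</VRTDataset>"""
ds = gdal.Open(vrt_xml)  # the VRT driver opens inline XML directly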

I think it's interesting in its relationship to how virtualizarr/kerchunk works and there's a lot of potential crossover.

mdsumner avatar Aug 09 '24 00:08 mdsumner

VRT is already an extremely lightweight virtualization

I don't think the idea here would be to replace or add a new layer, but instead to create tools that can easily translate VRT to virtual Zarr or possibly vice versa. See the issues on DMR++ for a similar idea for another virtualization format.

TomNicholas avatar Aug 09 '24 00:08 TomNicholas

One thing to keep in mind: GDAL already has support for reading Zarr (I think including Zarr v3). Once the chunk manifest ZEP is stabilized, hopefully we can get it added to the GDAL reader, which would open up these chunk manifests to everything built on GDAL.
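
For example, something like this already works for plain (non-virtual) Zarr stores via GDAL's Python bindings (store path hypothetical):

from osgeo import gdal

# GDAL's ZARR driver, opened through the multidimensional API
ds = gdal.OpenEx("some_store.zarr", gdal.OF_MULTIDIM_RASTER)
rg = ds.GetRootGroup()
print(rg.GetMDArrayNames())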

TomAugspurger avatar Aug 09 '24 01:08 TomAugspurger

Coming back to this conversation from #432 and in a post-icechunk world.

The recommended serialisation format for VirtualiZarr is now effectively Icechunk. But the points @mdsumner and @scottyhq raise above suggest very significant differences between the data models of Zarr and VRT that preclude turning VRTs into Icechunk Zarr stores for anything but very specific cases (e.g. those with no overlap).

However maybe if you used Icechunk's transactional engine to version control non-zarr data you could get something like that... cc @abarciauskas-bgse @sharkinsspatial

Regardless, reading Icechunk data from GDAL as @TomAugspurger suggested still sounds useful, but it would require binding to the Icechunk Rust client somehow.

TomNicholas avatar Feb 26 '25 05:02 TomNicholas

I think there are two separable discussions happening: 1. Can/should virtualizarr support virtualizing existing VRTs? and 2. Can GDAL work with virtualized Zarr V3? I'm going to focus on 2 :)

The recommended serialisation format for VirtualiZarr is now effectively Icechunk.

Interesting, I haven't been following virtualizarr discussions recently so this is news to me!

reading Icechunk data from GDAL as @TomAugspurger suggested still sounds useful, but would require binding to the Icechunk rust client somehow.

I was hoping https://virtualizarr.readthedocs.io/en/stable/usage.html#writing-as-zarr would be achievable, b/c if I understand correctly writing to-spec Zarr v3 is what would enable the "automatic" compatibility with GDAL...

Icechunk seems great and I'm totally in favor of using it, but for smaller datasets it adds cognitive load (sessions? commits?) and I'm still a bit confused about how GDAL could interact with icechunk stores:

Ideal simple code (but doesn't work :):

import xarray as xr

combined_vds.virtualize.to_zarr('combined_v3.zarr')

ds = xr.open_dataset('combined_v3.zarr')

# As mentioned above, if it's written as "compliant zarr v3" other software like gdal could work with it:
!gdalinfo combined_v3.zarr

Alternative code (not clear to me how GDAL can understand an icechunk store):

import xarray as xr
from icechunk import Repository, local_filesystem_storage

storage = local_filesystem_storage("/tmp/icechunk/store")
repo = Repository.create(storage=storage)
session = repo.writable_session("main")  # Note typo in docs: writeable -> writable
combined_vds.virtualize.to_icechunk(session.store)
snapshot = session.commit("my first virtualized dataset")

ds = xr.open_zarr(session.store, consolidated=False)

# Should this work? How can gdal 'read' an icechunk store / recognize valid zarr v3 format?
# session.store.get('zarr.json')  # TypeError: IcechunkStore.get() missing 1 required positional argument: 'prototype'
!gdalinfo /tmp/icechunk/store ?

scottyhq avatar Feb 26 '25 10:02 scottyhq

I don't think anything but Rust can understand an Icechunk store. GDAL can't read virtualized Zarr. It can read the metadata, but that doesn't help unless you can decode the references and point your reader at them (certainly the ZARR driver can be repurposed for that, and I think we're not the only ones talking about it). But with Icechunk it seems like it's not just creators that need the Rust bindings; that's starting to loom as another problem to me. (... but probably it's just the references that are encoded more opaquely; presumably the actual chunks are as generic as ever.)

mdsumner avatar Feb 26 '25 12:02 mdsumner

Can GDAL work with virtualized Zarr V3? I'm going to focus on 2 :)

(It would be nice to discuss this on the icechunk issue tracker instead, because it's not a question that's actually related to the VirtualiZarr package.)

The recommended serialisation format for VirtualiZarr is now effectively Icechunk.

Interesting, I haven't been following virtualizarr discussions recently so this is news to me!

A little more context here. We could still make a lightweight format for virtual zarr, but someone has to really want it, and the icechunk dev team would prefer everyone just get aboard the icechunk train 😁

https://virtualizarr.readthedocs.io/en/stable/usage.html#writing-as-zarr

Wait why is that still in the docs!? It was removed in #426.

If I understand correctly writing a to-spec Zarr V3 is what would enable the "automatic" compatibility with GDAL...

You'll never get "automatic" ability to read virtual zarr of any type; it's always going to require some change to a reader to teach it to understand a manifest.json / icechunk store. The question is just how big a change it requires, whether or not that change brings in dependencies, and whether the format you're reading has a stable spec.

Icechunk seems great and I'm totally in favor of using it, but for smaller datasets it adds cognitive load (sessions? commits?)

You're not wrong. But it also brings incredibly powerful version control features that a bare manifest doesn't, and whose implementation is complementary to even having a manifest.

and I'm still a bit confused about how GDAL could interact with icechunk stores:

GDAL would have to call an icechunk client to interact with icechunk stores. Currently the only client is the one written in rust (with python bindings) and maintained by Earthmover.

I don't think anything but Rust can understand an Icechunk store.

But Icechunk has an open spec, so if you really, really wanted to avoid the Rust dependency you could write your own icechunk client in any language. Basically icechunk uses the zarr data model and the zarr-python library, but doesn't use the "native zarr" format, which really is an implementation detail of the zarr-python FsspecStore.

but probably it's just the references that are encoded more opaquely, presumably the actual chunks are as generic as ever

Yes the chunk storage is still fairly straightforward, but supporting time-travel means you don't know which chunks to read without also having an implementation of the version-control layer.

TomNicholas avatar Feb 26 '25 14:02 TomNicholas

all good, lots of my lingering qs answered there!

mdsumner avatar Feb 26 '25 14:02 mdsumner

The recommended serialisation format for VirtualiZarr is now effectively Icechunk

FWIW I have been recommending Icechunk serialization of virtual datasets for easily updatable workflows by nimble/forward-looking teams, and suggesting people with larger, production-based use cases prepare to use Icechunk after it reaches v1 (likely within the next couple of months). I wouldn't recommend integrating Icechunk into production right now across the board. The pilot project in https://github.com/earth-mover/icechunk-nasa is still finding pain points so that other people don't have to if they would rather wait for a stable release.

maxrjones avatar Feb 26 '25 15:02 maxrjones

I've come around a bit on this topic: a GDAL multidim VRT is perfect for virtualising. I'm making some VRTs of THREDDS server datasets. There's no rioxarray for this because rasterio will never do multidim, but we can use osgeo.gdal.

There's a bit of technical debt around concatenation with multidim VRTs, but it will give a good example to hash out for virtualizing, with all the files-as-chunks already identified within the logical array. I can probably see my way through a multiple-inputs mdimtranslate that works like open_mfdataset, but leapfrogging VRT will probably be worth it to get the ideas across.

I've looked at rioxarray, and it doesn't seem like too much to create a multidim backend that would leverage osgeo.gdal in xarray. Would someone like to explore this with me? I'll craft a reasonably sized VRT and explain how VirtualiZarr would use it as a starting point.

mdsumner avatar Mar 13 '25 06:03 mdsumner

here's a reasonably sized VRT, of an ocean_salt dataset on a THREDDS server. (I've ignored the nv and "edges" parts of the netcdf, because the time is regular and all that was needed was to increment each source offset and the overall "Time" size.)

https://github.com/mdsumner/pymdim/blob/main/inst/examples/bluelink_ocean_salt.vrt

The work starts at this list of extra Source nodes which increment by the size of time in each file:

https://github.com/mdsumner/pymdim/blob/5c1132bdc49782109238f709f84dac75ecd37252/inst/examples/bluelink_ocean_salt.vrt#L148-L153

It wouldn't take much to open this with xarray I expect; maybe it's a useful way to have a serialization of scanned files and some GDAL/xarray crossover.

I'm variously using osgeo.gdal to interact with the multidim API, and my own code in https://github.com/mdsumner/gdalraster branch "multidimnew". This can also be opened with {stars}::read_mdim, or by loading osgeo.gdal in R, and I'll keep exploring that. (Obviously pixel-drill in time is not very performant across many months because the Time chunk = "month length", but slices in xy and xz or yz work fine.)

(I'm just hacking the xml with {xml2} for now; it would be nice to have gdalmdimtranslate accept multiple inputs and resolve them this way, which I'll also be working on.)

mdsumner avatar Mar 17 '25 04:03 mdsumner

Cool! Might it be possible to write a virtualizarr reader for this just using the python xml standard library module? That's basically what the DMR++ reader we have does.
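
Something like this, perhaps; a rough sketch assuming the Source/DestSlab layout described in the multidim VRT docs (untested against the file linked above):

import xml.etree.ElementTree as ET

# Walk the multidim VRT's Source nodes and collect, per file, the source
# path and where it lands in the destination array (the Time offsets)
tree = ET.parse("bluelink_ocean_salt.vrt")
for array in tree.getroot().iter("Array"):
    for src in array.iter("Source"):
        path = src.findtext("SourceFilename")
        dst = src.find("DestSlab")
        offset = dst.get("offset") if dst is not None else "0"
        print(array.get("name"), path, offset)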

TomNicholas avatar Mar 17 '25 15:03 TomNicholas

I'll explore, and I appreciate the pointers, I need them. Mostly I'm exploring the GDAL multidim side, because it opens this up for R.

mdsumner avatar Mar 17 '25 20:03 mdsumner

Though, the gdalmdiminfo of this is JSON, which is probably easier; can get it from `.GetMetadata("xml:VRT")`

mdsumner avatar Mar 17 '25 20:03 mdsumner

Doh, obviously I meant ds.InfoAsJSON(), after OpenEx() with OF_MULTIDIM_RASTER ...

mdsumner avatar Mar 17 '25 21:03 mdsumner

is it generally possible to get the dmrpp from an opendap source? From, say, one of these:

https://data-cbr.csiro.au/thredds/catalog/catch_all/CMAR_CAWCR-Wave_archive/CAWCR_Wave_Hindcast_aggregate/gridded/catalog.html

https://thredds.nci.org.au/thredds/catalog/gb6/BRAN/BRAN2023/daily/catalog.html?dataset=gb6/BRAN/BRAN2023/daily/atm_flux_diag_2010_01.nc

For BRAN2023 I can generate the references in situ, but I assume these servers have that dmrpp somewhere; apologies, I don't know where else to find this out (but I will be working with some experts closer to the data I'm interested in soon).

I see examples like this, and thought it might just hang off the dodsC url, but I can't find it:

http://test.opendap.org:8080/opendap/data/dmrpp/chunked_gzipped_fourD.h5.dmrpp.dmr.html

mdsumner avatar Mar 19 '25 11:03 mdsumner

is it generally possible to get the dmrpp from an opendap source?

I have no idea, but one of @ayushnag @betolink @Mikejmnez will know.

TomNicholas avatar Mar 19 '25 13:03 TomNicholas

Hi all, I haven't fully read the whole conversation thread here, but I will try to contribute on the immediate thing I know (probably not best practice; apologies in advance if I misunderstand something) :)

is it generally possible to get the dmrpp from an opendap source?

I have no idea, but one of @ayushnag @betolink @Mikejmnez will know.

If by get the dmrpp from an opendap source you mean inspect it: in theory yes, but in practice it may vary. From the server side, it is configured so that if one appends a dmrpp to a data url associated with the file, it should reproduce the dmrpp in the browser. I know that from the NASA side, there has been a push to hide the dmrpps from the user.

If by get the dmrpp from an opendap source you mean generate a dmrpp: yes, and it is rather straightforward. But I must clarify that only Hyrax, the opendap server developed by the OPeNDAP group, can generate/use dmr++ (as opposed to Thredds by Unidata, or GrADS from COLA, etc.).
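
So for the inspect case, the cheapest probe is just appending the suffix and seeing what comes back; a rough sketch (granule URL hypothetical, and whether the server exposes the dmrpp varies as described above):

import urllib.error
import urllib.request

data_url = "https://example.org/opendap/granule.h5"  # hypothetical granule URL
try:
    with urllib.request.urlopen(data_url + ".dmrpp") as resp:
        dmrpp_xml = resp.read()  # the dmr++ document, if the server exposes it
    print(dmrpp_xml[:200])
except urllib.error.HTTPError as err:
    print("server does not expose the dmrpp:", err.code)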

Mikejmnez avatar Mar 19 '25 15:03 Mikejmnez

That's great, thank you, and it accords with what I've seen. I mean the first: to be able to inspect it.

Follow up question then: how does OpenDAP actually work? I thought it was using these references pre-stored by some mechanism ~from dmrpp~, and I expected to find some kind of xml-read/parse-chunk-refs code in libdap. Or does it sometimes occur dynamically, with the dodsC/ interface calling out to fileServer/ to determine it? I had gathered that these references powered the whole OpenDAP system??

Is there something dmrpp-like that other servers use?

mdsumner avatar Mar 19 '25 20:03 mdsumner

I will read that blogpost, thanks for the link!

mdsumner avatar Mar 19 '25 20:03 mdsumner

I think I'm learning that OpenDAP works via dmrpp now, but in the past used the traditional netcdf API dynamically.

mdsumner avatar Mar 19 '25 20:03 mdsumner

@mdsumner

Follow up question then: how does OpenDAP actually work? I thought it was using these references pre-stored by some mechanism from dmrpp, and I expected to find some kind of xml-read/parse-chunk-refs code in libdap. Or does it sometimes occur dynamically, with the dodsC/ interface calling out to fileServer/ to determine it? I had gathered that these references powered the whole OpenDAP system??

The specifics of how opendap uses dmrpp are, imho, beyond my level of expertise. I know there is a somewhat complex orchestration layer within the BES of the OPeNDAP/Hyrax data server. There, parsing/checking/etc. of dmrpps takes place when data is on S3, as opposed to the traditional data-handler approach of on-premises data servers (Hyrax can also work on premises still). But that is a little beyond my own relatively newbie understanding of the Hyrax data server. It is open source, and you are welcome to inquire more about this on the opendap GitHub project, so it does not dominate this conversation on GDAL :)

Is there something dmrpp-like that other servers use?

DMRpp (dmr++) is a combination of DMR (which is a DAP4 metadata representation) and the ++ part, where the chunk references come into play. Essentially a DMR++ is a DMR with extra stuff, and a DMR is a DAP4 replacement of the DDS/DAS DAP2 metadata representation. And so other servers that implement the DAP4 protocol would use something like the dmrpp, but in reality it would only be the DMR part. The ++ part was specifically envisioned/developed by the OPeNDAP group for Hyrax when NASA first started considering moving data to the cloud.

Thredds (Unidata's) data servers do implement DAP4, but again, the dmrpp-like object/behavior you'd find is the DMR part (without the extra ++ stuff). FYI, any Thredds URL with dodsC in it is DAP2. In theory, if you replace the dodsC part of the URL with dap4, it should render the DAP4 version of the data response form...
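
i.e. roughly (URL assembled from the BRAN2023 catalog entry above; untested):

url = "https://thredds.nci.org.au/thredds/dodsC/gb6/BRAN/BRAN2023/daily/atm_flux_diag_2010_01.nc"
dap4_url = url.replace("/dodsC/", "/dap4/")  # DAP4 response form, if the server implements it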

Mikejmnez avatar Mar 21 '25 17:03 Mikejmnez

Awesome, much appreciated!

mdsumner avatar Mar 21 '25 20:03 mdsumner

there's some super cool stuff coming with the new GDALG driver and the new CLI tools

https://gdal.org/en/latest/drivers/raster/gdalg.html

https://gdal.org/en/latest/programs/gdal_cli_from_python.html#gdal-cli-from-python

Very programmable to write these tiny JSON pipelines, and it will be much more attractive than VRT per se:

import rioxarray

## this is a GDALG pipeline read/reproject to local crs of GEBCO 2024 using GDAL Streaming Algorithm (brand-new in dev)
dsn = "https://gist.githubusercontent.com/mdsumner/e1f5e4d24705080aa5e94300778e6fcb/raw/f4cd9b0e2f67a4bd47e2be709a9d712dd148300c/gebco_tas_vic.gdalg.json"

rioxarray.open_rasterio(dsn)
<xarray.DataArray 'elevation' (band: 1, y: 1333, x: 866)> Size: 2MB
[1154378 values with dtype=int16]
Coordinates:
  * band     (band) int64 8B 1
  * x        (x) float64 7kB 2.103e+06 2.104e+06 ... 2.967e+06 2.968e+06
  * y        (y) float64 11kB 2.931e+06 2.93e+06 2.929e+06 ... 1.6e+06 1.599e+06
    crs      int64 8B 0
Attributes: (12/63)
    AREA_OR_POINT:                   Area
    lat#axis:                        Y
    lat#long_name:                   latitude
    lat#sdn_parameter_name:          Latitude north
    lat#sdn_parameter_urn:           SDN:P01::ALATZZ01
    lat#sdn_uom_name:                Degrees north
    ...                              ...
    sdn_uom_urn:                     SDN:P06::ULAA
    standard_name:                   height_above_mean_sea_level
    units:                           m
    _FillValue:                      -32767
    scale_factor:                    1.0
    add_offset:                      0.0
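
For reference, the .gdalg.json itself is tiny: roughly a JSON document recording a streamed pipeline. A hedged sketch of the shape; the field names and the pipeline string below are my assumptions from reading the GDALG driver docs, not copied from the gist above:

import json

# Assumed shape of a .gdalg.json file; field names and pipeline illustrative
gdalg = {
    "type": "gdal_streamed_alg",
    "command_line": (
        "gdal raster pipeline read GEBCO_2024.tif "
        "! reproject --dst-crs EPSG:3577 ! write streamed.vrt"
    ),
}
with open("local_crs.gdalg.json", "w") as f:  # hypothetical filename
    json.dump(gdalg, f)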

mdsumner avatar Mar 21 '25 21:03 mdsumner