Apache Sedona: translate from GeoTIFF to COG
Expected Scenario
We're trying to convert TIFF files to COG. Is it possible to use Apache Sedona for the conversion?
If yes, is there any code reference that we can look at?
Settings
Sedona version = 1.3.1
Apache Spark version = 3.2
API type = Python (PySpark)?
Environment = Azure, Databricks, Synapse
In Sedona 1.4 we've added a new raster type. It's basically a serialized GridCoverage2D (from GeoTools).
You can use the new API to load the TIFF files: https://sedona.apache.org/latest-snapshot/api/sql/Raster-loader/#rs_fromgeotiff
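A minimal PySpark sketch of that loading step. It assumes Sedona 1.4 or later is already registered on the Spark session so that RS_FromGeoTiff is available in SQL. The function name `load_geotiffs` and the path in the usage comment are hypothetical.

```python
def load_geotiffs(spark, path_glob):
    """Read GeoTIFF files as binary blobs and decode them into Sedona rasters.

    Assumes `spark` is a SparkSession on which the Sedona 1.4+ SQL functions
    (including RS_FromGeoTiff) have already been registered.
    """
    # Spark's built-in binaryFile source yields one row per file, with
    # `path` and `content` (the raw bytes) among its columns.
    binary_df = spark.read.format("binaryFile").load(path_glob)
    # RS_FromGeoTiff decodes the raw bytes into Sedona's raster type.
    return binary_df.selectExpr("path", "RS_FromGeoTiff(content) AS rast")


# Hypothetical usage (the path is a placeholder):
# rasters = load_geotiffs(spark, "/mnt/data/tiffs/*.tif")
# rasters.printSchema()
```

This only loads the rasters. It does not write COG output.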
Sedona has no function to create COG files yet. You could implement it yourself as a UDF. If you are willing to send a PR to Sedona - I'm sure the community would be very happy!
I did some more research. GDAL is able to create COGs, and rasterio is a Python wrapper around GDAL. You should be able to convert GeoTIFFs to COGs in rasterio. If not, there is a rasterio plugin to do that: https://github.com/cogeotiff/rio-cogeo
You can create a Python UDF that does the conversion. It will be painfully slow and consume lots of memory.
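A rough sketch of the conversion itself, using the rio-cogeo plugin linked above. It assumes rasterio and rio-cogeo are installed wherever the function runs and that the input is a local file path. The helper names `cog_output_path` and `convert_to_cog`, the `.cog.tif` naming scheme and the choice of the "deflate" profile are illustrative assumptions, not anything Sedona provides.

```python
import os


def cog_output_path(src_path):
    """Derive a hypothetical output name, e.g. 'a/b.tif' -> 'a/b.cog.tif'."""
    root, _ext = os.path.splitext(src_path)
    return root + ".cog.tif"


def convert_to_cog(src_path, dst_path=None, profile="deflate"):
    """Convert one GeoTIFF to a COG with rio-cogeo and return the output path."""
    # Imported lazily so this module still loads on machines where
    # rasterio/rio-cogeo (and the native GDAL library) are not installed.
    from rio_cogeo.cogeo import cog_translate
    from rio_cogeo.profiles import cog_profiles

    dst_path = dst_path or cog_output_path(src_path)
    # cog_profiles holds ready-made creation options (compression, tiling).
    dst_profile = cog_profiles.get(profile)
    # in_memory=False asks rio-cogeo to stage the intermediate raster in a
    # temporary file on disk instead of holding it in RAM.
    cog_translate(src_path, dst_path, dst_profile, in_memory=False, quiet=True)
    return dst_path


# Hypothetical usage (the path is a placeholder):
# convert_to_cog("/tmp/scene.tif")
```

To run this on Spark, the function could be wrapped in a Python UDF over a column of file paths, provided the paths are readable and writable from the executors. The speed and memory caveats above still apply.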
The other option is to implement the UDF in a JVM language. GeoSolutions has developed a GDAL wrapper in Java, which you can access at https://github.com/geosolutions-it/imageio-ext.
@umartin Hi Martin, I have been thinking about the COG format for a while. It is a bit hard to find a full-fledged Java reader/writer for COG.
GeoTools does not have one. imageio-ext (https://github.com/geosolutions-it/imageio-ext/tree/master/plugin/cog) has a reader but no writer.
The only way to support both is the GDAL Java API. Do you think it is a good idea to include GDAL Java as a dependency?
GDAL has comprehensive support for raster formats: https://gdal.org/drivers/raster/index.html If we can use it, this will open the entire raster world for Sedona. Raster data support could be the next big thing for Sedona.
@jiayuasu I think that would be a good idea. It’s not ideal to depend on a native LGPL library but I think it’s our only option right now. I guess you are thinking about adding something like this: https://postgis.net/docs/RT_ST_AsGDALRaster.html
If we can write a JVM implementation of COG at a later time, we can simply add RS_AsCOG.
We plan to write a COG writer in Apache SIS (it already has a COG reader). I cannot promise when because it depends on external factors, but if we get selected in OGC Testbed-19 we may develop the COG writer this summer. The COG reader in Apache SIS was developed as part of OGC Testbed-17, so the same thing may happen with Testbed-19 because one task item will be to develop a GeoTIFF writer for extraterrestrial use. This is an area where Apache SIS is well positioned, because it has one of the most advanced Referencing by Coordinates frameworks available in open source (Apache SIS was used as a proof of concept in Testbed-18 for referencing objects in space).
This brings back the topic of Sedona's dependency on GeoTools. LGPL dependencies are normally not allowed in Apache projects unless they are optional. But given the importance of geospatial services for Sedona, it seems to me that a significant fraction of Sedona's functionality depends on GeoTools, and this fraction may grow as new geospatial features are added to Sedona. GeoTools could advantageously be replaced by Apache SIS for metadata, referencing services and raster support. The GridCoverage2D class in GeoTools also exists in Apache SIS (actually I'm the original author of that class in GeoTools). The main drawback is the smaller number of supported formats. But if bindings to GDAL are used as a temporary solution for getting a COG writer before it becomes supported in Apache SIS, that GDAL binding could be used for other formats as well.
I will be physically present at the joint OGC/OSGeo/ASF code sprint in Switzerland, April 25 to 27. If anyone from the Sedona community plans to be there, maybe it would be an opportunity to explore the feasibility of replacing GeoTools with Apache SIS?
@desruisseaux
That's really exciting news that you're considering implementing a COG writer! I'm curious, would this also mean that Apache SIS would support writing "regular" GeoTIFFs as well? If the COG writer is successfully implemented and backported to a Java 8 release of Apache SIS, I would love to hear about it. With write support for GeoTIFF, Apache SIS would be at feature parity with GeoTools, at least from Sedona's perspective. And write support for COG would give it an advantage.
If you do release a Java 8 version with these features, I'd be happy to set up a Sedona branch that replaces GeoTools with Apache SIS. We could then run some benchmarks and have a new discussion in the Sedona community.
Although I won't be able to attend the OGC/OSGeo/ASF code sprint in Switzerland myself, I appreciate you letting me know about it. Thanks!
@desruisseaux For testing purposes, I think even a SNAPSHOT Java 8 release of Apache SIS is good enough. We can use that to confirm the integration between SIS and Sedona.
Hello @umartin and @jiayuasu, thanks for your reply!
Actually COG is not a file format, but rather a set of good practices for encoding GeoTIFF files. For example, the TIFF format allows metadata to be located anywhere in the file, but COG restricts them to the beginning of the file. TIFF is very flexible in the way tiles and overviews are organized, but COG puts some restrictions on the layout, etc. Consequently all COG files are also regular GeoTIFF files, only restricted to a subset of TIFF flexibility. For a first version, I see no reason not to produce COG files unconditionally.
In a future version, there are some reasons why we may sometimes want a non-COG file. For example, if we want to create a GeoTIFF file in append mode, i.e. add tiles to an existing file in random order, the COG layout is not suitable for that. But GeoTIFF stays well suited if we forget COG for that particular scenario.
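To illustrate the "metadata at the beginning of the file" restriction, here is a small standard-library sketch that reads a TIFF header and reports where the first image file directory (IFD) is located. It is only a heuristic and not a COG validator: a real check would also cover the tile and overview layout. The function names and the 16384-byte threshold are arbitrary choices for this example.

```python
import struct


def first_ifd_offset(header):
    """Return the byte offset of the first IFD from the leading bytes of a TIFF.

    Handles classic TIFF (magic 42) and BigTIFF (magic 43). `header` needs at
    least 8 bytes for classic TIFF and 16 bytes for BigTIFF.
    """
    if header[:2] == b"II":
        endian = "<"  # little-endian
    elif header[:2] == b"MM":
        endian = ">"  # big-endian
    else:
        raise ValueError("not a TIFF file: bad byte-order mark")
    (magic,) = struct.unpack(endian + "H", header[2:4])
    if magic == 42:  # classic TIFF: 4-byte offset stored at byte 4
        (offset,) = struct.unpack(endian + "I", header[4:8])
    elif magic == 43:  # BigTIFF: 8-byte offset stored at byte 8
        (offset,) = struct.unpack(endian + "Q", header[8:16])
    else:
        raise ValueError("not a TIFF file: bad magic number")
    return offset


def metadata_starts_near_beginning(header, max_offset=16384):
    """Heuristic only: True if the first IFD lies within `max_offset` bytes.

    `max_offset` is an arbitrary threshold chosen for this sketch.
    """
    return first_ifd_offset(header) <= max_offset


# Hand-built 8-byte headers: first IFD at byte 8 versus at byte 5,000,000.
front = struct.pack("<2sHI", b"II", 42, 8)
back = struct.pack("<2sHI", b"II", 42, 5_000_000)
print(metadata_starts_near_beginning(front))
print(metadata_starts_near_beginning(back))
```

The first header passes the heuristic and the second does not, since its IFD sits about 5 MB into the file, which plain TIFF permits but the COG layout does not.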
For testing purposes, the SIS 1.3 release is Java 8 compatible and has the COG reader, optionally with Amazon S3 support. SIS 1.4-SNAPSHOT has some improvements and bug fixes, but requires Java 11. A port to Java 8 may be possible, but before doing that, if Sedona can be tested on Java 11, it would be useful for determining whether the current set of SIS functionality is suitable for Sedona.
Note: if testing with a large COG file, the page "Handle rasters bigger than memory" may be useful. In summary, SIS supports immediate or deferred data loading. The former may be more intuitive and is the default for that reason; the latter scales better.