exactextract
exactextract copied to clipboard
Add Python Bindings
Hi there! This is a wonderful project and wanted to take the time to create a Python wrapper (#7) to use in my project. I additionally included a set of unit tests and documentation for the wrapper. Let me know if you'd like to get this merged. Thanks!
Jon
Thank you for this! It'll take me a bit of time to go through this, but overall I'm much in favor. There has been quite a bit of demand for Python bindings.
cc @rsignell-usgs
Very helpful feedback @hellkite500 ! I'll do some revisions and push the changes soon.
Some of the questions on my mind are:
- Is this the right level of detail to expose, or should we prefer a higher-level interface such as the one exposed in
exactextractr
? - Do classes such as
Box
,Grid
, andOperation
have sufficient value to Python users that it is worth exposing them, thereby imposing a stability requirement on their interface? - Should we deal with types such as
gdal.Dataset
andgdal.Band
rather than exporting classes such asGDALRasterWrapper
to Python? Similarly, would producing a pandas data frame be preferable to writing a CSV?
- Is this the right level of detail to expose, or should we prefer a higher-level interface such as the one exposed in
exactextractr
?
Are you suggesting expoing a single function interface which takes some "standard" python representation of inputs and outputs a single dataframe (or possibly a geodataframe?), similar to this one in exactextractr
- Do classes such as
Box
,Grid
, andOperation
have sufficient value to Python users that it is worth exposing them, thereby imposing a stability requirement on their interface?
I think some of these definitely do, especially the Grid.
. The Operation
class, could be argued that exposing that to Python may not be required, as most Operation
s could be proxyed to use the user behind some convenience functions/scripts.
That being said, these are part of the public
library API, so are you suggesting that libexactextract
isn't intended to be stable? Also, we can pin the python version to the library version, and it would just follow the semantic versioning of the lib. If the public API of the C++ classes change, that changes the libexactextract
semantic version, forcing a change in the python version (and possible the binding code supporting it). As far as ensuring the python bindings stay up-to-date as the public API of the lib evolves, it requires testing the bindings and updating them as the API they bind changes. Is that level of maintenance a concern here in this repository? We could do that via a separate repository and pin an exactextract
git submodule for the stable version of the lib we are binding.
- Should we deal with types such as
gdal.Dataset
andgdal.Band
rather than exporting classes such asGDALRasterWrapper
to Python? Similarly, would producing a pandas data frame be preferable to writing a CSV?
I would be interested in looking at fiona
, geopandas
, and python gdal
practices and see if working with something a little more "native" would make sense and be practical.
Are you suggesting expoing a single function interface which takes some "standard" python representation of inputs and outputs a single dataframe (or possibly a geodataframe?), similar to this one in exactextractr
My default implementation design would be a Python function that largely mimics the exact_extract
function in R. I think this has been fairly successful in R, so I wouldn't change it without a reason to do so. The implementation of that Python function may use what I consider to be low-level classes (Grid
, Operation
, etc.) but I would not expect a Python end user to work with these directly. It might make sense to move some of the logic in the R package into this repository; for example, code supporting the re-use of calculated coverage fractions for multiple raster inputs.
are you suggesting that libexactextract isn't intended to be stable?
It has no versioned releases, so I don't think there's a particular stability contract.
It has no versioned releases, so I don't think there's a particular stability contract.
Fair enough, I never really thought about it that way, haha.
Thanks for indulging this discussion, as it helps clear up the various use cases and the best way to serve them. I'm definitely in a weird place thinking through this having spent some time really trying to understand the backend (exactextract) and the the various ways to use it, especially WRT coverage fraction and its extraction/reuse. Would be happy to pick up that discussion on a separate thread.
I'm still kind of curious if there are use cases to take advantage of the Grid
abstractions outside the context of just raster stats. @jdalrym2 as the OP of this PR, did you have some intended use case outside what you walked through in the documentation sections of this PR? I think you did essentially what @dbaston was proposing in exposing a python interface to mimic the CLI
interface, but I think the proposition is to move that implementation more into a standard C++ function/class and use that to serve to the bindings to both R and Python.
Sorry I'm late to the conversation.
Is this the right level of detail to expose, or should we prefer a higher-level interface such as the one exposed in exactextractr?
Admittedly my process of binding this was just writing out the bindings for classes until I figured out how to make a workflow that worked. I didn't know enough of the library to make high level decisions like this. I actually didn't know exactextractr
existed until I was almost done. I will look into your interface there and form an opinion. It makes a lot of sense to mirror this from a maintainability perspective. My only holdup would be if there are Python antipatterns in your interface there. I'll return to this.
Do classes such as Box, Grid, and Operation have sufficient value to Python users that it is worth exposing them, thereby imposing a stability requirement on their interface?
Operation is needed for the way I'm binding things now (but this can change), but I'd actually agree Box
and Grid
might be too low-level for this interface (@hellkite500 I'm curious if you have a proposed Python use-case for Grid
). I bound those classes first and realized toward the end that they weren't needed for the final interface / workflow.
Should we deal with types such as gdal.Dataset and gdal.Band rather than exporting classes such as GDALRasterWrapper to Python? Similarly, would producing a pandas data frame be preferable to writing a CSV?
I deal with those types with my adapter classes (in the pure Python portion). Handling them directly without an adapter and without exposing GDALRasterWrapper
, etc, requires passing off memory addresses through the pybind interface (at least that's the only way I've been able to make it work), and it ended up being too hacky for comfort. Exposing GDALRasterWrapper
and writing Python code to help the user make these seemed like the best bet.
Further down the discussion:
It has no versioned releases, so I don't think there's a particular stability contract.
I'm curious if you'd like to start now. Having specific versioned releases could make this a good candidate for hosting on PyPI. Another option would be to move all of this to a separate repo and keeping the main repo as a submodule. The separate repo would have versioned releases tied to specific pinned commit hashes in this repo.
@jdalrym2 as the OP of this PR, did you have some intended use case outside what you walked through in the documentation sections of this PR?
No I didn't. I just started binding things until I got a solid workflow going. Admittedly little forethought. I actually considered removing the bindings that aren't directly in the workflow before submitting this PR, but decided against it pending feedback as to not erase work.
... but I think the proposition is to move that implementation more into a standard C++ function/class and use that to serve to the bindings to both R and Python.
I think this is the ideal move, but I'm definitely hesitant based on how much work this might involve. Perhaps you or @dbaston are rockstar C++ developers that could knock this out in an hour, but for me that'd be much, much longer. ;)
I'll return to this.
That was actually less to review than I thought. I tend not to be a fan of "all in one go" function calls, in-case there is setup overhead and the user would like to reuse existing classes or state. Idk if that really applies here, but I tend to lean toward a "setup first, then go second" sort of design pattern as opposed to massive one-liners. I don't really think there's a right answer here, and honestly I like the idea of doing both: a high level interface that mirrors exactextractr
closely that calls a somewhat lower-level, but still exposed interface in case the user desires more granularity.
@jdalrym2 I've recently tested the python bindings with a multi-band .tif file, and the processor module does not appear to be correctly processing multiple raster bands within the .tif file together properly when specified. Essentially, when a user initializes multiple rasters and operations for two separate bands within a .tif file, they are not properly processed within the "FeatureSequentialProcessor" step as you call both operations (e.g. FeatureSequentialProcessor(dsw, writer, [op1,op2]) for two separate raster bands. The processor output essentially only yields the first raster band output for both operations. However, when each raster operation is processed separately (op1, op2), they both produce the correct output values (when compared to results with original C+ ExactExtract executable). My only guess to this initial problem is that the band index isn't being properly processed with the "processor.py" module when multiple raster operations are ingested, but I am trying to narrow down potential coding problem now. I just wanted to loop you into this problem and see if you had any suggestions as to what may going on here. I would mention that the functionality for multiple rasters within a netcdf file does work properly for the ExactExtract python bindings, so this issue is isolated only for .tif files currently.
@jduckerOWP Thanks for the info and for looking into this! FWIW, I might not get to take a look for a couple days, but will follow up if I find anything.
Overall: I agree it's not behaving as expected and I don't believe I have tested my bindings on anything other than a single band so far.
UPDATE (@jduckerOWP ) I was able to write a unit test to replicate your issue and fix it. Please see commit e278076 for more details.
Interestingly, the bug seemed to be in the exactextract code itself (@dbaston pls look), where Operation
objects were keyed in the StatsRegistry
by the name
of their parent RasterSource
(i.e. GDALRasterWrapper
). However, in the case of TIF files, the band number did not exist in the name
, and so keys clashed for multiple operations that dealt with the same TIF. The solution was simply to append "|[bandnum]"
to the name of the parent GDALRasterWrapper
object on init.
It is a mystery to me why exactextract CLI doesn't run into this issue. I did not dig into that, however.
@jdalrym2 Thank you for looking into this issue further! I've been stumped as to narrowing down the exact problem due to the fact that the exactextract CLI did not reproduce the same issue (which led me to believe that it had to initially had to be a bug in the python bindings, although nothing stood out as an issue through a debugger). The details you've described here completely make sense to me now with the underlying issue within the exactextract code. I'm glad it wasn't a significant bug in the exactextract code, but rather the way the operation objects were created with the proper register keys for each band of the TIF file. I have made the adjustments to the gdal_raster_wrapper.cpp file and have confirmed from my end as well that this code does indeed resolve the issues with specifying band numbers (tested TIF file with 8 different bands, all came out correctly) for a TIF file. I appreciate the time and support resolving this issue quickly with the python bindings functionality for TIF files. @hellkite500 is working on revising the ExactExtract branch that includes coverage fraction weights file output to be included as part of the master branch functionality for ExactExtract. At that point, we can reproduce the python bindings as well for this functionality in the future. Thanks again your collaboration efforts with this project!
$ git clone -b add-pybindings-jd --single-branch https://github.com/jdalrym2/exactextract.git
...
$ cmake -DBUILD_CLI:=OFF -DBUILD_DOC:=OFF -DBUILD_PYTHON:=ON -DCMAKE_BUILD_TYPE=Release ..
$ make
...
[100%] Linking CXX shared module _exactextract.cpython-36m-x86_64-linux-gnu.so
lto-wrapper: warning: using serial compilation of 9 LTRANS jobs
lto-wrapper: note: see the ‘-flto’ option documentation for more information
[100%] Built target _exactextract
61 81.3s exactextract:/cmake-build-release (add-pybindings-jd|✔) $ make install
make: *** No rule to make target 'install'. Stop.
It looks like I end up with two files:
- libexactextract.a
- _exactextract.cpython-36m-x86_64-linux-gnu.so
Where do I copy them to install it? I'm using a multi-stage docker setup with osgeo/gdal:ubuntu-small as builder and final base image
I'm guessing:
- /usr/local or /usr/include/ ?
- can I copy _exactextract.cpython-36m-x86_64-linux-gnu.so directly into a virtual_env? or does it go somewhere else ?
@chapmanjacobd Please try the instructions in the README.md in the python/ directory (using python3 bdist). That should resolve the placement of files automatically within your environment.
If that doesn't work feel free to follow up.
Ohh... the python bindings don't support Virtual File Systems?
def get_ds_path(ds: Union[gdal.Dataset, ogr.DataSource]):
"""
Get the file path of an OSGeo dataset
Args:
ds (Union[gdal.Dataset, ogr.DataSource]): Input OSGeo dataset
Returns:
str: File path to dataset
"""
if isinstance(ds, gdal.Dataset):
return ds.GetFileList()[0] # type: ignore
else:
return ds.GetName()
https://gdal.org/doxygen/classGDALDataset.html#aefe21eb2d1c87304a14f81ee2e6d9717
I am extremely grateful that these python bindings exist but also I'm a little surprised by this. If I pass in a dataset and GDAL is doing fine, does it not make sense to pass that data through rather than try to re-open it?
@chapmanjacobd Direct passthrough of the GDALDataset object isn't possible due to how the different Python bindings work (I tried, it got hacky really quick). So a path must be extracted and then passed through with a new GDALDataset object created on the C++ side. I'd expect a virtual path would work as well... just haven't tested or implemented it. These bindings are pretty new, and were narrowly scoped for what I needed to do.
Feel free to implement it if you so wish and PR to my branch and/or write an issue and I can go take a look when I get the time. Thanks!
Just to provide an update here: VFS works for gdal.Open
-able paths with this patch. Thanks again for putting this together ! :)