exactextract icon indicating copy to clipboard operation
exactextract copied to clipboard

Add Python Bindings

Open jdalrym2 opened this issue 2 years ago • 8 comments

Hi there! This is a wonderful project and wanted to take the time to create a Python wrapper (#7) to use in my project. I additionally included a set of unit tests and documentation for the wrapper. Let me know if you'd like to get this merged. Thanks!

Jon

jdalrym2 avatar Jul 05 '22 01:07 jdalrym2

Thank you for this! It'll take me a bit of time to go through this, but overall I'm much in favor. There has been quite a bit of demand for Python bindings.

cc @rsignell-usgs

dbaston avatar Jul 05 '22 14:07 dbaston

Very helpful feedback @hellkite500 ! I'll do some revisions and push the changes soon.

jdalrym2 avatar Jul 09 '22 06:07 jdalrym2

Some of the questions on my mind are:

  • Is this the right level of detail to expose, or should we prefer a higher-level interface such as the one exposed in exactextractr?
  • Do classes such as Box, Grid, and Operation have sufficient value to Python users that it is worth exposing them, thereby imposing a stability requirement on their interface?
  • Should we deal with types such as gdal.Dataset and gdal.Bandrather than exporting classes such as GDALRasterWrapper to Python? Similarly, would producing a pandas data frame be preferable to writing a CSV?

dbaston avatar Jul 09 '22 23:07 dbaston

Are you suggesting expoing a single function interface which takes some "standard" python representation of inputs and outputs a single dataframe (or possibly a geodataframe?), similar to this one in exactextractr

  • Do classes such as Box, Grid, and Operation have sufficient value to Python users that it is worth exposing them, thereby imposing a stability requirement on their interface?

I think some of these definitely do, especially the Grid.. The Operation class, could be argued that exposing that to Python may not be required, as most Operations could be proxyed to use the user behind some convenience functions/scripts.

That being said, these are part of the public library API, so are you suggesting that libexactextract isn't intended to be stable? Also, we can pin the python version to the library version, and it would just follow the semantic versioning of the lib. If the public API of the C++ classes change, that changes the libexactextract semantic version, forcing a change in the python version (and possible the binding code supporting it). As far as ensuring the python bindings stay up-to-date as the public API of the lib evolves, it requires testing the bindings and updating them as the API they bind changes. Is that level of maintenance a concern here in this repository? We could do that via a separate repository and pin an exactextract git submodule for the stable version of the lib we are binding.

  • Should we deal with types such as gdal.Dataset and gdal.Bandrather than exporting classes such as GDALRasterWrapper to Python? Similarly, would producing a pandas data frame be preferable to writing a CSV?

I would be interested in looking at fiona, geopandas, and python gdal practices and see if working with something a little more "native" would make sense and be practical.

hellkite500 avatar Jul 12 '22 21:07 hellkite500

Are you suggesting expoing a single function interface which takes some "standard" python representation of inputs and outputs a single dataframe (or possibly a geodataframe?), similar to this one in exactextractr

My default implementation design would be a Python function that largely mimics the exact_extract function in R. I think this has been fairly successful in R, so I wouldn't change it without a reason to do so. The implementation of that Python function may use what I consider to be low-level classes (Grid, Operation, etc.) but I would not expect a Python end user to work with these directly. It might make sense to move some of the logic in the R package into this repository; for example, code supporting the re-use of calculated coverage fractions for multiple raster inputs.

are you suggesting that libexactextract isn't intended to be stable?

It has no versioned releases, so I don't think there's a particular stability contract.

dbaston avatar Jul 13 '22 19:07 dbaston

It has no versioned releases, so I don't think there's a particular stability contract.

Fair enough, I never really thought about it that way, haha.

Thanks for indulging this discussion, as it helps clear up the various use cases and the best way to serve them. I'm definitely in a weird place thinking through this having spent some time really trying to understand the backend (exactextract) and the the various ways to use it, especially WRT coverage fraction and its extraction/reuse. Would be happy to pick up that discussion on a separate thread.

I'm still kind of curious if there are use cases to take advantage of the Grid abstractions outside the context of just raster stats. @jdalrym2 as the OP of this PR, did you have some intended use case outside what you walked through in the documentation sections of this PR? I think you did essentially what @dbaston was proposing in exposing a python interface to mimic the CLI interface, but I think the proposition is to move that implementation more into a standard C++ function/class and use that to serve to the bindings to both R and Python.

hellkite500 avatar Jul 13 '22 19:07 hellkite500

Sorry I'm late to the conversation.

Is this the right level of detail to expose, or should we prefer a higher-level interface such as the one exposed in exactextractr?

Admittedly my process of binding this was just writing out the bindings for classes until I figured out how to make a workflow that worked. I didn't know enough of the library to make high level decisions like this. I actually didn't know exactextractr existed until I was almost done. I will look into your interface there and form an opinion. It makes a lot of sense to mirror this from a maintainability perspective. My only holdup would be if there are Python antipatterns in your interface there. I'll return to this.

Do classes such as Box, Grid, and Operation have sufficient value to Python users that it is worth exposing them, thereby imposing a stability requirement on their interface?

Operation is needed for the way I'm binding things now (but this can change), but I'd actually agree Box and Grid might be too low-level for this interface (@hellkite500 I'm curious if you have a proposed Python use-case for Grid). I bound those classes first and realized toward the end that they weren't needed for the final interface / workflow.

Should we deal with types such as gdal.Dataset and gdal.Band rather than exporting classes such as GDALRasterWrapper to Python? Similarly, would producing a pandas data frame be preferable to writing a CSV?

I deal with those types with my adapter classes (in the pure Python portion). Handling them directly without an adapter and without exposing GDALRasterWrapper, etc, requires passing off memory addresses through the pybind interface (at least that's the only way I've been able to make it work), and it ended up being too hacky for comfort. Exposing GDALRasterWrapper and writing Python code to help the user make these seemed like the best bet.

Further down the discussion:

It has no versioned releases, so I don't think there's a particular stability contract.

I'm curious if you'd like to start now. Having specific versioned releases could make this a good candidate for hosting on PyPI. Another option would be to move all of this to a separate repo and keeping the main repo as a submodule. The separate repo would have versioned releases tied to specific pinned commit hashes in this repo.

@jdalrym2 as the OP of this PR, did you have some intended use case outside what you walked through in the documentation sections of this PR?

No I didn't. I just started binding things until I got a solid workflow going. Admittedly little forethought. I actually considered removing the bindings that aren't directly in the workflow before submitting this PR, but decided against it pending feedback as to not erase work.

... but I think the proposition is to move that implementation more into a standard C++ function/class and use that to serve to the bindings to both R and Python.

I think this is the ideal move, but I'm definitely hesitant based on how much work this might involve. Perhaps you or @dbaston are rockstar C++ developers that could knock this out in an hour, but for me that'd be much, much longer. ;)

jdalrym2 avatar Jul 14 '22 01:07 jdalrym2

I'll return to this.

That was actually less to review than I thought. I tend not to be a fan of "all in one go" function calls, in-case there is setup overhead and the user would like to reuse existing classes or state. Idk if that really applies here, but I tend to lean toward a "setup first, then go second" sort of design pattern as opposed to massive one-liners. I don't really think there's a right answer here, and honestly I like the idea of doing both: a high level interface that mirrors exactextractr closely that calls a somewhat lower-level, but still exposed interface in case the user desires more granularity.

jdalrym2 avatar Jul 14 '22 01:07 jdalrym2

@jdalrym2 I've recently tested the python bindings with a multi-band .tif file, and the processor module does not appear to be correctly processing multiple raster bands within the .tif file together properly when specified. Essentially, when a user initializes multiple rasters and operations for two separate bands within a .tif file, they are not properly processed within the "FeatureSequentialProcessor" step as you call both operations (e.g. FeatureSequentialProcessor(dsw, writer, [op1,op2]) for two separate raster bands. The processor output essentially only yields the first raster band output for both operations. However, when each raster operation is processed separately (op1, op2), they both produce the correct output values (when compared to results with original C+ ExactExtract executable). My only guess to this initial problem is that the band index isn't being properly processed with the "processor.py" module when multiple raster operations are ingested, but I am trying to narrow down potential coding problem now. I just wanted to loop you into this problem and see if you had any suggestions as to what may going on here. I would mention that the functionality for multiple rasters within a netcdf file does work properly for the ExactExtract python bindings, so this issue is isolated only for .tif files currently.

jduckerOWP avatar Nov 02 '22 19:11 jduckerOWP

@jduckerOWP Thanks for the info and for looking into this! FWIW, I might not get to take a look for a couple days, but will follow up if I find anything.

Overall: I agree it's not behaving as expected and I don't believe I have tested my bindings on anything other than a single band so far.

jdalrym2 avatar Nov 02 '22 21:11 jdalrym2

UPDATE (@jduckerOWP ) I was able to write a unit test to replicate your issue and fix it. Please see commit e278076 for more details.

Interestingly, the bug seemed to be in the exactextract code itself (@dbaston pls look), where Operation objects were keyed in the StatsRegistry by the name of their parent RasterSource (i.e. GDALRasterWrapper). However, in the case of TIF files, the band number did not exist in the name, and so keys clashed for multiple operations that dealt with the same TIF. The solution was simply to append "|[bandnum]" to the name of the parent GDALRasterWrapper object on init.

It is a mystery to me why exactextract CLI doesn't run into this issue. I did not dig into that, however.

jdalrym2 avatar Nov 06 '22 03:11 jdalrym2

@jdalrym2 Thank you for looking into this issue further! I've been stumped as to narrowing down the exact problem due to the fact that the exactextract CLI did not reproduce the same issue (which led me to believe that it had to initially had to be a bug in the python bindings, although nothing stood out as an issue through a debugger). The details you've described here completely make sense to me now with the underlying issue within the exactextract code. I'm glad it wasn't a significant bug in the exactextract code, but rather the way the operation objects were created with the proper register keys for each band of the TIF file. I have made the adjustments to the gdal_raster_wrapper.cpp file and have confirmed from my end as well that this code does indeed resolve the issues with specifying band numbers (tested TIF file with 8 different bands, all came out correctly) for a TIF file. I appreciate the time and support resolving this issue quickly with the python bindings functionality for TIF files. @hellkite500 is working on revising the ExactExtract branch that includes coverage fraction weights file output to be included as part of the master branch functionality for ExactExtract. At that point, we can reproduce the python bindings as well for this functionality in the future. Thanks again your collaboration efforts with this project!

jduckerOWP avatar Nov 07 '22 17:11 jduckerOWP

$ git clone -b add-pybindings-jd --single-branch https://github.com/jdalrym2/exactextract.git
...
$ cmake -DBUILD_CLI:=OFF -DBUILD_DOC:=OFF -DBUILD_PYTHON:=ON -DCMAKE_BUILD_TYPE=Release ..
$ make
...
[100%] Linking CXX shared module _exactextract.cpython-36m-x86_64-linux-gnu.so
lto-wrapper: warning: using serial compilation of 9 LTRANS jobs
lto-wrapper: note: see the ‘-flto’ option documentation for more information
[100%] Built target _exactextract
61 81.3s exactextract:/cmake-build-release (add-pybindings-jd|✔) $ make install
make: *** No rule to make target 'install'.  Stop.

It looks like I end up with two files:

  • libexactextract.a
  • _exactextract.cpython-36m-x86_64-linux-gnu.so

Where do I copy them to install it? I'm using a multi-stage docker setup with osgeo/gdal:ubuntu-small as builder and final base image

I'm guessing:

  • /usr/local or /usr/include/ ?
  • can I copy _exactextract.cpython-36m-x86_64-linux-gnu.so directly into a virtual_env? or does it go somewhere else ?

chapmanjacobd avatar Dec 06 '22 07:12 chapmanjacobd

@chapmanjacobd Please try the instructions in the README.md in the python/ directory (using python3 bdist). That should resolve the placement of files automatically within your environment.

If that doesn't work feel free to follow up.

jdalrym2 avatar Dec 06 '22 07:12 jdalrym2

Ohh... the python bindings don't support Virtual File Systems?

def get_ds_path(ds: Union[gdal.Dataset, ogr.DataSource]):
    """
    Get the file path of an OSGeo dataset

    Args:
        ds (Union[gdal.Dataset, ogr.DataSource]): Input OSGeo dataset

    Returns:
        str: File path to dataset
    """
    if isinstance(ds, gdal.Dataset):
        return ds.GetFileList()[0]     # type: ignore
    else:
        return ds.GetName()

https://gdal.org/doxygen/classGDALDataset.html#aefe21eb2d1c87304a14f81ee2e6d9717

I am extremely grateful that these python bindings exist but also I'm a little surprised by this. If I pass in a dataset and GDAL is doing fine, does it not make sense to pass that data through rather than try to re-open it?

chapmanjacobd avatar Dec 06 '22 17:12 chapmanjacobd

@chapmanjacobd Direct passthrough of the GDALDataset object isn't possible due to how the different Python bindings work (I tried, it got hacky really quick). So a path must be extracted and then passed through with a new GDALDataset object created on the C++ side. I'd expect a virtual path would work as well... just haven't tested or implemented it. These bindings are pretty new, and were narrowly scoped for what I needed to do.

Feel free to implement it if you so wish and PR to my branch and/or write an issue and I can go take a look when I get the time. Thanks!

jdalrym2 avatar Dec 06 '22 18:12 jdalrym2

Just to provide an update here: VFS works for gdal.Open-able paths with this patch. Thanks again for putting this together ! :)

chapmanjacobd avatar Dec 09 '22 19:12 chapmanjacobd