h3-py icon indicating copy to clipboard operation
h3-py copied to clipboard

Should the vectorized geo_to_h3 stay as unstable?

Open thomasaarholt opened this issue 4 years ago • 7 comments

#147 introduced vectorized support for geo_to_h3, which sped things up by about 50% over a python for loop. It's currently under unstable, which raises a warning. It seems that geo_to_h3 gives accurate results, and that its API won't retroactively break in the future. @ajfriend do you agree?

Can I suggest that the functionality gets moved to api instead of unstable? To me, the warning from unstable implied that it may not give correct results, which does not seem to be the case.

thomasaarholt avatar Nov 09 '21 11:11 thomasaarholt

Maybe we should update the warning, as unstable was meant to imply that the API will probably change in the future. (Happy to accept a PR on this, btw!)

Also, we're planning on a 4.0 release when the core library also moves to 4.0, so many of the function names will change anyway and break backwards compatibility. For 4.0, I would like to settle on how exactly we'd like to implement the vectorized functions. For example, it'd be nice if we could have them work like numpy functions and dispatch correctly to both vector and scalar arguments. (Also happy to accept PRs to play around in the unstable namespace to try out different approaches for this.)

ajfriend avatar Nov 09 '21 17:11 ajfriend

For 4.0, I would like to settle on how exactly we'd like to implement the vectorized functions.

I think 4.0 is coming up pretty soon 😄 . I'd love to get more of the H3 API to have a vectorized counterpart. My current thoughts on this are:

  • Add more .pxd files to prevent duplicating Cython wrapper code for vectorized functions. The existing approach for the few vectorized functions calls into h3lib directly. https://github.com/uber/h3-py/blob/034a41e6237a850dec7f984963d87603c9342810/src/h3/_cy/unstable_vect.pyx#L59-L60 This isn't horrible for simple functions like geo_to_h3, but it's not ideal for other functions that require more Cython wrapper code around h3lib. Adding .pxd definition files (that define headers for our "exported" cpdef functions) for our .pyx implementation files will allow us to import those .pyx files from other Cython files. (This would also be relevant for https://github.com/uber/h3-py/issues/152).

    So instead of calling h3lib directly, we'd call our own internal wrapper functions:

    rom .geo import geo_to_h3
    
    ..
    ith nogil:
       for i in range(len(lat)):
           out[i] = geo_to_h3(lat[i], lng[i])
    
  • Manually defining vectorized functions is unavoidable. Unlike the API functions file in Python, the vectorized functions must be defined in Cython. I assume function reuse won't be possible and these functions will just have to be copied to the vectorized API.

  • Focus on supporting scalar and 1-D numpy data. Pure numpy ufuncs can be used on data of any dimension, but I think the extra complexity of trying to support more than 1D input might not be worth it. I.e. h3_to_geo with 2D input would require 3D output.

  • Add vectorized functions where easiest, first. Functions with uniform input and output dimension are easiest to support. A vectorized h3_to_geo_boundary would be harder because I assume the number of tuples returned per h3 cell is not constant.

  • Don't shy away from depending on Numpy headers if necessary. If the vectorized Cython code would be a lot simpler to implement by importing Numpy Cython code, we should embrace that. (I'm not currently sure whether depending on Numpy would make things easier). There's surely a way to modify CMakeLists.txt to compile the vectorized code only if Numpy headers are available, no? And Numpy will easily be available on CI

kylebarron avatar Apr 11 '22 18:04 kylebarron

* Don't shy away from depending on Numpy headers if necessary. If the vectorized Cython code would be a lot simpler to implement by importing Numpy Cython code, we should embrace that. (I'm not currently sure whether depending on Numpy would make things easier). There's surely a way to modify `CMakeLists.txt` to compile the vectorized code only if Numpy headers are available, no? And Numpy will easily be available on CI

This last point has the potential issue of requiring anyone installing h3-py the "normal" way to have numpy installed (unless there's an "optional dependency" concept in Python, too?) It feels like it might break things for users in memory-constrained environments or that have some requirement to continue using an older numpy, etc.

Not sure how big of a problem that is, but one way to avoid it would be to have this repo publish two packages, one that's "pure" h3-py and another that includes the vectorization/numpy/etc stuff?

dfellis avatar Apr 11 '22 18:04 dfellis

This last point has the potential issue of requiring anyone installing h3-py the "normal" way to have numpy installed (unless there's an "optional dependency" concept in Python, too?)

I don't think so. Not sure if by "normal" you mean via wheels or from source.

  • From wheels: I'm pretty sure that a wheel compiled against Numpy headers would not require Numpy at runtime if that file isn't imported. It would be essentially the same as our current numpy_int API where Numpy is an optional dependency, and so Numpy is only required if you import h3.api.numpy_int.
  • From source: I'm less familiar with scikit-build, but in a "normal" Cython setup it would be possible to only compile the Numpy Cython files if Numpy headers can be found at compile time. I.e. in my previous experience, in the setup.py file, you would import numpy and find the path to numpy's headers, then pass a list of Cython modules to ext_modules. It wouldn't be hard to simply remove a file from what's passed to ext_modules if numpy couldn't be imported. I assume scikit-build has similar functionality.

kylebarron avatar Apr 11 '22 18:04 kylebarron

  • I'm less familiar with scikit-build

From a cursory glance at the scikit-build docs it looks like it would be simple to update this CMakeLists.txt to conditionally add a Cython module only if Numpy was found at build time:

find_package(NumPy MODULE)

...

if (NumPy_FOUND)
    add_cython_file(unstable_vect)
endif()

kylebarron avatar Apr 11 '22 18:04 kylebarron

Got it, though Python being even less strict on optional dependencies than Node is surprising.

dfellis avatar Apr 11 '22 19:04 dfellis

it looks like the v4.0.0b2 pre release has removed unstable.vect.geo_to_h3. the scalar version of the function has been renamed from geo_to_h3 to latlng_to_cell. unclear if there are vectorised versions anywhere, or if this is outside v4.0 scope

rfc-hv avatar Nov 29 '22 23:11 rfc-hv