Move `filter.py` from xDEM to GeoUtils and integrate into `raster` class
Resolves #690.
Context
The current filter.py module in xDEM provides a collection of filters that are highly useful for digital elevation models (DEMs). However, these filtering utilities are not intrinsically tied to elevation data—they are generally applicable to any geospatial raster data. Given this, it makes more architectural sense for them to live in GeoUtils.
Changes
- Moved the `filter.py` module from `xdem/` to `geoutils/filters.py`.
- Removed `gaussian_filter_cv` to avoid a dependency on OpenCV.
- Moved `tests/test_filter.py` in xDEM to `tests/test_filters.py` in GeoUtils.
- Added a new function `_nan_safe_filter`, common to the Gaussian, mean and median filters, to handle arrays containing NaN values.
- Added a generic `_filter` function mapping the available filters.
- Added the `Raster.filter` method, which extracts the array with NaNs, applies the `_filter` function, and rebuilds the Raster data.
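The NaN-safe idea above could be sketched, for the mean case, with a normalized-convolution trick: filter the NaN-filled array and the validity mask separately, then divide. The function name and signature below are hypothetical illustrations, not the actual `_nan_safe_filter` implementation:

```python
import numpy as np
from scipy.ndimage import uniform_filter


def nan_safe_mean_filter(arr: np.ndarray, size: int = 3) -> np.ndarray:
    """NaN-aware mean filter via normalized convolution (hypothetical sketch)."""
    nan_mask = np.isnan(arr)
    # Replace NaNs by zero so they contribute nothing to the windowed sum
    filled = np.where(nan_mask, 0.0, arr)
    # Windowed mean of the filled values and of the validity mask
    value_mean = uniform_filter(filled, size=size)
    valid_frac = uniform_filter((~nan_mask).astype(float), size=size)
    # Dividing the two renormalizes by the number of valid pixels per window
    with np.errstate(invalid="ignore", divide="ignore"):
        out = value_mean / valid_frac
    # Windows with no valid pixel at all stay NaN
    out[valid_frac == 0] = np.nan
    return out
```

On arrays without NaNs, `valid_frac` is 1 everywhere and the result reduces to a plain `uniform_filter`; a similar decomposition works for the Gaussian case, while the median needs a per-window reduction instead.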
Tests
- Added new tests for the `Raster.filter` method.
Documentation
- Added a filter section in `raster_class.md`.
Note
Once this PR is approved, a linked PR removing the filters from xDEM will be opened.
I've been working on a very similar problem during the past week in https://github.com/GlacioHack/xdem/pull/486. While I still know the intricacies of the topic, here's how we could add Numba filters in GeoUtils to dramatically improve the speed of the median filter (and we could re-use the same code structure for NMAD, nanmax, and any other nanfunc):
```python
from typing import Literal

import numpy as np
from numba import jit, prange
from scipy.ndimage import generic_filter


@jit(parallel=True)
def median_filter_numba(dem: np.ndarray, window_size: int):
    # Get input shape
    N1, N2 = dem.shape
    # Define ranges to loop through given padding
    row_range = N1 - window_size + 1
    col_range = N2 - window_size + 1
    # Allocate output array
    outputs = np.full((row_range, col_range), fill_value=np.nan, dtype=np.float32)
    # Loop over every pixel concurrently by using prange
    for row in prange(row_range):
        for col in prange(col_range):
            outputs[row, col] = np.nanmedian(dem[row:row + window_size, col:col + window_size])
    return outputs


def median_filter(dem: np.ndarray, window_size: int, engine: Literal["scipy", "numba"]):
    # Using SciPy, options already include edge behaviour
    if engine == "scipy":
        return generic_filter(dem, np.nanmedian, size=window_size, mode="constant", cval=np.nan)
    # Using Numba, need to define edge behaviour ourselves
    elif engine == "numba":
        # We pad the input array with NaNs for half the window size
        hw = int((window_size - 1) / 2)
        dem = np.pad(dem, pad_width=((hw, hw), (hw, hw)), constant_values=np.nan)
        return median_filter_numba(dem, window_size)
```
```python
############
# TEST SPEED
############
from time import time

rnd = np.random.default_rng(42)
nb_it = 10  # Several iterations to be more precise measuring CPU time
window_size = 11
sizes = [50, 100, 500, 1000]  # Array sizes

# Call Numba once in the void to trigger JIT compile
_ = median_filter(np.ones((10, 10)), window_size=3, engine="numba")

list_df = []
for s in sizes:
    dem = rnd.normal(size=(s, s))
    # Get computation times
    t1 = time()
    for i in range(nb_it):
        filt1 = median_filter(dem, window_size=window_size, engine="scipy")
    t2 = time()
    for i in range(nb_it):
        filt2 = median_filter(dem, window_size=window_size, engine="numba")
    t3 = time()
    # Check outputs are equal
    assert np.allclose(filt1, filt2)
    # Print elapsed times
    print(f"Size: {s}")
    print(f"Elapsed SciPy: {t2 - t1}")
    print(f"Elapsed Numba: {t3 - t2}")
```
```
Size: 50
Elapsed SciPy: 0.24158883094787598
Elapsed Numba: 0.005011320114135742
Size: 100
Elapsed SciPy: 0.9490830898284912
Elapsed Numba: 0.012121915817260742
Size: 500
Elapsed SciPy: 23.553598880767822
Elapsed Numba: 0.3076012134552002
Size: 1000
Elapsed SciPy: 94.9611337184906
Elapsed Numba: 1.2290620803833008
```
Quite a striking improvement, factor of 100! :wink:
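On the point of re-using the same structure for other nanfuncs: on the SciPy side at least, a small dispatch table mapping filter names to their NaN-aware reductions would cover median, mean, max and friends with one code path. The names `_REDUCERS` and `apply_filter` below are hypothetical illustrations of that idea, not existing GeoUtils API:

```python
import numpy as np
from scipy.ndimage import generic_filter

# Hypothetical dispatch table: filter name -> NaN-aware per-window reduction
_REDUCERS = {
    "median": np.nanmedian,
    "mean": np.nanmean,
    "max": np.nanmax,
}


def apply_filter(arr: np.ndarray, method: str, window_size: int) -> np.ndarray:
    """Apply the named NaN-aware filter over sliding windows (sketch)."""
    if method not in _REDUCERS:
        raise ValueError(f"Unknown filter: {method}")
    # NaN padding at the edges so border windows simply see fewer valid pixels
    return generic_filter(arr, _REDUCERS[method], size=window_size,
                          mode="constant", cval=np.nan)
```

A Numba engine could plug into the same table by generating one jitted loop per reduction, since the window-sliding code is identical across them.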
Moved to #699