effex
Elementwise Kernel optimizations.
Hi @evanmayer, do you have a way to run your code without hardware? I've found a few places to make improvements. If not, I've added ideas below if you ever want to try.
- Input to the channelizer is read-only, so you might not need to initialize `x` to zeros:
```python
x = cp.empty(len(x) + len(x) % n_branches, dtype=np.complex128) + x  # Line 39
```
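Concretely, the zero fill can be skipped by allocating with `empty` and initializing only the padding. A minimal sketch using NumPy as a CPU stand-in (the CuPy calls mirror it); `n_branches` and the input length here are illustrative values, not taken from the repo:

```python
import numpy as np

n_branches = 4
x = np.arange(10, dtype=np.complex128)

# Round the length up to a multiple of n_branches.
padded_len = len(x) + (-len(x)) % n_branches

# empty() skips the zero fill that zeros() pays for; only the tail
# padding actually needs initializing, since the input is copied in.
padded = np.empty(padded_len, dtype=np.complex128)
padded[:len(x)] = x   # copy the read-only input
padded[len(x):] = 0   # zero only the padding
```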
- Move to an elementwise kernel to reduce data transfer and launch overhead:
```python
xcorr_array = psd_0 * (cp.conj(psd_1) * cp.exp(2j * cp.pi * freqs * (total_lag / rate)))  # Line 106
```
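As a CPU-side oracle for those unit tests, the expression to be fused can be captured in one NumPy function. A sketch; the function name and sample values are mine, not from the repo:

```python
import numpy as np

def xcorr_array_ref(psd_0, psd_1, freqs, total_lag, rate):
    """Reference for the fused kernel: psd_0 * conj(psd_1) * phase ramp."""
    return psd_0 * np.conj(psd_1) * np.exp(2j * np.pi * freqs * (total_lag / rate))

# Smoke check against the step-by-step version the kernel would replace.
rng = np.random.default_rng(0)
psd_0 = rng.normal(size=8) + 1j * rng.normal(size=8)
psd_1 = rng.normal(size=8) + 1j * rng.normal(size=8)
freqs = np.fft.fftfreq(8)
fused = xcorr_array_ref(psd_0, psd_1, freqs, total_lag=3.0, rate=2.0e6)
unfused = psd_0 * (np.conj(psd_1) * np.exp(2j * np.pi * freqs * (3.0 / 2.0e6)))
```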
- Move to an elementwise kernel to reduce data transfer and launch overhead:
```python
xcorr *= cp.exp(2j * cp.pi * freqs * (integer_lag / rate))  # Line 185

# Prepare to fit residual phase gradient:
phases = cp.angle(xcorr)  # Line 187

# Due to receiver bandpass shape, edge frequencies have less power => less certain phase.
# Assign weights accordingly.
weights = cp.abs(xcorr)  # Line 190
```
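Likewise, the rotate-then-extract steps can be sketched as one CPU-side function for testing (NumPy stand-in; the function name and sample values are illustrative, not from the repo):

```python
import numpy as np

def rotate_and_weight(xcorr, freqs, integer_lag, rate):
    """Rotate by the integer-lag phase ramp, then pull phases and weights."""
    rotated = xcorr * np.exp(2j * np.pi * freqs * (integer_lag / rate))
    phases = np.angle(rotated)   # residual phase gradient to fit
    weights = np.abs(rotated)    # low-power band edges => low weight
    return rotated, phases, weights

# Toy inputs: a slow phase slope across 16 frequency bins.
xcorr = np.exp(1j * np.linspace(-np.pi / 4, np.pi / 4, 16))
freqs = np.fft.fftfreq(16, d=1.0 / 16)
rotated, phases, weights = rotate_and_weight(xcorr, freqs, integer_lag=2.0, rate=16.0)
```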
Thanks! I'll write some unit tests and use them to quantify the speedup.
Just sat down to look at this and realized I don't yet know how to do everything here. I've incorporated the first suggestion and will push that, at least. I assume "move to elementwise kernel" refers to writing something like this, from the CuPy docs, for the operations I'm doing?
Exactly! For good examples, compare the following:
- https://github.com/rapidsai/cusignal/blob/branch-0.19/python/cusignal/windows/windows.py
- https://github.com/scipy/scipy/blob/a376262/scipy/signal/windows/windows.py
It could look something like this:
```python
xcorr_array_kernel = cp.ElementwiseKernel(
    "complex128 psd_0, complex128 psd_1, float64 freqs, float64 total_lag, float64 rate",
    "complex128 output",
    """
    thrust::complex<double> conj_psd_1 { thrust::conj( psd_1 ) };
    thrust::complex<double> temp { exp(
        thrust::complex<double>( 0, 2 * M_PI * freqs * div ) ) };
    output = psd_0 * conj_psd_1 * temp;
    """,
    "xcorr_array_kernel",
    options=("-std=c++11",),
    loop_prep="const double div { total_lag / rate };",
)
```
This assumes `freqs` is a float, not an array.