micropython-ulab
micropython-ulab copied to clipboard
[FEATURE REQUEST] Add DCT
Add DCT-2 (Type-II Discrete Cosine Transform) functionality to ulab, similar to SciPy's implementation. DCT is widely used in image and audio processing applications.
Wikipedia: https://en.wikipedia.org/wiki/Discrete_cosine_transform SciPy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.fftpack.dct.html
Technical Details
- Implementation can be based on FFT using one of four established methods (detailed in this DSP StackExchange post)
- I've benchmarked all four methods on a Raspberry Pi Pico using the 2025-01-01 micropython-builder build
- Benchmark results and the code are at the end of this post
Benefits
- Essential transform for signal processing applications
- Minimal code footprint when implemented using existing FFT
- Potential for performance improvement with C implementation (as opposed to users writing their own DCT)
- Aligns with ulab's goal of providing NumPy/SciPy-like functionality
Benchmark
DCT_2N_Mis the mirrored versionDCT_2N_Pis the padded version.- Exprange is the time taken for
np.exp(...)(refer to code below) , which can be memoized.
| N | DCT_4N | DCT_2N_M | DCT_2N_P | DCT_N | FFT | EXPRANGE |
|---|---|---|---|---|---|---|
| 4 | 1.455 | 1.976 | 2.026 | 2.076 | 0.462 | 0.992 |
| 8 | 2.136 | 2.541 | 2.568 | 2.476 | 0.678 | 1.199 |
| 16 | 3.618 | 3.691 | 3.741 | 3.293 | 1.096 | 1.654 |
| 32 | 6.819 | 6.125 | 6.244 | 4.923 | 1.735 | 2.532 |
| 64 | 13.798 | 11.278 | 11.448 | 8.36 | 3.23 | 4.312 |
| 128 | 28.994 | 22.136 | 22.56 | 15.504 | 6.481 | 7.843 |
| 256 | 61.7 | 45.2 | 47.0 | 30.6 | 13.2 | 14.6 |
| 512 | 133.5 | 96.0 | 98.5 | 62.8 | 28.6 | 29.5 |
| 1024 | 284.9 | 201.0 | 205.3 | 132.6 | 62.1 | 59.1 |
| 2048 | 611.8 | 416.7 | 425.3 | 268.6 | 134.5 | 117.2 |
I would recommend DCT_N (DCT_FFTN) in code with the L = 2 * np.exp(...) memoized for values of N for the last few calls. FFT compute and L = 2 * np.exp(...) cache memory occupied are the least for it.
If there is interest and drive to implement this feature, I will help find FFT implementations for other forms of DCT. I would implement this and create a PR, but wanted to check the waters first, is there interest in the community for this?
Code
from ulab import numpy as np
from time import ticks_ms, ticks_diff
def timer(func):
def wrapper(*args, **kwargs):
start = ticks_ms()
for i in range(10):
result = func(*args, **kwargs)
end = ticks_ms()
print(f"[] {func.__name__} took {ticks_diff(end, start)/10} milliseconds")
return result
return wrapper
@timer
def FFT(x):
return np.fft.fft(x)
@timer
def DCT_FFT4N(x):
N = x.shape[-1]
u = np.zeros(4 * N)
u[1:2*N:2] = x
u[2*N+1::2] = x[::-1]
U = np.fft.fft(u)[:N]
return U.real
@timer
def DCT_FFT2NM(x):
N = x.shape[-1]
y = np.zeros(2*N)
y[:N] = x
y[N:] = x[::-1]
Y = np.fft.fft(y)[:N]
L = np.exp(-1j*np.pi*np.arange(N)/(2*N))
return (Y.real * L.real) - (Y.imag * L.imag)
@timer
def DCT_FFT2NP(x):
N = x.shape[-1]
y = np.zeros(2*N)
y[:N] = x
Y = np.fft.fft(y)[:N]
L = 2 * np.exp(-1j*np.pi*np.arange(N)/(2*N))
return (Y.real * L.real) - (Y.imag * L.imag)
@timer
def DCT_FFTN(x):
N = x.shape[-1]
v = np.zeros(x.shape, dtype=x.dtype)
v[:(N-1)//2+1] = x[::2]
if N % 2: # odd length
v[(N-1)//2+1:] = x[-2::-2]
else: # even length
v[(N-1)//2+1:] = x[::-2]
V = np.fft.fft(v)
L = 2 * np.exp(-1j*np.pi*np.arange(N)/(2*N))
return (V.real * L.real) - (V.imag * L.imag)
@timer
def EXPRANGE(N):
L = 2 * np.exp(-1j*np.pi*np.arange(N)/(2*N))
return L
for i in range(2, 12):
P = 2**i
x = np.arange(P)
print(f"\nP = {P}")
y = FFT(x)
y = DCT_FFT4N(x)
y = DCT_FFT2NM(x)
y = DCT_FFT2NP(x)
y = DCT_FFTN(x)
z = EXPRANGE(P)
People tend to run the same sized DCT many times in their code, so we can alternatively do
def mkDCT(N):
L_N = 2 * np.exp(-1j*np.pi*np.arange(N)/(2*N))
def DCT_N(x):
y = np.zeros(x.shape)
y[:(N-1)//2+1] = x[::2]
if N % 2: # odd length
y[(N-1)//2+1:] = x[-2::-2]
else: # even length
y[(N-1)//2+1:] = x[::-2]
Y = np.fft.fft(y)
return (Y.real * L_N.real) - (Y.imag * L_N.imag)
return DCT_N
So we can have memoization, and also have it garbage collected when it's not needed. This breaks correspondence with the scipy API though.
If compatibility with scipy is a concern, you should implement this as part of the utils module. I would be a bit cautious with the benchmarks above, for the simple reason that all your functions generate a lot of one-time-use artifacts, thus, I believe that the real gain is actually larger. You can look at the implementation of https://micropython-ulab.readthedocs.io/en/latest/ulab-utils.html#spectrogram, where you can supply pre-allocated RAM to the method.
Having said that, if you think that there is interest in this function, I would be happy to incorporate it here. If you want to boost the visibility of the issue, you can also post in https://github.com/v923z/micropython-ulab/discussions.
It would be very nice if this can be even faster! I've been looking at the source and documentation spectrogram and FFT, and figuring out a way to not create the y variable, but haven't figured it out yet. However, y can be shared across DCT calls.
However, it seems most of the time is spent in FFT and in generating the L_N with the exponent. Gain from removing the y variable might be less than 5%. But I would love to be proven wrong here! Even if the improvement is small, I love seeing clever code!
~Do you have any comment on why FFT is taking so long? Pico has hardware floating point support, so I expected it would be much faster than ulab_samples benchmarks.~ NVM, the Pico RP2040 doesn't have an FPU.
| OP | Time |
|---|---|
| FFT | 59.3 ms |
| Exponent generation | 58.2 ms |
| Complex Multiplication | 8.6 ms |
| zero allocation | 0.1 ms |
| rest of the code | 2.7 ms |
def mkDCT(N):
L_N = 2 * np.exp(-1j*np.pi*np.arange(N)/(2*N))
y = np.zeros(N)
@timer
def DCT_N(x):
y[:(N-1)//2+1] = x[::2]
if N % 2: # odd length
y[(N-1)//2+1:] = x[-2::-2]
else: # even length
y[(N-1)//2+1:] = x[::-2]
Y = np.fft.fft(y)
return (Y.real * L_N.real) - (Y.imag * L_N.imag)
return DCT_N
It would be very nice if this can be even faster! I've been looking at the source and documentation spectrogram and FFT, and figuring out a way to not create the
yvariable, but haven't figured it out yet. However,ycan be shared across DCT calls.
I'm not quite sure I follow: in the FFT, you return a new array, because you want to leave x alone. So, can't you just do the assignment in C like here?
https://github.com/v923z/micropython-ulab/blob/73ed8cc11f57c5c7605d97eaf5303687a2d21106/code/utils/utils.c#L339-L350
However, it seems most of the time is spent in FFT and in generating the
L_Nwith the exponent.
On a microcrontoller, that's not the most efficient code: you calculate the exponential of a purely imaginary number, and then you also take the real and imaginary parts at the very end of your function. The same can be achieved by taking the sin and cos of the same array. You can save a lot by that. Standard math functions accept the out keyword argument, which means that you can also re-use RAM, if you calculate Y.real * L_N.real - Y.imag * L_N.iimag in two separate calls.
Also note that the instruction L_N = 2 * np.exp(-1j*np.pi*np.arange(N)/(2*N)) creates
- an integer array (
np.arange(N)), - then a
float(np.arange(N)/(2*N)), - then another
float(np.pi * np.arange(N)/(2*N)), - then a
complex(-1j*np.pi*np.arange(N)/(2*N)), - then another
compex(np.exp(-1j*np.pi*np.arange(N)/(2*N)), - and finally another
complex(2 * np.exp(-1j*np.pi*np.arange(N)/(2*N))
And after all this, you then take L_N.real and L_N.imag, which then generate two real arrays.
This is really a lot of hassle, when you actually want something like this:
t = np.linspace(0, 0.5*np.pi, num=N)
...
Y.real *= np.cos(t)
Y.imag *= np.sin(t)
Y.real -= Y.imag
Y.real *= 2
return Y.real
If you really want to, you can save one memory allocation by doing this:
t = np.linspace(0, 0.5*np.pi, num=N)
...
scratchpad = np.zeros(N)
np.cos(t, out=scratchpad)
Y.real *= scratchpad
np.sint, out=scratchpad)
Y.imag *= scratchpad
Y.real -= Y.imag
Y.real *= 2
return Y.real
I don't know where the most time is spent, but since the operations are all in-place, your RAM won't be fragmented. I expect this to be very processor-specific, thus, hard numbers won't probably mean too much.
I thought I knew how to write performant code and, reading this reply, I'm just amazed. I've never seriously written for micro-controllers, but now I wish I had. I'm going to try out these ideas first thing in the morning! Thank you so much!
As I said, gains are specific to the hardware, even to the compiler settings, so you might have to experiment. But here are a couple of comments: https://micropython-ulab.readthedocs.io/en/latest/ulab-tricks.html, and I would also highly recommend Damien George's talk at pyconau: https://www.youtube.com/watch?v=hHec4qL00x0.
Part of the problem here is that the word efficient is ill-defined. Do you mean that the code should run fast? Or that it should consume little RAM? Or it could consume RAM, but it shouldn't call the garbage collector all the time? Or it should consume little flash? Or should the call overhead be minimal? These are not trivial questions, and your answer will depend on many factors, on the application, on the hardware, on the acceptable complexity of the firmware etc.