micropython-ulab [FEATURE REQUEST] Add DCT

Add DCT-2 (Type-II Discrete Cosine Transform) functionality to ulab, similar to SciPy's implementation. DCT is widely used in image and audio processing applications.

Wikipedia: https://en.wikipedia.org/wiki/Discrete_cosine_transform SciPy: https://docs.scipy.org/doc/scipy/reference/generated/scipy.fftpack.dct.html

Technical Details

Implementation can be based on FFT using one of four established methods (detailed in this DSP StackExchange post)
I've benchmarked all four methods on a Raspberry Pi Pico using the 2025-01-01 micropython-builder build
Benchmark results and the code are at the end of this post

Benefits

Essential transform for signal processing applications
Minimal code footprint when implemented using existing FFT
Potential for performance improvement with C implementation (as opposed to users writing their own DCT)
Aligns with ulab's goal of providing NumPy/SciPy-like functionality

Benchmark

DCT_2N_M is the mirrored version
DCT_2N_P is the padded version.
Exprange is the time taken for np.exp(...) (refer to code below) , which can be memoized.

N	DCT_4N	DCT_2N_M	DCT_2N_P	DCT_N	FFT	EXPRANGE
4	1.455	1.976	2.026	2.076	0.462	0.992
8	2.136	2.541	2.568	2.476	0.678	1.199
16	3.618	3.691	3.741	3.293	1.096	1.654
32	6.819	6.125	6.244	4.923	1.735	2.532
64	13.798	11.278	11.448	8.36	3.23	4.312
128	28.994	22.136	22.56	15.504	6.481	7.843
256	61.7	45.2	47.0	30.6	13.2	14.6
512	133.5	96.0	98.5	62.8	28.6	29.5
1024	284.9	201.0	205.3	132.6	62.1	59.1
2048	611.8	416.7	425.3	268.6	134.5	117.2

I would recommend DCT_N (DCT_FFTN) in code with the L = 2 * np.exp(...) memoized for values of N for the last few calls. FFT compute and L = 2 * np.exp(...) cache memory occupied are the least for it.

If there is interest and drive to implement this feature, I will help find FFT implementations for other forms of DCT. I would implement this and create a PR, but wanted to check the waters first, is there interest in the community for this?

Code

from ulab import numpy as np
from time import ticks_ms, ticks_diff


def timer(func):
    def wrapper(*args, **kwargs):
        start = ticks_ms()
        for i in range(10):
            result = func(*args, **kwargs)
        end = ticks_ms()
        print(f"[] {func.__name__} took {ticks_diff(end, start)/10} milliseconds")
        return result
    return wrapper


@timer
def FFT(x):
    return np.fft.fft(x)


@timer
def DCT_FFT4N(x):
    N = x.shape[-1]

    u = np.zeros(4 * N)
    u[1:2*N:2] = x
    u[2*N+1::2] = x[::-1]

    U = np.fft.fft(u)[:N]
    return U.real


@timer
def DCT_FFT2NM(x):
    N = x.shape[-1]

    y = np.zeros(2*N)
    y[:N] = x
    y[N:] = x[::-1]

    Y = np.fft.fft(y)[:N]

    L = np.exp(-1j*np.pi*np.arange(N)/(2*N))

    return (Y.real * L.real) - (Y.imag * L.imag)


@timer
def DCT_FFT2NP(x):
    N = x.shape[-1]

    y = np.zeros(2*N)
    y[:N] = x

    Y = np.fft.fft(y)[:N]

    L = 2 * np.exp(-1j*np.pi*np.arange(N)/(2*N))

    return (Y.real * L.real) - (Y.imag * L.imag)


@timer
def DCT_FFTN(x):
    N = x.shape[-1]

    v = np.zeros(x.shape, dtype=x.dtype)
    v[:(N-1)//2+1] = x[::2]

    if N % 2: # odd length
        v[(N-1)//2+1:] = x[-2::-2]
    else: # even length
        v[(N-1)//2+1:] = x[::-2]

    V = np.fft.fft(v)

    L = 2 * np.exp(-1j*np.pi*np.arange(N)/(2*N))

    return (V.real * L.real) - (V.imag * L.imag)


@timer
def EXPRANGE(N):
    L = 2 * np.exp(-1j*np.pi*np.arange(N)/(2*N))
    return L


for i in range(2, 12):

    P = 2**i

    x = np.arange(P)

    print(f"\nP = {P}")
    y = FFT(x)
    y = DCT_FFT4N(x)
    y = DCT_FFT2NM(x)
    y = DCT_FFT2NP(x)
    y = DCT_FFTN(x)

    z = EXPRANGE(P)

Jan 03 '25 09:01 zeroby0

People tend to run the same sized DCT many times in their code, so we can alternatively do

def mkDCT(N):

    L_N = 2 * np.exp(-1j*np.pi*np.arange(N)/(2*N))
    def DCT_N(x):
        y = np.zeros(x.shape)

        y[:(N-1)//2+1] = x[::2]

        if N % 2: # odd length
            y[(N-1)//2+1:] = x[-2::-2]
        else: # even length
            y[(N-1)//2+1:] = x[::-2]

        Y = np.fft.fft(y)

        return (Y.real * L_N.real) - (Y.imag * L_N.imag)

    return DCT_N

So we can have memoization, and also have it garbage collected when it's not needed. This breaks correspondence with the scipy API though.

Jan 03 '25 09:01 zeroby0

If compatibility with scipy is a concern, you should implement this as part of the utils module. I would be a bit cautious with the benchmarks above, for the simple reason that all your functions generate a lot of one-time-use artifacts, thus, I believe that the real gain is actually larger. You can look at the implementation of https://micropython-ulab.readthedocs.io/en/latest/ulab-utils.html#spectrogram, where you can supply pre-allocated RAM to the method.

Having said that, if you think that there is interest in this function, I would be happy to incorporate it here. If you want to boost the visibility of the issue, you can also post in https://github.com/v923z/micropython-ulab/discussions.

Jan 03 '25 18:01 v923z

It would be very nice if this can be even faster! I've been looking at the source and documentation spectrogram and FFT, and figuring out a way to not create the y variable, but haven't figured it out yet. However, y can be shared across DCT calls.

However, it seems most of the time is spent in FFT and in generating the L_N with the exponent. Gain from removing the y variable might be less than 5%. But I would love to be proven wrong here! Even if the improvement is small, I love seeing clever code!

~Do you have any comment on why FFT is taking so long? Pico has hardware floating point support, so I expected it would be much faster than ulab_samples benchmarks.~ NVM, the Pico RP2040 doesn't have an FPU.

OP	Time
FFT	59.3 ms
Exponent generation	58.2 ms
Complex Multiplication	8.6 ms
zero allocation	0.1 ms
rest of the code	2.7 ms

def mkDCT(N):

    L_N = 2 * np.exp(-1j*np.pi*np.arange(N)/(2*N))
    y   = np.zeros(N)

    @timer
    def DCT_N(x):

        y[:(N-1)//2+1] = x[::2]

        if N % 2: # odd length
            y[(N-1)//2+1:] = x[-2::-2]
        else: # even length
            y[(N-1)//2+1:] = x[::-2]

        Y = np.fft.fft(y)

        return (Y.real * L_N.real) - (Y.imag * L_N.imag)

    return DCT_N

Jan 04 '25 13:01 zeroby0

It would be very nice if this can be even faster! I've been looking at the source and documentation spectrogram and FFT, and figuring out a way to not create the y variable, but haven't figured it out yet. However, y can be shared across DCT calls.

I'm not quite sure I follow: in the FFT, you return a new array, because you want to leave x alone. So, can't you just do the assignment in C like here?

https://github.com/v923z/micropython-ulab/blob/73ed8cc11f57c5c7605d97eaf5303687a2d21106/code/utils/utils.c#L339-L350

However, it seems most of the time is spent in FFT and in generating the L_N with the exponent.

On a microcrontoller, that's not the most efficient code: you calculate the exponential of a purely imaginary number, and then you also take the real and imaginary parts at the very end of your function. The same can be achieved by taking the sin and cos of the same array. You can save a lot by that. Standard math functions accept the out keyword argument, which means that you can also re-use RAM, if you calculate Y.real * L_N.real - Y.imag * L_N.iimag in two separate calls.

Also note that the instruction L_N = 2 * np.exp(-1j*np.pi*np.arange(N)/(2*N)) creates

an integer array (np.arange(N)),
then a float (np.arange(N)/(2*N)),
then another float (np.pi * np.arange(N)/(2*N)),
then a complex (-1j*np.pi*np.arange(N)/(2*N)),
then another compex (np.exp(-1j*np.pi*np.arange(N)/(2*N)),
and finally another complex (2 * np.exp(-1j*np.pi*np.arange(N)/(2*N))

And after all this, you then take L_N.real and L_N.imag, which then generate two real arrays.

This is really a lot of hassle, when you actually want something like this:

t = np.linspace(0, 0.5*np.pi, num=N)

...

Y.real *= np.cos(t)
Y.imag *= np.sin(t)
Y.real -= Y.imag
Y.real *= 2

return Y.real

If you really want to, you can save one memory allocation by doing this:

t = np.linspace(0, 0.5*np.pi, num=N)

...
scratchpad = np.zeros(N)
np.cos(t, out=scratchpad)

Y.real *= scratchpad

np.sint, out=scratchpad)

Y.imag *= scratchpad
Y.real -= Y.imag
Y.real *= 2

return Y.real

I don't know where the most time is spent, but since the operations are all in-place, your RAM won't be fragmented. I expect this to be very processor-specific, thus, hard numbers won't probably mean too much.

Jan 04 '25 15:01 v923z

I thought I knew how to write performant code and, reading this reply, I'm just amazed. I've never seriously written for micro-controllers, but now I wish I had. I'm going to try out these ideas first thing in the morning! Thank you so much!

Jan 04 '25 16:01 zeroby0

As I said, gains are specific to the hardware, even to the compiler settings, so you might have to experiment. But here are a couple of comments: https://micropython-ulab.readthedocs.io/en/latest/ulab-tricks.html, and I would also highly recommend Damien George's talk at pyconau: https://www.youtube.com/watch?v=hHec4qL00x0.

Part of the problem here is that the word efficient is ill-defined. Do you mean that the code should run fast? Or that it should consume little RAM? Or it could consume RAM, but it shouldn't call the garbage collector all the time? Or it should consume little flash? Or should the call overhead be minimal? These are not trivial questions, and your answer will depend on many factors, on the application, on the hardware, on the acceptable complexity of the firmware etc.

Jan 04 '25 19:01 v923z

micropython-ulab micropython-ulab copied to clipboard

[FEATURE REQUEST] Add DCT

Technical Details

Benefits

Benchmark

Code

micropython-ulab
micropython-ulab copied to clipboard