Real FFT?
Thanks for this great crate! It would be great to have a faster way for doing a real to complex fft (such as from time to frequency domain), which is a very common use case. It would approximately halve the computation time compared to doing it via a complex to complex fft. https://www.dspguide.com/ch12/1.htm For my use case I need to replace another fft crate (which uses cmake) because I want to cross compile to mac os, and I only need the real to complex fft for pow2 size.
I needed the same thing and ended up making this: https://crates.io/crates/realfft
It's a simple wrapper of RustFFT that does real to complex fft, and complex to real ifft. And yes it runs about twice as fast.
Thanks for this great crate as well :smile:
Just want to mention that this was maybe up next after this AVX stabilization (https://github.com/awelkie/RustFFT/issues/4#issuecomment-646829960), crossing fingers.
In any case, I'm using realFFT for my project as well, just thought it'd be great to have this directly in the RustFFT crate :)
Once 5.0 is out, I plan to turn my attention to RustDCT for a little while. I don't think I'll be implementing much AVX stuff there, but I at least want to get it compiling with 5.0, and using the same scratch space setup that rustfft uses.
After that, I can start looking into real-only FFTs.
What sort of API do you exactly need for a real-only FFT? What's the application? Do you need access to complex numbers on one side of the FFT, but not the other? Or is it real-only on both sides?
I think these two are by far most common use cases:
- Transform to get spectrum of real valued data. Needs real-to-complex fft.
- Transform real valued data, modify the spectrum, and transform back to real-valued data. Needs real-to-complex fft and complex-to-real ifft.
For even lengths, it's possible to use a complex-to-complex (i)fft of half the length, but I haven't found any such trick for odd lengths. That means that odd length real fft will most likely be much more work.
The existing implementations in other libraries do things in a few ways. Let's start with a 6 element long real vector: [x0r, x1r, x2r, x3r, x4r, x5r]
The simple approach of just converting it to complex becomes: [(x0r, 0), (x1r, 0, (x2r, 0), (x3r, 0), (x4r, 0, (x5r, 0)]
FFT:in x becomes: [(X0r, X0i), (X1r, X1i), (X2r, X2i), (X3r, X3i), (X4r, X4i), (X5r, X5i)]
Because x was real, this becomes a bit simpler: [(X0r, 0), (X1r, X1i), (X2r, X2i), (X3r, 0), (X2r, -X2i), (X1r, -X1i)]
The last two values are redundant, so let's remove them: [(X0r, 0), (X1r, X1i), (X2r, X2i), (X3r, 0)]
Starting from here, I have seen these different solutions:
- Return this complex vector, [(X0r, 0), (X1r, X1i), (X2r, X2i), (X3r, 0)]. Advantage: easy to understand, can be directly multiplied with another spectrum (for convolution etc). Downside: Stores two zeros, means output is longer than input.
- Put X3r in the place of X0i: [(X0r, X3r), (X1r, X1i), (X2r, X2i)]. Advantage: same length as input (can be used for in-place transforms). Downside: can't be directly multiplied, can be confusing.
- (worst one!) Just squeeze out the zeros: [(X0r, X1r), (X1i, X2r), (X2i, X3r)]. Advantage: same length as input (can be used for in-place transforms). Downside: can't be directly multiplied, highly confusing.
I think in-place real-to-complex/complex-to-real transforms are a bad idea, as it means the complex values of the spectrum have to be split and stored as separate real values in the input vector. In RealFFT I went with option 1, and only out-of-place transforms.
As a starting point I would propose to add the same functionality as RealFFT has:
- Only even lengths
- Only out-of-place transforms
- "Easy" data order
- Change the direction enum to a transform type enum with variants C2CForward, C2CInverse, R2CForward, C2RInverse
There's a #4: fftw's R2HC format
Oh, didn't see that one. Not sure which one is worst then, my #3 or the R2HC format...
I've been using FFTS until now, The Fastest Fourier Transform in the South (it's faster than FFTW), specifically I've been using the 1d real function for real-to-complex FFT: https://github.com/linkotec/ffts/blob/2c8da4877588e288ff4cd550f14bec2dc7bf668c/include/ffts.h#L96
The output of a real-to-complex transform is N/2+1 complex numbers, where the redundant outputs have been omitted.
@ejmahler @HEnquist See also this: https://www.smcm.iqfr.csic.es/docs/intel/mkl/mkl_manual/fft/fft_PackedFormats.htm
There are 3 ways to pack the complex output in-place into the input slice, to avoid allocations:
CCS, Pack and Perm.
CCS uses 2 more elements ([1] and [n+1], which are 0) so it can't be used to write the output back into the input slice.
Pack and Perm use exactly as many elements as the input, but if you compare the indices at which the output for an input ends up, Pack is easier to misuse, because the (R, I) pairs are not at even indices.
IMO, Perm makes the most sense because it uses exactly as many elements as the input and the i-th complex output is at the index 2*i, with the only exception that R[n/2] is stored at index 1.
I can just mention that the RealFFT crate is now updated to use RustFFT 5.0.
For my use case I convert a real to complex then take the magnitude of the first half the resulting output.
// Switch to Complex input
for i in 0..FFT_LEN {
self.0[i].re = input[i];
self.0[i].im = 0.0;
}
FFT_FULL.process_with_scratch(&mut self.0[..], &mut self.1[..]);
// Calculate the magnitude
for i in 0..FFT_LEN/2 {
input[i] = self.0[i].norm();
}
Where Self is
/// FFT from the `rustfft` crate
pub struct RsFft(Box<[Complex32; FFT_LEN]>, Box<[Complex32; FFT_LEN]>);
When I last bench marked with rustfft 4.0, IPP (using the ipp-sys crate) was much faster for that I used the functions ippsFFTFwd_RToPerm_32f then ippsMagnitude_32fc. But it complicated the build a lot so I didn't use it by default, I suspect if rust-fft had equivalent to ippsFFTFwd_RToPerm_32f, it would probably now be fast enough for me to drop ipp.
The main thing I dislike with the inplace solutions is that they all offer very poor ergonomics since they need to store complex values in a real vector. The advantages of the Complex type are lost. @Cocalus what lengths are you typically using? Can they be odd? How much faster was ipp?
The in place solutions are great if you can hide the ugliness behind an API
I kind of wish we could do something like return a boxed iterator over the complex output, so we don’t actually need storage for it.
Specifically returning a boxed iterator isn’t great because of the allocation, but are there any applications where lazily generating the complex output would be useful? Or for C2R, taking an iterator as a parameter
On Thu, Jan 7, 2021 at 12:32 AM Henrik Enquist [email protected] wrote:
The main thing I dislike with the inplace solutions is that they all offer very poor ergonomics since they need to store complex values in a real vector. The advantages of the Complex type are lost. @Cocalus https://github.com/Cocalus what lengths are you typically using? Can they be odd? How much faster was ipp?
— You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub https://github.com/ejmahler/RustFFT/issues/28#issuecomment-755967612, or unsubscribe https://github.com/notifications/unsubscribe-auth/AAI2M6SMVOZAFBMEIFIQAWTSYVWSTANCNFSM4VMWGMWQ .
I was using 4096, I think IPP only supports powers of two. I retested with my case of real to magnitudes at 4096. In my benchmark IPP 2019 is about 5.5x faster than RustFFT 5.0. With 10000 iterations and only initializing the FFTs once (plan_fft_forward + ippsFFTInit_R_32f), rustfft took 175ms and IPP took 31ms.
RustFFT has to move things into and out of the real buffer on top of twice as the effective FFT size. So adding a Real API should probably fix a decent chunk of that difference. The IPP magnitude calculation is also vectorized which probably helps. I could probably upload the hacked together benchmark if some wants to poke at it, but I think you need to register an email with intel to download IPP.
Have you looked at the RealFFT crate? At 4096 the fft should be about twice as fast, and you can skip the conversion to complex.
To explain my earlier comment in more detail:
The core values of rust are performance, safety, productivity. The ideal real FFT API would have all three.
I think we're all uncomfortable with the "keep the data inplace and do some weird trick to pack the results" output formats because it would heavily sacrifice productivity.
I think it also hurts safety. Users will have to write loops with random array access to pull data out of the packed formats - and in my experience, it's easy to write loops like that correctly, but hard to write them in a way that avoids bounds checks. Many users will feel tempted to write things like get_unchecked().
So we're talking about doing the half-complex output instead - but I think that also hurts productivity and safety, just less so. Users won't have to remember a bizarre layout, but as I think about it, I think we're asking the average RustFFT user too much if we ask them to remember "we omit the last n/2 - 1 elements of the array, if you want them, iterate from element 1 to element n / 2 - 1 in reverse order and negate them all".
So why not make the default output be the full complex array? Do the reversal and negation ourselves, and return a size-N array instead of size-N/2+1, so that the caller doesn't have to remember the exact specifics of which subsection of the array to reverse.
Once we have that idea, my next thought is, how often do callers need the output in an array? In the case of RustDCT, you compute the DCT2 with an inner real-only FFT. Once you have the full complex output, you do some extra processing anyways, and the final output is just real numbers. So you never actually store the complex output, you only keep it around long enough to compute the real output, and then throw away the complex output.
I on the other hand definitely prefer the half-complex format. My main use case is processing of audio data, doing stuff like convolution. Then I have a real valued signal, and a real valued impulse response that I want to convolute. So I transform them both, multiply the complex results, and transform back. Using the half-complex format means I have half as many coefficients to multiply and store.
I think that most people who know they could use a Real2Complex fft in their application already have a good understanding of how ffts work, and have no problem with the half-complex format (as long as it doesn't use any weird packing tricks of course). But it would make sense to provide some conventient way of filling in the second half of the data when someone needs the full result.
Another thought, if Real2Comples fft returns the full complex result, what should the Complex2Real ifft do? Ignore half of the data? Or assert that it's conjugate symmetric? Neither seems very attractive..
Yeah, the point about convolution makes sense.
Although - an implicit part of the Complex-To-Real, even with the redundant elements omitted, is that the imaginary part of the first and last value is 0. Presumably if you do any sort of convolution involving multiplying with a complex number, your input to C2R is going to have a nonzero imaginary component. How does that affect the output?
Doing convolution that way only works when both the signals are real. Then they both have zeros in those places of their transforms, and the zeros stay also after element-wise complex multiplication of the two transforms.
Would it ever make sense to do a R2C where the inner FFT is inverse? Or only forward?
Would it ever make sense to do a C2R where the inner FFT is forward? Or only inverse?
I think only forward R2C and inverse C2R are useful in practice. I have never seen the other two (fftw doesn't have them for example) and I can't think of any reason for ever using them.
I did some investigation on why it's so hard to find non-even real FFT algorithms:
In order to compute a real FFT, you basically have to apply Mixed Radix, or rader's, or bluestein's. But like, a "Real FFT" version of those algorithms. You have to just derive a version of the algorithm that manually cuts out the wasted work.
Size 2 is a special case, because it takes advantage of the fact that you can compute 2 real FFTs in parallel by computing a single c2c DFT, with one set of real data in the reals, and the other FFT in the imaginaries. You can completely recover the separate outputs of the two FFTs with a O(n) postprocess. So the algorithm in the PR and elsewhere is basically merging a MixedRadix(2, n/2) step (for the 2 parallel FFTs) with the parallel FFT postprocess.
For any non-even number, we'd need mixed radix algorithm, raders, bluesteins, butterflies, planning, the works.
Having to switch back and forth between the real array and the "n / 2 + 1" array seems like a nightmare, but apparently a real-only FFT is trivially convertible to a transform called a "Discrete Hartley Transform", which is real-only, has all the same algorithms, etc, but with the advantage that all the outputs are the same size.
So my current thinking is to make a new planner specifically for real FFTs, and put all this stuff in there. It can have a FFT planner internally if we find that it can speed things up to use an actual FFT, but otherwise, it'll stay off in its own little Discrete Hartley World.
That's not a final plan obviously, but seeing how many different classes are going to be involved, I think it shouldn't be a part of the main planner.
Given that we're going to almost double the code size, we might even want to split it into a separate crate entirely.
This explains a lot, thanks! I'm guessing a DHT would not be able to directly reuse very much of the code in RustFFT(?). Then it makes more sense to put it in a separate crate, RustDHT perhaps? Then the question is what to do about the Real FFTs. Another separate crate on top of RustDHT? This would then be an argument for keeping also the R2C stuff separate from RustFFT.
Building the DHT will be a pretty long term project. What to do in the meanwhile? One easy fix that I think is still better than nothing, is to provide R2C/C2R for odd lengths by just using C2C of the same length. This would at least mean there is a single interface that works for all lengths, even if there is no speed advantage for the odds. I could add this to RealFFT for now, and include the optimizations from https://github.com/ejmahler/RustFFT/pull/50.
And even if it's not faster, there would be added convenience of not having to keep track of the redundant part of the complex output
Yes the convenience itself makes it worth having it. Are you adding this in RustFFT now-ish? If yes I won't do anything more to RealFFT, just mark it as deprecated once this is done.
It will be a while, i recommend porting it to realfft
Maybe it makes sense to take inspiration from FFTS? https://github.com/linkotec/ffts
It's faster than FFTW.
The ffts_init_1d_real function does exactly what we need:
https://github.com/linkotec/ffts/blob/2c8da4877588e288ff4cd550f14bec2dc7bf668c/include/ffts.h#L96
From what I can tell, it's using the same trick as us: https://github.com/linkotec/ffts/blob/2c8da4877588e288ff4cd550f14bec2dc7bf668c/src/ffts_real.c#L407 I don't see anything that would help with odd length real ffts.
The documentation of FFTS is pretty thin, but I think its main algorithms are only for powers of two. For anything else it uses Bluesteins. If that is true, it might be faster for powers of two, and the (quite few) special cases where the fastest solution is Bluestein. For everything else it will most likely be quite a bit slower.
I made a big rewrite of RealFFT: https://github.com/HEnquist/realfft/pull/7 It now behaves a lot more like RustFFT, with a planner that reuses ffts. And whenever it makes sense it uses the same function names as RustFFT. It also includes the optimiztions from https://github.com/ejmahler/RustFFT/pull/50 Finally it also supports odd lengths, by just using a complex fft of the same length and hiding the conversions.
I dug into FFTS, and it's even worse: If you pass in an odd size, it crashes on unaligned SSE loads. If you disable SSE, it computes the wrong output.
So yeah, they just assume your input is even without asserting it.