RustFFT icon indicating copy to clipboard operation
RustFFT copied to clipboard

Make input immutable

Open michaelciraci opened this issue 10 months ago • 13 comments

I took at stab at implementing an immutable function call like this: https://github.com/ejmahler/RustFFT/issues/150

I didn't implement sse/neon/wasm yet, as I wanted to get feedback before I add those architectures as well.

michaelciraci avatar Mar 09 '25 04:03 michaelciraci

Oh, I was thinking much simpler. You can add a method to the trait with a default implementation, like here: https://github.com/ejmahler/RustFFT/blob/801d6a4dbc4f91994f2007b92831b1527428de47/src/lib.rs#L196 So I would add one that splits the scratch, copies the input, and calls the existing method (the one with mutable input). That way you just need to add it in one place.

HEnquist avatar Mar 09 '25 06:03 HEnquist

Well from a performance perspective, do you think it would be worth it to pass through the immutability to some of the process functions? There are a couple algorithms that actually don't modify the input, so we could eliminate some copies.

Here is an example where the input isn't modified at all: https://github.com/ejmahler/RustFFT/blob/dae4c82856824781b3c0e06f69bd732fd6dd8681/src/algorithm/butterflies.rs#L20-L51

michaelciraci avatar Mar 09 '25 06:03 michaelciraci

It's not obvious what way gives the best performance overall. The simple algorithms like the butterflies would benefit from a separate immutable method. But once you build more complicated FFTs that use inner FFTs, then you may end up with a mix of algorithms where some copy, and some don't. Here I would think that doing a single copy before starting the processing would be faster. How about measuring to see if it's worth the effort? The best the immutable process method can do is to be just as fast as the mutable one. So if we just implement the simple default implementation and add some benches to compare mutable and immutable, that would give a good idea of how much could be gained.

HEnquist avatar Mar 09 '25 09:03 HEnquist

Are you thinking of just something like this?

fn process_outofplace_with_scratch_naive(
        &self,
        input: &[Complex<T>],
        output: &mut [Complex<T>],
        scratch: &mut [Complex<T>],
    ) {
        let (mut input_scratch, scratch) = scratch.split_at_mut(input.len());
        input_scratch.copy_from_slice(input);
        self.process_outofplace_with_scratch(&mut input_scratch, output, scratch);
    }

The best the immutable process method can do is to be just as fast as the mutable one.

If you're looking at just the execution time of the FFT, I agree (but I suppose the compiler might be able to make some optimizations from the immutability). Although if you need the input after the FFT, I think there would be a big benefit to pass immutability through. For example:

fn process_signal(
        fft: Arc<dyn Fft>,
        sig: &mut [Complex<f32>],
        output: &mut [Complex<f32>],
        scratch: &mut [Complex<f32>],
        sig_clone: &mut [Complex<f32>],
        ) -> AudioOutput {
    // Copy sig into sig_clone for use later
    sig_clone.copy_from_slice(sig);
    // Process FFT
    fft.process_outofplace_with_scratch(sig, output, scratch);
    
    // Look at FFT'd output to find locations in pre-FFT data to process
    let found_audio = find_audio(output);

    // Now that we know the locations to process, go back to pre-FFT data
    process_audio(found_audio, sig_clone, output)
}

Either way, I can run some benchmarks and get back to you.

michaelciraci avatar Mar 09 '25 18:03 michaelciraci

I ran benchmarks on sizes that I have already implemented: 9, 16, 64, 128

I had the following benchmark setup:

const LENGTHS: [usize; 4] = [9, 16, 64, 128];

fn bench_naive(c: &mut Criterion) {
    for len in &LENGTHS {
        let mut planner = rustfft::FftPlanner::new();
        let fft: Arc<dyn Fft<f32>> = planner.plan_fft_forward(*len);
        assert_eq!(fft.len(), *len);


        let input = vec![Complex::zero(); *len];
        let mut output = vec![Complex::zero(); *len];
        let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len() + len];

        c.bench_function(&format!("naive-{}", len), |b| b.iter(|| {
            fft.process_outofplace_with_scratch_naive(&input, &mut output, &mut scratch);
        }));
    }
}

fn bench_immut(c: &mut Criterion) {
    for len in &LENGTHS {
        let mut planner = rustfft::FftPlanner::new();
        let fft: Arc<dyn Fft<f32>> = planner.plan_fft_forward(*len);
        assert_eq!(fft.len(), *len);


        let input = vec![Complex::zero(); *len];
        let mut output = vec![Complex::zero(); *len];
        let mut scratch = vec![Complex::zero(); fft.get_inplace_scratch_len() + len];

        c.bench_function(&format!("immut-{}", len), |b| b.iter(|| {
            fft.process_outofplace_with_scratch_immut(&input, &mut output, &mut scratch);
        }));
    }
}

These were my results:

naive-9                 time:   [11.833 ns 11.840 ns 11.850 ns]
immut-9                 time:   [9.2378 ns 9.2417 ns 9.2460 ns]

naive-16                time:   [12.321 ns 12.325 ns 12.331 ns]
immut-16                time:   [9.0788 ns 9.0825 ns 9.0867 ns]

naive-64                time:   [75.728 ns 75.768 ns 75.810 ns]
immut-64                time:   [69.823 ns 69.850 ns 69.879 ns]

naive-128               time:   [153.31 ns 153.37 ns 153.44 ns]
immut-128               time:   [142.04 ns 142.10 ns 142.17 ns]

I also compared this against the current mutable input process_outofplace_with_scratch, and you were right that there was no difference compared to the immutable function. The main benefit to immutability is if you need to reuse the input after the FFT is done.

michaelciraci avatar Mar 09 '25 18:03 michaelciraci

I also finished the implementation of mixed radix, and tested with 100, 125, 225:

naive-100               time:   [259.31 ns 259.60 ns 259.89 ns]
immut-100               time:   [251.03 ns 251.41 ns 251.81 ns]
mut-100                 time:   [250.04 ns 250.30 ns 250.60 ns]

naive-125               time:   [636.50 ns 636.91 ns 637.32 ns]
immut-125               time:   [625.15 ns 625.92 ns 626.81 ns]
mut-125                 time:   [622.76 ns 623.39 ns 624.38 ns]

naive-225               time:   [574.95 ns 575.23 ns 575.64 ns]
immut-225               time:   [553.02 ns 553.30 ns 553.68 ns]
mut-225                 time:   [554.14 ns 554.63 ns 555.32 ns]

It appears to me the immutable methods for the butterflies clearly beat out the "naive" default trait implementation. For mixed radix it does not provide the speed boost I was hoping for, although it does still edge out the naive default trait implementation.

I'm not sure what you'd like to do with all this info.

Perhaps the best path forward is to provide the default trait implementation, and override it for algorithms where the immutability would clearly provide a benefit (butterfly for example)? You're the maintainer, so whatever you think is best. You would also have more of an intuition to what algorithms would benefit from overriding the default trait implementation.

michaelciraci avatar Mar 09 '25 19:03 michaelciraci

Are you thinking of just something like this?

Yes that's exactly what I had in mind!

Perhaps the best path forward is to provide the default trait implementation, and override it for algorithms where the immutability would clearly provide a benefit (butterfly for example)?

That sounds like a good way forward. I suggested the benches to get an idea of how much performance the naive immutable approach costs, compared to doing it "properly" with much larger changes.

I took a quick look at Radix-4 (arguably the most important algorithm since it handles all the very common power of twos too large for butterflies) and it doesn't look like it needs mutable input, so it would be a good candidate to add an override to.

You're the maintainer, so whatever you think is best.

No that is @ejmahler. I have spent quite some time working on this library and try to help out with issues etc whenever I can, but I'm just a contributor.

HEnquist avatar Mar 09 '25 20:03 HEnquist

No that is @ejmahler. I have spent quite some time working on this library and try to help out with issues etc whenever I can, but I'm just a contributor.

My mistake; I noticed you were very active in this repo so thought maybe you were a maintainer as well. Regardless, I'll go down that path and look at Radix-4, unless @ejmahler prefers a different direction.

michaelciraci avatar Mar 09 '25 20:03 michaelciraci

One problem with the default impl approach is that the fft algorithms have to request scratch from the caller, and the default impl doesn't add that requested scratch. We could add another method to the fft trait that returns the required immutable out of place scratch, but I find it pretty difficult keeping even two in my head when writing new fft algorithms, so I have concerns about adding a third.

I see a few options, each with their own tradeoffs. I don't really have a preferred option as I start writing this, I just want to get my thoughts out:

  1. Status quo. In the current approach, it's less convenient and less performant to compute a FFT where the input buffer isn't mutable.
  2. Just make the existing out of place api take an immutable input, and adjust all existing algorithms to require extra scratch. The nice thing here is that some algorithms (including the most common usage of radix4 and most butterflies) wouldn't need to change at all, so the happy path would avoid having extra bugs introduced and would avoid a performance or memory requirement hit. Process_inplace for some things like MixedRadix would have to request extra scratch, and actually use the out of place transform internally - and instead of bouncing data back and forth between the input and output, they'd have to bounce data back and forth between the output and scratch, and the overall effect would be much worse cache locality, and exploiting cache locality is one of the main reasons those algorithms exist, so this would hurt the non happy path of powers of two.
  3. The approach currently implemented in the PR of adding an entirely new API endpoint. This would certainly do the best at preserving the performance of existing approaches, and would let each algorithm do the best possible job of handling the data provided to it. The downside is that we need an entirely new FFT implementation for a very significant number of algorithms, increasing the maintenance load. It's not as bad as I originally thought, because AVX mixed radix is done mostly with macros, almost all butterflies don't need to change at all except for the boilerplate macros, our bluestein's out of place mutable implementations can just forward to the mutable versions because they don't modify the input. But it's still a significant amount of new code.
  4. The default impl approach which requests scratch from the caller and copies, and then we can override it for radix4 and butterflies and other things that don't need the input. This is a great compromise as far as project complexity goes, but has the downside of the extra added complexity needed for the scratch method. Perhaps we could also add a default implementation of that, which just calls get_outofplace_scratch_len() and adds self.len()? And anything that overrides the default impl also overrides that method.
  5. A default impl approach that just allocates a scratch vec internally. This would nicely mirror our process() default impl that allocates scratch internally. And then Radix4 and butterflies etc can override it. On the most common path of radix4, this wouldn't change perf at all, and on less common paths, it's essentially a convenience method for what you need to do anyways with the status quo, and we can provide the same guidance that if you're calling it multiple times and want to avoid the allocation, call get_outofplace_scratch_len instead. Although that might steer people doing power of twos towards it that don't benefit at all from that approach.

Now that I've written all of that out, I think if I could rank the approaches in order preference, I'd sort them #4, #5, #1, #2, #3.

It's tempting to say I just prefer the status quo, but we've talked about radix4 and how it's A: the most common path, and B: already doesn't need to overwrite the input except in rare cases, so if we take approach #4 or #5 then we're making the library strictly better for the most common user, which I find to be a pretty compelling argument.

So how does the following approach sound?

  • Give the method a default impl that copies data around and then calls an existing method. I think I'd like to call it process_immutable_with_scratch because our existing method names are already too long without adding another word.
  • Add another method - get_immutable_scratch_len, also with a default impl
  • Override both methods for the algorithms that need it. Theoretically we could override it for every single algorithm to have the optimal implementation, but I think we'll get the most bang for our buck by doing it for the following:
    • Scalar, Sse, Neon, Wasm Radix4 - these would need subtly different versions for immutable vs mutable out of place, because of the way it uses the input as scratch for its inner fft
    • Radix3, RadixN - same as above
    • Scalar, Sse, Neon, Wasm, Avx Butterflies - these just need the mutable version to forward to the immutable version after data validation
    • Scalar, Avx Bluestein's- (The mutable version should also just forward to the immutable version after doing data validation)

The last question is what the default implementation of process_immutable_with_scratch should be. Here are the two choices i see, and of the two, i prefer the inplace version, but let me know what you two think:

fn process_immutable_with_scratch(
        &self,
        input: &[Complex<T>],
        output: &mut [Complex<T>],
        scratch: &mut [Complex<T>],
    ) {
        output.copy_from_slice(input);
        self.process_with_scratch(output, scratch);
    }
fn get_immutable_scratch_len(&self) -> usize {
    self.get_inplace_scratch_len()
}
fn process_immutable_with_scratch(
        &self,
        input: &[Complex<T>],
        output: &mut [Complex<T>],
        scratch: &mut [Complex<T>],
    ) {
        let (mut input_scratch, scratch) = scratch.split_at_mut(input.len());
        input_scratch.copy_from_slice(input);
        self.process_outofplace_with_scratch(&mut input_scratch, output, scratch);
    }
fn get_immutable_scratch_len(&self) -> usize {
    self.get_outofplace_scratch_len() + self.len()
}

ejmahler avatar Mar 10 '25 00:03 ejmahler

Actually, feel free to skip it for RadixNfor now, because i don't even know if it's gonna make it into the next release. I wish I had put it on a branch, in retrospect

ejmahler avatar Mar 10 '25 00:03 ejmahler

I think that suggested approach makes sense.

Regarding the default impl, I prefer the default impl of an in place as well. The scratch length would then match what is already existing.

michaelciraci avatar Mar 10 '25 01:03 michaelciraci

Before I make too much progress, I just wanted to put this here to get any feedback before I get too far down this road to make sure you had the same idea in mind: https://github.com/ejmahler/RustFFT/compare/master...michaelciraci:RustFFT:f632083c63b9aee50319946fa1c3b639298ccfd2

michaelciraci avatar Mar 10 '25 02:03 michaelciraci

Yeah so far it looks good. I was imagining that the butterfly boilerplate could be a lot shorter, but since the data validation is woven in and the data validation mentions specific method names, that might be harder than i thought.

ejmahler avatar Mar 10 '25 05:03 ejmahler

Implementation in PR here: https://github.com/ejmahler/RustFFT/pull/157

michaelciraci avatar Apr 28 '25 01:04 michaelciraci