DSP.jl Implement time-domain convolution and use it for integers

This does the very minimal thing to get discussion started. I think we should do the time-domain convolution in more cases, but before going in that direction, I'd first like to get feedback.

In particular, for "small" u and v, time-domain convolution would be more efficient and I see no reason not to do it. It just takes a bit of benchmarking to figure out what "small" should be exactly, here. And it would also fix this minor issue:

julia> conv(zeros(0), zeros(0))
ERROR: ArgumentError: invalid Array dimensions

julia> conv(zeros(0), zeros(1))
ERROR: InexactError: trunc(Int64, -Inf)

julia> conv(zeros(1), zeros(0))
ERROR: InexactError: trunc(Int64, -Inf)

The time-domain convolution form this PR does the right thing:

julia> conv(zeros(Int, 0), zeros(Int, 0))
Int64[]

julia> conv(zeros(Int, 0), zeros(Int, 1))
Int64[]

julia> conv(zeros(Int, 1), zeros(Int, 0))
Int64[]

Another point worth considering is whether to use time-domain convolution for more input types. Consider

julia> conv(zeros(BigFloat, 1), zeros(BigFloat, 1))
ERROR: type BigFloat not supported

julia> conv(zeros(Rational{Int}, 1), zeros(Rational{Int}, 1))
1-element Vector{Float64}:
 0.0

For BigFloat, this would change from throwing an error to working, but potentially slow, which I'd consider an improvement. For Rational, I wonder whether doing the computation on Rationals (and also returning Rationals) wouldn't be better. The extreme measure would be to do time-domain convolution by default and only use FFT-based approaches for Float32 and Float64 input (above a certain problem size), which would still cover most cases in practice, I guess. Opinions?

Note that if we decide to do time-domain convolution in more cases, we can always do that in later PRs; this one could stand on its own.

Closes #411, fixes #410.

Feb 23 '24 07:02 martinholters

Codecov Report

Attention: Patch coverage is 92.75362% with 5 lines in your changes missing coverage. Please review.

Project coverage is 97.26%. Comparing base (04e4593) to head (879b409).

Files with missing lines	Patch %	Lines
ext/StaticArraysExt.jl	0.00%	2 Missing :warning:
src/dspbase.jl	96.96%	2 Missing :warning:
ext/OffsetArraysExt.jl	0.00%	1 Missing :warning:

Additional details and impacted files

@@            Coverage Diff             @@
##           master     #545      +/-   ##
==========================================
- Coverage   97.40%   97.26%   -0.14%     
==========================================
  Files          16       18       +2     
  Lines        3199     3221      +22     
==========================================
+ Hits         3116     3133      +17     
- Misses         83       88       +5

:umbrella: View full report in Codecov by Sentry.
:loudspeaker: Have feedback on the report? Share it here.

Feb 23 '24 08:02 codecov[bot]

Should tdconv be exported just like tdfilt? Then users could more easily choose the cutoff themselves.

Feb 23 '24 08:02 wheeheee

maybe Val(N) would help with type stability

It does for N>10, which may not be too relevant, but then again ... why not?

Feb 23 '24 09:02 martinholters

Should tdconv be exported just like tdfilt? Then users could more easily choose the cutoff themselves.

Hm. Export tdconv and fdconv (names subject to bike-shedding) and have conv do some best-effort choosing among them? Or add kwarg to conv to force a choice instead of the additional exports?

Feb 23 '24 09:02 martinholters

I kind of like the names tdconv and fdconv, but maybe adding a kwarg is better, since people might also want to choose the overlap-save algorithm and that's one more function.

Feb 24 '24 06:02 wheeheee

I've implemented the keyword-argument-based algorithm selection. I'm not entirely sure I like it, but the first impression is okayish.

The algorithm set to :auto means "select based on problem size", which is the default for types known to be reasonably FFT-able (Float32, Float64, and their Complex versions). For other types the default is :direct. This lets user opt into :auto for other types at their own risk, but provides a sane default.

Feb 26 '24 10:02 martinholters

I took reference to FastConv.jl (licensed under the "MIT License (Expat)") and wrote a version with an explicit loop over CartesianIndices for 1-based arrays (unsure about offset indices). Here it is:

function _tdconv1!(out, E::AbstractArray{<:Number,N}, k::AbstractArray{<:Number,N}) where {N}
    output_indices = CartesianIndices(ntuple(i -> 1:(size(E, i)+size(k, i)-1), Val(N)))
    checkbounds(out, output_indices)
    fill!(out, 0)
    offset = CartesianIndex(ntuple(_ -> 1, Val(N)))
    for j in CartesianIndices(E), i in CartesianIndices(k)
        out[i+j-offset] = muladd(E[j], k[i], out[i+j-offset])
    end
    return out
end

Benchmarks:

julia> x = rand(100); y = rand(100); out = rand(199);

julia> @benchmark _conv_td!($out, $x, $y)
BenchmarkTools.Trial: 10000 samples with 1 evaluation.
 Range (min … max):  9.000 μs …  38.000 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     9.100 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   9.216 μs ± 602.871 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

      █    █                                            ▃     ▂
  ▅▁▁▁█▁▁▁▁█▁▁▁▁█▁▁▁▁▃▁▁▁▁▁▁▁▁▁▃▁▁▁▁▃▁▁▁▁▁▁▁▁▁▁▁▁▁▁▇▁▁▁▁█▁▁▁█ █
  9 μs         Histogram: log(frequency) by time      10.2 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

julia> @benchmark _tdconv1!($out, $x, $y)
BenchmarkTools.Trial: 10000 samples with 10 evaluations.
 Range (min … max):  1.510 μs …   5.720 μs  ┊ GC (min … max): 0.00% … 0.00%
 Time  (median):     1.520 μs               ┊ GC (median):    0.00%
 Time  (mean ± σ):   1.583 μs ± 101.580 ns  ┊ GC (mean ± σ):  0.00% ± 0.00%

     █                                                   ▂
  ▄▁▁█▁▁▃▁▁▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▂▁▁▄▁▁█▁▁▂ ▂
  1.51 μs         Histogram: frequency by time        1.69 μs <

 Memory estimate: 0 bytes, allocs estimate: 0.

Feb 26 '24 14:02 wheeheee

Yep. I used the generator-based version because it allowed to derive the output type based on the actual values. With the latest change from _conv_td to _conv_td!, a simple loop like you have provided is probably better. Once we're sure about the API, I'll try to optimize, probably by switching to such a loop.

Feb 27 '24 07:02 martinholters

I've implemented the keyword-argument-based algorithm selection. I'm not entirely sure I like it, but the first impression is okayish.

If you're referring to the somewhat messy organization in conv, rather than the heuristic used or the ergonomics of the keyword arguments, I have a commit ready to fix that up, if you'd like.

Feb 27 '24 09:02 wheeheee

If you're referring to the somewhat messy organization in conv, rather than the heuristic used or the ergonomics of the keyword arguments, I have a commit ready to fix that up, if you'd like.

I was primarily referring to the keyword-based API. Once that's settled, the organization of the conv methods can be streamlined. And the heuristic when to choose time vs frequency domain is mostly a placeholder for now, that needs adjustment either way, but that's somewhat orthogonal.

So let's discuss API. Some random thoughts I have gathered while working on this:

We should allow the user to explicitly select time or frequency domain. I'm not entirely convinced explicit distinction between "simple FFT" and "overlap-save" is needed. Would there be a use case where a caller needs to decide this?
We should have conv(u,v) do something reasonable by default. To me, that means at least to do small convolutions for non-FFT-able datatypes in time domain and large convolutions of FFT-able datatype in frequency domain. For large integer convolutions, it is not clear whether the long runtime of time domain or the rounding issues of frequency domain are the lesser evil, so an automatic mode that only considers problem size but ignores input types might also be desirable.
For the poly-algorithm conv, the result type should be independent of the algorithm, and I guess the only thing take really makes sense is eltype(conv(u,v)) == promote_type(eltype(u), eltype(v)).
If we decide to implement conv as first allocating the output and then calling a conv! function, we should consider making the latter public.
The implementation in the first commit here uses the collect machinery to derive the output eltype based on the actual values. That may have applications in rare situations, but I wouldn't want to make the API more complicated just to allow this.
We might want to be a bit careful with swapping u and v in the light of non-commutative number types.

I think the current kwargs approach addresses the first three items, although choosing between :auto and :direct based on the eltype might come a bit surprising. I'm not sure I like that. The alternative would be two different auto-like keywords, I guess.

Feb 27 '24 13:02 martinholters

My opinion, for what it's worth:

I initially thought it possible in rare cases that the slower algorithm is chosen, or that it could be platform-dependent, so that it could be nice to give users a choice by exposing os vs fft. As this doesn't seem to be the case, confusing users with these options seems pointless and without other benefits, I take back that suggestion.
I think many would find conv! welcome, totally in favour of that.
WRT swapping arguments, since AFAIK there isn't a trait or a general way to determine commutativity, we should probably define another union of types that are commutative and where swapping is legal, for correctness? Along with documentation, @warn for non-explicitly-allowed types could be warranted since that can really slow things down, though I'm not sure if @warn does bad things even on the slow path...
On integer convolutions: I'm on the side again of :direct by default, so path 2 in #410, with another warning if the inputs aren't reasonably small. This would obviate the need for another keyword argument, if I understand you correctly, and another benefit, as was mentioned in that issue, is conforming with SciPy's behaviour.
Additionally, not sure if when conv! is in use, and the output array is of a wider type (although again, not too sure how to determine that), we should promote the elements from u and v first before multiplying and adding.

Feb 27 '24 16:02 wheeheee

I'm no fan of @warnings, unless there is also an option to say "I know what I'm doing, no need to flood me with warnings". And including that option would make the API somewhat clunky, in my opinion.

Otherwise, this is slowly getting shape -- partially in code, partially in my head for now. One thing I haven't made up mind about is how conv! should deal with offset axes. A short recap for conv:

If all input axes are OneTos, so will be the output axes, and if the input indices are $1,\ldots,M$ and $1,\ldots,N$, the output indices will be $1,\ldots,M+N-1$ as a OneTo (for every axis).
If there is at least one non-OneTo axis, and if the input indices are $0,\ldots,M-1$ and $0,\ldots,N-1$, the output indices will be $0,\ldots,M+N-2$ (for every axis). Shifts in the input are reflected in the output, so if the input indices are $1,\ldots,M$ and $1,\ldots,N$, the output indices will be $2,\ldots,M+N$. This directly matches the $y[n]=\sum v[m]\cdot u[n-m]$ definition.

The two cases are inconsistent with each other for the input-indices-start-at-1 case, but each one individually is very much the only reasonable choice, so that's where we are. Now if the output provided to conv! matches what conv returns, all is well, but what if not. If the output has a OneTo axis, just storing the results starting at 1 seems ok, I'd say. If the output and inputs have offsets, but they don't "match", if all required indices are present in the output, storing the result there and zeroing the rest sounds reasonable. If the result does not fit into the output, a BoundsError seems warranted. A silent truncation might also be considered. But what if all inputs are OneTos but the output is not? Put the results starting at 2, starting at 1, or error?

Of course, erroring for unclear cases would open the possibility to decide later on, so that might be the safest route for now.

Feb 28 '24 13:02 martinholters

Sounds good. Another thought I had regarding commutativity and axes, while reading this after prematurely micro-optimizing the loop: even if we consider * potentially non-commutative, I suspect we could still retain the performance benefits of swapping u and v by simply changing the iteration order of the CartesianIndices.

~~However, as it is, there is the potential for type instability if the axes of each CartesianIndices are not of the same type, although union splitting may automatically take care of that. If this is implemented, I think extracting the kernel out and using a function barrier would at least look nicer, if it isn't strictly necessary.~~ Scratch that, I see that they are all UnitRanges now...

we should promote the elements from u and v first before multiplying and adding.

I also just realized that muladd automatically promotes first.

Feb 28 '24 15:02 wheeheee

These all seem like great changes!

Feb 29 '24 14:02 galenlynch

This turns out to be a but more of a rabbit hole than I had anticipated, but it's slowly getting into shape. Apart from some more tests and documentation, the main to-do appears to be a better heuristic to choose the fastest algorithm. But the working of the algorithm kwarg also may need some more consideration. Current status:

:direct chooses time-domain direct convolution.
:fft chooses an FFT-based algorithm; whether simple or overlap-save is chosen automatically. Should we allow the caller to choose?
:auto picks the faster (well, presently modulo the shortcomings of the super-simple heuristic) of those.

The default for algorithm is chosen based on the (output) eltype: :auto if it is Float32, Float64 or a Complex of those, :direct otherwise. Another option would be to have another keyword, say :fastest, to do what :auto does now, and have :auto be what the default does now (i.e. being equal to :direct or :fastest depending on eltype). Would the latter be better?

Another alternative would be to replace the algorithm keyword with something like allow_fft_based::Boolean where false corresponds to current :direct and true to current :auto, with the default again decided by eltype. Would that be preferable?

Feb 29 '24 14:02 martinholters

I think allowing the user to choose the algorithm would be better. In my previous attempts to add direct convolution to this and figure out which alg to choose, I found it very hard to come up with heuristics since dimensionality, size of the problem, element type, number of threads available, and architecture (eg avx instructions available) all come into play. Doing a best effort selection and letting users who really care override seems better imo.

Feb 29 '24 14:02 galenlynch

That would then include the option the choose between simple FFT and overlap-save, right? (I had this in an earlier iteration, can revive.)

Feb 29 '24 14:02 martinholters

I agree with giving users the ability to choose specific algorithms. Maybe we could have fft for auto-selection, and fft_simple, fft_overlapsave to specify an algorithm.

Feb 29 '24 17:02 mbaz

I think the second option with :auto replacing the current default behaviour seems safer. Perhaps a simplified general-user-facing conv(u, v; mode=:auto(/:unsafe_fast)) would be friendlier, with :unsafe_fast enabling conversion to floats and FFT for numbers that aren't IEEE floats / complex. And, for those who need it, put the additional, more involved options in conv!(out, u, v; alg), also defining an allocate_output for convenience, where one can choose a specific algorithm, :fft, or the options in conv. The different keyword arguments would be confusing though.

Mar 01 '24 04:03 wheeheee

Maybe we could have fft for auto-selection, and fft_simple, fft_overlapsave to specify an algorithm.

Do you mean fft behave like the present auto, i.e. also allowing time-domain? That would be a bit surprising, so probably not. So I assume fft should mean "use one of the FFT-based algorithms, but I don't care which one". Sure, that can be done.

@wheeheee I don't think allowing different kwargs in conv and conv! for similar purposes would help simplify the interface for users in any way.

Mar 01 '24 07:03 martinholters

So I assume fft should mean "use one of the FFT-based algorithms, but I don't care which one". Sure, that can be done.

Yes, that's what I meant.

Mar 01 '24 14:03 mbaz

Ok, as there were no complaints about the current API, I've added docs and more tests. If someone could check the docs for comprehensibility, that would be welcome. Remaining to-do is to devise a heuristic for deciding between time and frequency domain.

Mar 11 '24 14:03 martinholters

Sorry this has stalled. The only thing left to do from my perspective is to come up with a heuristic to choose between time-domain and frequency-domain. I meant to base this on some benchmarking, but that "some" turned out to be quite a lot, especially when not restricting to 1-d. For the time-domain algorithm, my benchmarking confirmed that C*length(u)*length(v) is a reasonable estimate for the runtime with C depending on dimensionality and element type. (Interestingly, dimensionality and element type seem to be non-orthogonal here, though.) But some preliminary benchmarking I did for the FFT convolution gave surprisingly inconclusive results.

If anyone wants to pick this up or has a good idea for a benchmarking strategy - any help is appreciated!

Mar 28 '24 12:03 martinholters

Not sure what the detailed benchmarks say, but would it be acceptable to start with the conservative heuristic here since it's IIUC a definite improvement for small sizes? Also saw this and tried it out, with muladd too; I think it fixes the problem with OffsetArrays not SIMDing as well as normal arrays. Hopefully @simd is safe for arbitrary eltypes. Another small problem: the padding heuristic using nextfastfft isn't always optimal, causing performance to have weird jumps sometimes.

julia> x = rand(ComplexF64, 245); rx = reverse(x);

julia> @btime conv($x, $rx; algorithm=:fft_simple);
  45.600 μs (13 allocations: 31.91 KiB)

julia> x = rand(ComplexF64, 246); rx = reverse(x);

julia> @btime conv($x, $rx; algorithm=:fft_simple);
  27.500 μs (13 allocations: 32.28 KiB)

Mar 31 '24 17:03 wheeheee

Another pair of eyes proof-reading the added warning to the docstring would be welcome. Then, this should be ready to (squash!)-merge.

Oct 16 '24 07:10 martinholters

Just a side note that from 1.10 -> 1.11 the tests take something like 0.8x more time, which imo is a pretty serious performance regression. Can't be certain but probably not this PR's fault, since it's present in all sections. Locally, it's not that bad, but still slightly worse.

Oct 17 '24 12:10 wheeheee

Just a side note that from 1.10 -> 1.11 the tests take something like 0.8x more time, which imo is a pretty serious performance regression. Can't be certain but probably not this PR's fault, since it's present in all sections. Locally, it's not that bad, but still slightly worse.

Julia 1.11.0 had some compile time (and other latency) regressions, which apparently have only been partially fixed in 1.11.1. I'm inclined to blame those, given that the tests are typically pretty compile-heavy. So with some luck, things will improve with upcoming Julia versions. If not, we should inspect this more closely.

Oct 18 '24 06:10 martinholters

Okay, pushed a version utilizing package extensions. The StaticArrays part is just a proof of concept for now and needs tests, but might better be added in a separate PR later.

Oct 28 '24 13:10 martinholters

@wheeheee are you ok with this in the present shape? Then, once #575 has landed, I'll remove the last commit with the StaticArrays stuff (to reappear in a separate PR) and merge. Ok with you?

Oct 29 '24 10:10 martinholters

Sounds good! Not familiar with package extensions, so maybe it would be better if there's another reviewer...

Oct 29 '24 11:10 wheeheee