
Feasibility of accelerated decoding on the N64

Open phoboslab opened this issue 4 months ago • 6 comments

I'm very impressed with the quality/bitrate of this codec. I had the thought that it might be a good fit for the N64. I built a quick prototype that runs entirely on the N64 CPU, using your C decoder: https://github.com/phoboslab/ulc64/ - this runs in realtime for 44100 Hz stereo, but only barely so (tested with the ares emulator).

The N64 has a vector co-processor (RSP) that I thought might be able to handle the IMDCT part of the decoding, but the more I look into it, the more questions I have...

The RSP only has 4KB of DMEM. To do the entire IMDCT without additional DMAs, the block size would need to be lowered to 1024 at most, assuming int16_t precision and a few hundred bytes of extra data. But in your GBA source, the IMDCT is performed at 20(?) bit precision. The RSP can also work with 16.16 fixed point, but that would require another drop in block size, to 512. Not great, but still possible, I assume.
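
As a rough back-of-the-envelope check (the 512-byte reserve for constants/state/pointers is just my guess):

```c
#include <stdio.h>

/* Rough DMEM budget: how much of the RSP's 4096 bytes one block's worth of
 * coefficients would take at different block sizes and sample widths,
 * leaving a guessed 512 bytes for constants, state and DMA pointers. */
int main(void) {
    const int dmem = 4096, reserve = 512;
    const int sizes[] = { 512, 960, 1024 };
    for (int i = 0; i < 3; i++) {
        int n = sizes[i];
        printf("N=%4d: int16_t = %4d bytes (%s), 16.16 = %4d bytes (%s)\n",
               n,
               n * 2, (n * 2 + reserve <= dmem) ? "fits" : "too big",
               n * 4, (n * 4 + reserve <= dmem) ? "fits" : "too big");
    }
    return 0;
}
```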

Now here's the part that escapes me: The C decoder for ULC uses separate input and output buffers as well as a temp buffer. The lapping could probably remain on the CPU without too much overhead, but is it possible to do the IMDCT for ULC "in place" using just one buffer? Or would it be possible to do the IMDCT in multiple steps with not too many DMAs to the RSP?

As a reference, the LibDragon team ported the Opus IMDCT to the RSP (source rsp_opus_*.S) where the 960 frame size just barely fits. The Opus decode performance is ok for menus/cutscenes, but a bit too demanding for background music in a 3d game.

I have no idea how well ULC would fare on the N64, but I assume it could be more performant than Opus, considering how lightweight it is!?

Do you think this is feasible and would be worth it over Opus, performance wise?

phoboslab avatar Sep 16 '25 13:09 phoboslab

I'll preface by saying that I've never worked with N64, and my only familiarity with how things work comes from Kaze's videos on YouTube.

in your GBA source, the IMDCT is performed at 20(?) bit precision.

This is really only to keep the decode precision high, because you get the best performance on GBA using LDM/STM instructions that load/store multiple 32-bit integers, so it's just a case of "may as well". I've tested down to 14-bit precision while experimenting with replacing 64-bit multiplies with 32-bit multiplies, and this doesn't seem to cause any noticeable issues with 8-bit audio output, though.

The C decoder for ULC uses separate input and output buffers as well as a temp buffer.

Right, so:

  • The input buffer stores the decompressed MDCT coefficients (so always BlockSize in length).
  • The lapping buffer stores half a block of samples for lapping with the previous block (BlockSize/2 in length).
  • The output buffer stores the decoded output (8-bit samples, BlockSize*2 in length for double-buffering).
  • Then there's a temp buffer used for the inverse DCT4 step.

The temp buffer is, again, just a performance thing for GBA, where it's faster to ping-pong between two buffers using LDM/STM instructions rather than doing an in-place FFT, where you'd be forced to use LDR/STR instructions at variable step sizes (this is probably a performance /killer/ on x86, though, because of the massive amount of cache needed, but I've never gotten around to actually trying to improve it for that use case). So if you're memory-limited, all you really need is a 16-bit DCT transform buffer (on which you can do an in-place DCT4 using FFT), a 16-bit lapping buffer (making sure to clip the samples before storing), and a 16-bit (or 8-bit, or whatever) output buffer.
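
For that memory-limited case, the layout would look roughly like this (a sketch only; the names and the BLOCK_SIZE value are illustrative, not what the actual sources use):

```c
/* Rough sketch of the minimal per-channel decoder buffers described above.
 * Names and sizes are illustrative only. */
#include <stdint.h>

#define BLOCK_SIZE 1024

typedef struct {
    int16_t transform[BLOCK_SIZE];      /* decompressed MDCT coefficients; the
                                           in-place DCT4 (via FFT) runs here  */
    int16_t lapping[BLOCK_SIZE / 2];    /* second half of the previous block,
                                           kept for overlap-add               */
    int16_t output[BLOCK_SIZE * 2];     /* decoded PCM, double-buffered
                                           (could also be 8-bit)              */
} DecoderChannelBufs;
```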

The codec is written so that the M/S transform is applied before any other processing on the encoder side, so the inverse M/S transform is performed on decoded samples at the very end. The idea behind doing this is that you only need a single transform buffer for all channels: you can re-use it for each channel, decode into the output block, and then do the inverse M/S transform to recover your audio (instead of needing a transform buffer per channel, which would add even more memory use).
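
As a rough sketch of that inverse M/S step (assuming the common L = M+S, R = M-S convention; the real decoder's scaling/rounding may differ):

```c
#include <stdint.h>

/* Sketch: undo mid/side after both channels have been decoded.
 * Assumes L = M + S, R = M - S; buffers are updated in place. */
static void InverseMS_Sketch(int16_t *BufM_L, int16_t *BufS_R, int N)
{
    for (int i = 0; i < N; i++) {
        int32_t M = BufM_L[i];
        int32_t S = BufS_R[i];
        int32_t L = M + S;
        int32_t R = M - S;
        if (L >  32767) L =  32767; if (L < -32768) L = -32768;  /* clip */
        if (R >  32767) R =  32767; if (R < -32768) R = -32768;
        BufM_L[i] = (int16_t)L;
        BufS_R[i] = (int16_t)R;
    }
}
```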

Do you think this is feasible and would be worth it over Opus, performance wise?

I reckon the biggest performance killer for RSP specifically would be the IMDCT, so performance on the RSP side would probably be similar to Opus (when using BlockSize=1024, anyway). On the CPU side, though, performance should be much much better, because of the simplicity of the coefficient syntax.

Aikku93 avatar Sep 16 '25 19:09 Aikku93

As Phoboslab said, the RSP coprocessor only has 4096 bytes of data memory, and we would need to fit the IMDCT of one encoding block in there. In theory that would allow for 1024 32-bit samples, but unfortunately we can't allocate 100% of the memory to samples; we do need a little bit of other data (constants, state, pointers to main memory, etc.). Unfortunately, with the current pow2 constraint, this makes the window size drop straight to 512, skipping all possible intermediate values. For instance, 960 would be a good candidate.

Assuming a custom/different MDCT/IMDCT implementation (on N64 we have an RSP-optimized implementation based on kissfft), would the window size still need to be a pow2? I understand that SubBlockSize is handled via shifts up to 3, but that seems to keep the constraint that the window size be a multiple of 8 (MAX_BLOCK_DECIMATION_FACTOR). Is there anything else that would break down?

rasky avatar Sep 16 '25 22:09 rasky

In theory that would allow for 1024 32-bit sample

I'm guessing that a 16-bit transform buffer (using 14-bit precision, or whatever) isn't really usable, then? I'm not familiar with N64 instructions (or RSP instructions, for that matter), but like I said... the only real reason I went with 32-bit buffers on GBA/DS was the much faster access using LDM/STM instructions (19 cycles for eight values) compared to LDRSH/STRH (40 cycles for eight values).

That said, though, I've never actually tested the range of possible values that occur in the intermediates of the inverse transform, so I don't know how much bit growth actually happens. The forward transform scales by N/2 without any normalization, but does the inverse transform have a similar range for the intermediates? The codec writes the coefficient quantization factors with the assumption that the IMDCT has /no/ scaling, so that if you set your coefficient precision to 14, your output sample data is 14-bit audio (ie. instead of x = iDCT(X) * sqrt(1/N) as in standard "orthogonal" transforms, the decoder assumes x = iDCT(X) directly).
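
In practice that just means the output only needs a fixed shift to reach the target sample width, something like this (the shift amounts are simply the consequence of the 14-bit example above, not code from the decoder):

```c
#include <stdint.h>

/* Sketch: with no scaling in the IMDCT and 14-bit coefficient precision,
 * the transform output is already ~14-bit PCM; reaching the target output
 * width is just a shift (plus clipping, since quantization error can push
 * the overlap-added result slightly past full scale). */
static inline int16_t To16BitPCM(int32_t x14) {
    int32_t x = x14 * 4;                    /* 14-bit -> 16-bit full scale */
    if (x > 32767) x = 32767; if (x < -32768) x = -32768;
    return (int16_t)x;
}
static inline int8_t To8BitPCM(int32_t x14) {
    int32_t x = x14 >> 6;                   /* 14-bit -> 8-bit full scale */
    if (x > 127) x = 127; if (x < -128) x = -128;
    return (int8_t)x;
}
```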

would the window size still need to be a pow2?

No, there's no actual reason it needs to be that way. The encoder and decoder have the built-in assumption that this is the case (mostly just in the DCT-related routines, iirc), but there's no technical reason that the block size needs to be anything in particular (it doesn't even /need/ to be a multiple of 8, as long as the selected window sizes remain integer, obviously).
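
For what it's worth, the only check a candidate size really has to pass is that every window size you'd actually select stays integer; if you use the whole decimation range (shifts up to 3), that just means divisibility by 8, e.g.:

```c
#include <stdio.h>

/* Quick check (illustrative): which candidate block sizes keep every
 * decimated sub-block size integer if the full decimation range
 * (shifts up to 3, i.e. a factor of 8) is used. */
int main(void) {
    const int candidates[] = { 512, 900, 960, 1024 };
    for (int i = 0; i < 4; i++) {
        int n = candidates[i];
        printf("BlockSize=%4d: %s (n/8 = %d, remainder %d)\n",
               n, (n % 8 == 0) ? "ok" : "not integer at full decimation",
               n / 8, n % 8);
    }
    return 0;
}
```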

Aikku93 avatar Sep 17 '25 08:09 Aikku93

One thing to notice is that right now we have a 32-bit implementation ready, not a 16-bit one :) Rewriting it as a 16-bit implementation is indeed possible, but it would require rewriting a large part of the assembly code. In fact, it would take multiple weeks. Before even attempting that, I'd first try to port the decoder to kissfft, which is the FFT implementation I'm familiar with, and then try the 16-bit version there to check the final resulting quality. I'd expect the noise floor to rise a bit because of the rescaling required to avoid overflows, so I guess it makes sense to first assess the output quality of a 16-bit vs a 32-bit version. IOW, if I were to choose between a lapped window of 512 samples with 32-bit samples, or a lapped window of 1024 samples with 16-bit samples, I'm not sure offhand which one would sound better. Maybe your intuition is better than mine here? Obviously the 16-bit version is at least twice as fast (and possibly more), so that would be another thing to weigh.

The current IDCT is x = iDCT(X) as you say, so we need to take into account the magnitude of the coefficients. I assume that within the butterfly functions (which we support for radix 2, 3, 4, and 5), the multiplications with the twiddles (which are in the range 0..1) don't enlarge the magnitude of the samples, while the additions do. I believe that the number of additional bits required is ceil(log2(radix)), so 1, 2, 2, 3 bits for the four radixes cited above. In theory, assuming the input is full-scale (16-bit), one would have to prescale the transform buffer by that many bits before entering the butterfly functions.
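
A quick sketch of that worst-case estimate (the N = 960 plan below is just one possible factorization, picked for illustration):

```c
#include <stdio.h>

/* Sketch: worst-case prescale for a given FFT plan, assuming each radix-r
 * butterfly can grow magnitudes by up to a factor of r (twiddles <= 1),
 * i.e. ceil(log2(r)) extra bits per stage: 1, 2, 2, 3 for radix 2, 3, 4, 5. */
static int BitsForRadix(int r) {
    int bits = 0;
    while ((1 << bits) < r) bits++;   /* ceil(log2(r)) */
    return bits;
}

int main(void) {
    const int plan[] = { 4, 4, 4, 3, 5 };   /* one possible plan for N = 960 */
    int total = 0;
    for (int i = 0; i < 5; i++) total += BitsForRadix(plan[i]);
    printf("worst-case prescale for this plan: %d bits\n", total);   /* = 11 */
    return 0;
}
```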

To be picky, one could even scan the buffer and note the actual current maximum magnitude to decide what to do; in theory that's pretty fast anyway. This would avoid raising the noise floor for windows that wouldn't clip.
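
Something along these lines (sketch only; growth_bits would come from the plan's worst-case estimate above):

```c
#include <stdint.h>

/* Sketch: scan the coefficient buffer for its peak magnitude and return how
 * many bits of prescaling this particular window still needs, given the
 * plan's worst-case growth. Quiet windows then keep their full precision. */
static int NeededPrescale(const int16_t *buf, int n, int growth_bits)
{
    int32_t peak = 0;
    for (int i = 0; i < n; i++) {
        int32_t a = buf[i] < 0 ? -(int32_t)buf[i] : buf[i];
        if (a > peak) peak = a;
    }
    /* Count bits of headroom already available below 16-bit full scale. */
    int headroom = 0;
    while (headroom < growth_bits && peak < (1 << (14 - headroom)))
        headroom++;
    return growth_bits - headroom;   /* bits we must still shift down by */
}
```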

Or otherwise, if I set the coefficient precision to e.g. 14 bits, I have two bits of headroom, which means two fewer bits of prescaling during the FFT. So that's something to take into account while creating the "FFT plan" (the list of operations to perform -- this is something that kissfft does either at build time or at runtime, depending on whether you need to adapt to any possible window size, or whether the supported window sizes are fixed and known at build time).

rasky avatar Sep 17 '25 16:09 rasky

IOW, if I were to choose between a lapped window of 512 samples with 32-bit samples, or a lapped window of 1024 samples with 16-bit samples, I'm not sure offhand which one would sound better.

It mostly depends on the type of audio being encoded (512-sample windows @ ~128kbps 44.1kHz would sound fine for natural audio, but may start degrading with heavily distorted synths and the like). The major issue with using smaller windows in this codec is that the coefficient precision is /really/ low, so you generally want the block size to be as large as possible to avoid MP3-like "sizzle" artifacts; that's why the default block size is set to 2048. I might look at eventually optimizing for a block size of 1024 samples, though, since this would probably be better for porting to any platform (GBA uses really excessive amounts of memory for the 2048-sample block size; someone else is currently using it with 256-sample blocks, too, so I might see about optimizing for different block sizes in general).

I believe that the number of additional bits required is ceil(log2(radix))

Huh, interesting. The major reason I wasn't sure about the bit growth was based on ryg's DCT design posts, where he explains that the inverse transform doesn't seem to increase the magnitude, which is why they can get (barely) away with a 16-bit DCT design. But I guess that using FFT might be different to a hardcoded DCT.

Aikku93 avatar Sep 17 '25 19:09 Aikku93

I'm not sure if this is still a relevant topic, but I've done some more work on the GBA decoder to come up with hard numbers (and also fine-tuned the encoder to give better results for smaller BlockSize).

I've implemented an in-place transform for the IMDCT (no more temp buffer), but the memory usage for BlockSize=1024 is still over 4KiB:

  • 04h*1024 bytes for the transform buffer
  • 04h*(1024/2) bytes for the lapping buffer
  • Total = 6KiB

However, if you're willing to use 16-bit buffers instead of 32-bit, I've confirmed that it's possible to halve the memory usage (provided that you're still able to do intermediate 32-bit multiplies with them). The only caveat is that the decoded coefficients must be stored at 14-bit precision, but I've otherwise detected no 16-bit overflow during the DCT pre-/post-rotation stages or the FFT stages (for reference, my FFT implementation doesn't do any scaling between the stages; X = FFT(x) rather than the usual X = Sqrt[1/N]*FFT(x) or (1/N)*FFT(x)).
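
For reference, the pattern is basically 16-bit storage with the multiplies widened to 32 bits; a single rotation in the pre-/post-rotation stage looks something like this (the Q15 twiddle format here is just for illustration, not necessarily what the real decoder uses):

```c
#include <stdint.h>

/* Sketch: one complex rotation with 16-bit storage and 32-bit intermediates.
 * Coefficients are kept at 14-bit precision in int16_t buffers; c/s are the
 * unit twiddle in Q15 (illustrative format). */
static inline void Rotate16(int16_t *re, int16_t *im, int16_t c, int16_t s)
{
    int32_t r = (int32_t)(*re) * c - (int32_t)(*im) * s;
    int32_t i = (int32_t)(*re) * s + (int32_t)(*im) * c;
    *re = (int16_t)((r + (1 << 14)) >> 15);   /* round back to 16 bits */
    *im = (int16_t)((i + (1 << 14)) >> 15);
}
```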

Aikku93 avatar Sep 27 '25 21:09 Aikku93