
LPCNet superseded by FARGAN

Open jmvalin opened this issue 1 year ago • 38 comments

LPCNet is no longer being actively developed. It will continue to be available, but for most applications users are encouraged to switch to the Framewise Autoregressive GAN (FARGAN). FARGAN achieves better quality than LPCNet with just 600 MFLOPS of complexity. That's 1/5 the complexity of the most optimized LPCNet and 1/20 of the original LPCNet.

See our demo page for comparisons with LPCNet, HiFi-GAN, CARGAN and FWGAN. The PyTorch source code along with an optimized C implementation are available as part of the larger Opus codec implementation (FARGAN is used for PLC and deep redundancy within Opus).

jmvalin avatar Oct 11 '24 05:10 jmvalin

Hi~ I've read the code and articles about FARGAN. In the latest opus 1.5.2 source code, FARGAN acts as a vocoder after the silk plc for improving the quality of synthesized speech, rather than directly for packet loss concealment. I would like to ask if I want to train a specialized FARGAN network to perform packet loss concealment directly, what should I do? Is there a code repository implementation for this? Thanks in advance, Freya

YYX666660 avatar Dec 19 '24 07:12 YYX666660

FARGAN is a complete vocoder, just like LPCNet. In Opus, the FARGAN signal completely replaces (rather than enhances) the SILK PLC output.

jmvalin avatar Dec 19 '24 18:12 jmvalin

Thank you so much for your reply!

I'm a beginner with codecs, and I'm not sure if my understanding is correct 🙏 In the Opus 1.5.2 source code, when it comes to the lpcnet_plc_conceal module, the process seems to begin by using compute_plc_pred to update the features and states (using PLCModel, part of LPCNet), then compute_pitchdnn computes features for FARGAN (using the PitchDNN model). Lastly, FARGAN (using the FARGAN model) is used to synthesize the PCM. So these three models are used together to perform PLC?

If I want to perform PLC directly, those three models must be trained and must replace the existing ones in Opus 1.5.2. How can I train them separately? (I can only find the PyTorch implementation of FARGAN.)

YYX666660 avatar Dec 23 '24 13:12 YYX666660

Is there a simple-to-understand migration guide for those of us using the LPCNet encoder/decoder for offline audio encoding and decoding?

rafael2k avatar Apr 21 '25 13:04 rafael2k

  1. Not everyone has the luxury of using PyTorch (especially in more advanced embedded systems...C'mon!)
  2. Shouldn't you be providing that optimized C impl as a standalone like y'all did with LPCNet?

Since the above items are reality... you claim it's a replacement, but it's not the same thing... not yet.

madscientist42 avatar May 28 '25 04:05 madscientist42

PyTorch replaces the TF code in LPCNet. Both still have a C implementation, so I'm failing to see what the issue is. There's also a standalone executable called fargan_demo which can be used in a similar way as lpcnet_demo.

jmvalin avatar Jun 05 '25 03:06 jmvalin

Thanks @jmvalin. But as a user of LPCNet, I'm in the same situation as @YYX666660, @madscientist42, and others. I also could not find out how to use fargan the same way I use lpcnet.

Could you point us where to find the C code "fargan_demo" which could be used as drop-in replacement of "lpcnet_demo"?

ps: and we are all very excited to be able to use a truly free-software, state-of-the-art ML-based audio encoder!

rafael2k avatar Jun 05 '25 18:06 rafael2k

As the README points out, all you need to do is build opus with --enable-deep-plc (--enable-dred will work too) and it'll build a fargan_demo executable that you can use in a similar way to lpcnet_demo. The only difference is that there's no longer a 1.6 kb/s compression mode, but that was never very good to begin with.

jmvalin avatar Jun 06 '25 01:06 jmvalin

Thanks @jmvalin! Which bitrate is fargan using by default? For HF radio, 1.6 kbit/s was about the limit of what is possible in a 3 kHz channel.

rafael2k avatar Jun 06 '25 18:06 rafael2k

So there are two definitions that people use for a vocoder, and a lot of people (myself included) use them in a confusing way. FARGAN is a vocoder in the original meaning: it can generate speech from (uncoded) acoustic features. LPCNet started like that too, but gained a 1.6 kb/s quantization mechanism in the same package. That quantization could technically be reused as-is for FARGAN, but I didn't bother because it's not super efficient.

That being said, if what you care about is speech over HF radio, then you might be interested in David Rowe's RADE (radio auto-encoder) work from FreeDV that uses FARGAN as a vocoder but has an encoder that directly generates the baseband signal. The resulting quality is much better than what 1.6 kb/s LPCNet could ever achieve (even without loss). You can read the paper at https://arxiv.org/pdf/2505.06671 and listen to some samples at https://freedv.org/davids-freedv-update-september-2024/

jmvalin avatar Jun 06 '25 20:06 jmvalin

Hi @jmvalin. I'm well aware of RADE, but as you said, it also generates the baseband, which is not what one needs when integrating with an already existing modem. Also, as far as I know, it still has no optimized C implementation.

So just to check that I understood correctly - FARGAN has no quantization mechanism like LPCNet's, so it is not a replacement for LPCNet in the super-low-bitrate audio encoder use case?

rafael2k avatar Jun 07 '25 19:06 rafael2k

Hm, I wonder why the paper does not mention vocos, which is a de-facto standard vocoder for neural TTS applications.

In a way it has superseded HiFi-GAN: it is faster and easier to use and train, since it operates in the spectral domain, without the need to compute heavy deconvolutions.

snakers4 avatar Jun 07 '25 19:06 snakers4

So just to check that I understood correctly - FARGAN has no quantization mechanism like LPCNet's, so it is not a replacement for LPCNet in the super-low-bitrate audio encoder use case?

FARGAN is a replacement for the vocoder part of the LPCNet project, but it is not a replacement for the compression part. It's likely that the compression code from LPCNet could just be ripped out and used with FARGAN, but I don't know if it's really worth it. The compression code was a quick and dirty proof of concept, and as I said, I believe one could do a lot better today.

jmvalin avatar Jun 07 '25 20:06 jmvalin

Hm, I wonder why the paper does not mention vocos, which is a de-facto standard vocoder for neural TTS applications.

Wasn't aware of vocos, so I had a quick look and I don't think it addresses the same needs as FARGAN. FARGAN is really aiming for small/efficient. As a comparison, the vocos paper lists models with 14M parameters (and no mention of weight quantization), whereas the largest FARGAN model has only about 800k parameters (~20x smaller) with all weights quantized to 8 bits.
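As a rough back-of-the-envelope check (my arithmetic, assuming fp32 storage for the vocos weights since the paper doesn't mention quantization, and int8 storage for FARGAN's quantized weights):

```python
# Rough size comparison (assumptions: fp32 weights for the 14M-parameter
# vocos model, int8 weights for the ~800k-parameter FARGAN model).
vocos_params = 14_000_000
fargan_params = 800_000

vocos_mb = vocos_params * 4 / 1e6    # 4 bytes per fp32 weight
fargan_mb = fargan_params * 1 / 1e6  # 1 byte per int8 weight

print(f"vocos:  ~{vocos_mb:.0f} MB of weights")
print(f"FARGAN: ~{fargan_mb:.1f} MB of weights")
print(f"parameter ratio: ~{vocos_params / fargan_params:.1f}x")
```

The ~17.5x parameter gap becomes roughly a 70x gap in weight storage once the 8-bit quantization is taken into account.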

jmvalin avatar Jun 07 '25 20:06 jmvalin

Thanks a lot for the explanation @jmvalin!

What are the current gaps to a "full" FARGAN with quantization / entropy coding for scenarios where low-bitrate audio is the only option (I would say 1.6 kbit/s or lower... codec2 sounds so bad in 2025)?

ps: could such reuse of the compression code from LPCNet in FARGAN be a fast path to something useful for this use case in the near term?

rafael2k avatar Jun 07 '25 23:06 rafael2k

As a comparison, the vocos paper lists models with 14M parameters (and no mention of weight quantization), whereas the largest FARGAN model has only about 800k parameters (~20x smaller) with all weights quantized to 8 bits.

While the base model may be somewhat large (I guess you could optimize it down to 3-5M parameters), you need to account for the fact that it is not in the same league as HiFi-GAN: HiFi-GAN operates in the audio domain, while vocos operates in the frequency (i.e. mel-spec) domain. So these millions of parameters are not created equal.

I wonder if these three powerful inductive biases (the losses from HiFi-GAN, the spectral domain, and autocorrelation) could be used together.

snakers4 avatar Jun 08 '25 03:06 snakers4

What are the current gaps to a "full" FARGAN with quantization / entropy coding for scenarios where low-bitrate audio is the only option (I would say 1.6 kbit/s or lower... codec2 sounds so bad in 2025)?

ps: could such reuse of the compression code from LPCNet in FARGAN be a fast path to something useful for this use case in the near term?

So if you look at what happens when you use LPCNet as a 1.6 kbps codec, you have the following steps:

  1. Audio gets converted into 20 features (18 cepstral coeffs, 2 pitch coeffs)
  2. The audio features get encoded (vector-quantized) such that 4 vectors (40 ms) fit in 64 bits. That's what you transmit over the air.
  3. The 64-bit payload gets decoded back to four feature vectors
  4. The LPCNet vocoder takes the feature vectors and synthesizes them back to a speech waveform
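The bit budget in step 2 works out to the 1.6 kb/s figure (just arithmetic on the numbers above):

```python
# Bit-budget arithmetic for the LPCNet 1.6 kb/s mode: 4 feature
# vectors (40 ms of speech) are vector-quantized into 64 bits.
payload_bits = 64       # one packet: 4 vector-quantized feature vectors
packet_ms = 40          # 4 frames x 10 ms

bitrate_bps = payload_bits * 1000 / packet_ms   # bits per second
bits_per_frame = payload_bits / 4               # budget per 10 ms vector

print(f"bitrate: {bitrate_bps:.0f} b/s")
print(f"bits per 10 ms feature vector: {bits_per_frame:.0f}")
```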

The places where Opus, LPCNet and RADE differ are steps 2) and 3). In Opus, DRED uses a DNN to achieve much better compression, but it's designed strictly for redundancy and would not work for you (it codes in 1-second chunks, backward in time). And RADE goes straight to baseband, as you know. So the options are to either strip out just the encoder and decoder from LPCNet, or rewrite them with something better. To give you an idea, I uploaded a 1.1 kb/s DRED sample at https://jmvalin.ca/misc_stuff/dred_1100bps.wav . You cannot quite achieve that with a fixed bitrate, but it should give an idea of what's possible.

If you want to use the existing LPCNet encoder and decoder, you'll basically want to extract/adapt the first ~500 lines of lpcnet_enc.c as well as the process_superframe() function. Then for the decoder, you can probably take lpcnet_dec.c almost as-is. The features are the same between LPCNet and FARGAN, but there are a few subtle changes in how the pitch is searched in LPCNet to make it easier to quantize (which may not matter anymore).

jmvalin avatar Jun 08 '25 14:06 jmvalin

Thanks so much for the explanation @jmvalin. I'm definitely not lost anymore.

The places where Opus, LPCNet and RADE differ are steps 2) and 3). In Opus, DRED uses a DNN to achieve much better compression, but it's designed strictly for redundancy and would not work for you (it codes in 1-second chunks, backward in time). And RADE goes straight to baseband, as you know. So the options are to either strip out just the encoder and decoder from LPCNet, or rewrite them with something better. To give you an idea, I uploaded a 1.1 kb/s DRED sample at https://jmvalin.ca/misc_stuff/dred_1100bps.wav . You cannot quite achieve that with a fixed bitrate, but it should give an idea of what's possible.

This is better than anything currently available as open source at this bitrate, and it's potentially real-time capable. I could understand 100% of the speech!

If you want to use the existing LPCNet encoder and decoder, you'll basically want to extract/adapt the first ~500 lines of lpcnet_enc.c as well as the process_superframe() function. Then for the decoder, you can probably take lpcnet_dec.c almost as-is. The features are the same between LPCNet and FARGAN, but there are a few subtle changes in how the pitch is searched in LPCNet to make it easier to quantize (which may not matter anymore).

Got it. Concerning the DRED / FARGAN C code I should grab it from Opus git repo, right?

rafael2k avatar Jun 09 '25 08:06 rafael2k

@jmvalin By the way, do you think that FARGAN may work well on 48 kHz audio as well? Did you maybe try going beyond 16 kHz?

snakers4 avatar Jun 09 '25 13:06 snakers4

By the way, do you think that FARGAN may work well on 48 kHz audio as well? Did you maybe try going beyond 16 kHz?

I never tried anything other than 16 kHz, but there's no reason it shouldn't work assuming everything gets updated accordingly. That being said, I think for 48 kHz, I might also consider just doing 16 kHz plus low-complexity bandwidth extension (https://arxiv.org/pdf/2412.11392).

jmvalin avatar Jun 12 '25 00:06 jmvalin

Got it. Concerning the DRED / FARGAN C code I should grab it from Opus git repo, right?

Correct.

jmvalin avatar Jun 12 '25 00:06 jmvalin

FARGAN is a complete vocoder, just like LPCNet. In Opus, the FARGAN signal completely replaces (rather than enhances) the SILK PLC output.

Hi~ @jmvalin. May I ask a question about training PLC? I'm having trouble training the PLCModel, which is used to conceal the features after the PitchDNN model, before they are passed to FARGAN. I follow the code at https://gitlab.xiph.org/xiph/opus/-/tree/main/dnn/torch/plc?ref_type=heads. For the training (train_plc.py) input, I take the features and loss generated from ./dump_data, following the data preparation in https://gitlab.xiph.org/xiph/opus/-/tree/main/dnn/torch/fargan?ref_type=heads#data-preparation. My input features are float32, and the loss is 16-bit PCM. But the concealment result of my self-trained model is worse than with the original weights; in the image below, the white box regions are the lost signals. Training ran for 20 epochs (the default).

Image

I didn't modify train_plc.py or any other parameters of the model. Is my training input correct? The output of ./dump_data is 16-bit PCM, but plc_dataset.py reads the loss_file as int8 - is this the reason for my training gap?

I really want to know the right input features and loss for train_plc.py. How should I prepare these two inputs correctly?

YYX666660 avatar Jun 12 '25 02:06 YYX666660

hi @jmvalin. I'm also trying to train a new PLC model, but there isn't a README available yet. Unlike @YYX666660, I couldn't find a proper way to generate the loss_file. Is it a single .txt file, or a concatenation of smaller .txt files with one entry per 20 ms packet, where 0 means "packet lost" and 1 means "packet not lost"? By the way, does the PLC PyTorch code match the TF2 code in LPCNet?

Twilight89 avatar Jul 07 '25 03:07 Twilight89

The loss_file format is 8-bit binary, with (I think) 0x00 meaning the packet arrived and 0x01 meaning the packet was lost. But it could be the other way around -- see what the code does. Note also that the feature file is different for the PLC than for DRED and FARGAN, as it includes extra Burg-derived features. The dump_data option for that is -btrain (instead of -train).
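For illustration, here's a minimal sketch of generating such a trace in that 8-bit binary format (the 0x00 = received / 0x01 = lost convention is only the assumption stated above, and the filename and 20% loss rate are made up):

```python
import random

def write_loss_file(path, n_packets, loss_rate=0.2, seed=0):
    """Write one byte per packet: 0x00 = received, 0x01 = lost (assumed)."""
    rng = random.Random(seed)
    trace = bytes(1 if rng.random() < loss_rate else 0
                  for _ in range(n_packets))
    with open(path, "wb") as f:
        f.write(trace)
    return trace

trace = write_loss_file("loss_u8.bin", n_packets=1000, loss_rate=0.2)
print(f"empirical loss rate: {sum(trace) / len(trace):.3f}")
```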

jmvalin avatar Jul 08 '25 22:07 jmvalin

Thanks a lot for the explanation @jmvalin! I've been using the -train option... totally wrong!!! As for the loss_file, I think it plays a crucial role in training:

  • What about the length of the loss_file - should it match the length of the feature file?
  • Can I generate it by combining several packet-loss traces? For example, first generate "0"/"1" sequences of length 15 (is the training sequence length 15, or 1000?), produce 100000 such sequences, and finally concatenate them all into my loss_file.
  • How should I control the loss rate of each packet-loss trace - maybe 20% loss, random or Markov? Is there an existing loss_file available?
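For the Markov option, something like this two-state Gilbert-model sketch is what I have in mind (the transition probabilities are just my guesses, chosen to give roughly 20% average loss):

```python
import random

def gilbert_trace(n, p_bad=0.05, p_recover=0.2, seed=0):
    """Two-state Markov (Gilbert) loss model: 0 = received, 1 = lost."""
    rng = random.Random(seed)
    state = 0
    out = bytearray()
    for _ in range(n):
        if state == 0 and rng.random() < p_bad:
            state = 1          # enter a loss burst
        elif state == 1 and rng.random() < p_recover:
            state = 0          # burst ends
        out.append(state)
    return bytes(out)          # average loss rate: p_bad / (p_bad + p_recover)

trace = gilbert_trace(100_000)
print(f"loss rate: {sum(trace) / len(trace):.2f}")
```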

Twilight89 avatar Jul 09 '25 12:07 Twilight89

hi @jmvalin. I want to ask a question about timing. Arm Neon inference is already in the existing code. I measured the time spent in the opus_decode function when running opus_demo (compiled with --enable-deep-plc) on a Mac M3 Pro laptop: the average is about 0.3 ms. On an iOS device (iPhone 12 Pro Max) the average is 2.5 ms and the peak is 7 ms (only 1.5 ms without deep PLC enabled). Both are Release builds. Have you ever measured the time of running the DNN online? Will there be further Neon optimization?

YYX666660 avatar Jul 17 '25 12:07 YYX666660

Make sure you're compiling with the dotprod instructions enabled if you have them. That might be something like adding the -march=armv8.2-a+dotprod option. If you do that, you'll see that the code inside #ifdef __ARM_FEATURE_DOTPROD in vec_neon.h should get compiled.

jmvalin avatar Jul 17 '25 13:07 jmvalin

Thanks @jmvalin, got it. I'm already using FARGAN. I added the -march=armv8.2-a+dotprod option for arm64 like this:

# autotools compile
./configure --enable-deep-plc CFLAGS="-DUSE_WEIGHTS_FILE"      # before adding dotprod
./configure --enable-deep-plc CFLAGS="-DUSE_WEIGHTS_FILE -march=armv8.2-a+dotprod"   # after adding dotprod

# cmake compile
-DCMAKE_C_FLAGS="-march=armv8.2-a+dotprod" \
-DCMAKE_CXX_FLAGS="-march=armv8.2-a+dotprod" \

But the average opus_decode time on the iPhone 12 Pro Max is almost the same as before adding dotprod, while the Mac M3 got faster (0.1 ms). Is my compile option wrong? Or is the -march=armv8.2-a+dotprod option already the default in ./configure?

YYX666660 avatar Jul 21 '25 13:07 YYX666660

It's probably just some difference in the build system that causes the Neon/dotprod instructions not to be used, falling back to scalar code.

jmvalin avatar Jul 22 '25 14:07 jmvalin

Got it. The build systems of Mac and iOS are different; I need to check the iOS compile. Thanks a lot for the explanation~ @jmvalin

By the way, I'm still struggling with the peak elapsed time. In the existing vec_neon.h, the high peaks mainly originate from large matrix operations (e.g. rows=576, cols=192 or rows=480, cols=272...). How can I optimize further? I've already changed cgemv8x4 to cgemv32x4 and cgemv64x4 to speed things up. What else could I do?
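For reference, the per-call work for the two sizes mentioned above (simple arithmetic):

```python
# MAC/FLOP counts per matrix-vector product for the sizes mentioned above.
for rows, cols in [(576, 192), (480, 272)]:
    macs = rows * cols
    print(f"{rows}x{cols}: {macs} MACs ({2 * macs} FLOPs) per matvec")
```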

YYX666660 avatar Jul 23 '25 03:07 YYX666660