FullSubNet
Any suggestion to fine-tune with a small dataset?
Hi,
I tried fine-tuning with a small clean dataset of Vietnamese speech (about 100 hours, self-collected from YouTube). Here are a few audio demos. However, the results did not meet my expectations.
Here’s how I prepare data:
- Clean dataset: I used the Vietnamese data mentioned above, filtering out any collected segments shorter than 3 seconds to match `sub_sample_length = 3.072`.
- Noise dataset: I downloaded the DNS Interspeech 2020 noise data from here: DNS-Challenge noise data.
- RIR dataset: I downloaded the dataset from the release page here: RIR dataset.
- Test dataset: I used the test set from DNS-Challenge: Test set.
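For reference, this is a minimal sketch of the clean-data filtering step above. It is my own illustration, not code from the FullSubNet repo: the helper names are placeholders, and it assumes 16 kHz WAV files whose duration must cover one training sub-sample.

```python
import wave

SUB_SAMPLE_LENGTH = 3.072  # seconds; must match sub_sample_length in the training config

def duration_seconds(n_frames: int, sample_rate: int) -> float:
    """Duration of a clip given its frame count and sample rate."""
    return n_frames / sample_rate

def keep_clip(path: str) -> bool:
    """True if the WAV file is long enough to yield one training sub-sample."""
    with wave.open(path, "rb") as w:
        return duration_seconds(w.getnframes(), w.getframerate()) >= SUB_SAMPLE_LENGTH
```

At 16 kHz, a clip needs at least 49,152 frames (16000 × 3.072) to pass the filter.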
I used an RTX 3080 GPU with a batch size of 12 and gradient accumulation steps set to 3, starting from the checkpoint `fullsubnet_best_model_58epochs.tar`.
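With these settings the effective batch size is 12 × 3 = 36. A toy, framework-free sketch (my own illustration, not the actual training loop) of why the per-micro-batch loss should be divided by the accumulation steps: the accumulated gradient then equals the gradient over the full effective batch.

```python
# Toy model y = w * x at w = 0, with mean-squared-error loss; the
# gradient of the loss w.r.t. w over a batch is mean(-2 * x * target).
BATCH, ACCUM = 12, 3  # per-step batch size and accumulation steps from my setup

def grad(batch):
    """Gradient of MSE loss w.r.t. w at w = 0 for a list of (x, target) pairs."""
    return sum(-2 * x * t for x, t in batch) / len(batch)

data = [(i % 5, (i % 5) * 2.0) for i in range(BATCH * ACCUM)]
micro_batches = [data[i * BATCH:(i + 1) * BATCH] for i in range(ACCUM)]

# Accumulate gradients scaled by 1/ACCUM, as a training loop would
# do by dividing each micro-batch loss by the accumulation steps.
accumulated = sum(grad(m) / ACCUM for m in micro_batches)
full = grad(data)  # single gradient over the whole effective batch of 36
```

If the loss is not scaled this way, the accumulated gradient is ACCUM times too large, which effectively multiplies the learning rate.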
I trained for 15 epochs, but the loss decreased only during the first few epochs and then started increasing. When I ran inference on a few samples, the model left more residual noise than the original pretrained checkpoint did.
Am I missing something in the fine-tuning process? Do you have any advice for me?
Thank you!