FullSubNet

Any suggestion to fine-tune with a small dataset?

hahunavth opened this issue 4 months ago · 0 comments

Hi,

I tried fine-tuning with a small clean dataset of Vietnamese speech that I collected from YouTube, about 100 hours of audio. Here are a few audio demos. However, the results did not meet my expectations.

Here’s how I prepare data:

  • Clean dataset: I used the Vietnamese data mentioned above, filtering out audio segments shorter than 3 seconds to match sub_sample_length = 3.072.
  • Noise dataset: I downloaded the DNS Interspeech 2020 noise data from here: DNS-Challenge noise data.
  • RIR dataset: I downloaded the dataset from the release page here: RIR dataset.
  • Test dataset: I used the test set from DNS-Challenge: Test set.
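The duration filter in the first step can be sketched as follows. This is a minimal example, not the repo's actual preprocessing script: the directory layout, the use of the stdlib `wave` module, and a 16 kHz sample rate are all assumptions.

```python
# Hypothetical duration filter for the clean dataset: keep only clips at
# least as long as sub_sample_length (3.072 s), so every clip can be
# sub-sampled during training. Names and layout are illustrative.
import wave
from pathlib import Path

SUB_SAMPLE_LENGTH = 3.072  # seconds, from the training config

def clip_duration(path):
    """Return the duration of a WAV file in seconds."""
    with wave.open(str(path), "rb") as wav:
        return wav.getnframes() / wav.getframerate()

def filter_short_clips(wav_dir, min_seconds=SUB_SAMPLE_LENGTH):
    """Yield paths of clips long enough to be sub-sampled during training."""
    for path in sorted(Path(wav_dir).glob("*.wav")):
        if clip_duration(path) >= min_seconds:
            yield path
```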

I used an RTX 3080 GPU with a batch size of 12 and gradient accumulation steps set to 3 (an effective batch size of 36). The model was initialized from the checkpoint fullsubnet_best_model_58epochs.tar.
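The accumulation setup above (batch size 12, 3 accumulation steps, one optimizer update per 36 samples) can be sketched in pure Python. This is a framework-free stand-in for the usual PyTorch loop, just to make the arithmetic explicit; the function names are illustrative.

```python
# Sketch of gradient accumulation: losses are scaled by 1/accum_steps and
# the optimizer steps only every `accum_steps` mini-batches, so a batch of
# 12 with 3 accumulation steps behaves like an effective batch of 36.
BATCH_SIZE = 12
ACCUM_STEPS = 3

def effective_batch_size(batch_size, accum_steps):
    return batch_size * accum_steps

def optimizer_updates(num_batches, accum_steps=ACCUM_STEPS):
    """Return how many optimizer steps `num_batches` mini-batches produce."""
    updates = 0
    for step in range(1, num_batches + 1):
        # backward() would accumulate (loss / accum_steps) gradients here
        if step % accum_steps == 0:
            updates += 1  # optimizer.step() followed by zero_grad()
    return updates
```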

I trained for 15 epochs, but the loss decreased only in the first few epochs and then started increasing. When I ran inference on a few samples, the model left more residual noise than the original pretrained checkpoint.
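The pattern I see (loss falling for a few epochs, then rising) looks like overfitting on the small fine-tuning set, so I am currently keeping the checkpoint with the best validation loss rather than the last epoch. A minimal tracker for that, with illustrative names that are not part of FullSubNet's API:

```python
# Hypothetical early-stopping tracker: remembers the epoch with the lowest
# validation loss and signals a stop after `patience` non-improving epochs.
class BestCheckpointTracker:
    def __init__(self, patience=3):
        self.best_loss = float("inf")
        self.best_epoch = None
        self.patience = patience
        self.bad_epochs = 0

    def update(self, epoch, val_loss):
        """Record one epoch's validation loss; return True to stop training."""
        if val_loss < self.best_loss:
            self.best_loss = val_loss
            self.best_epoch = epoch  # save this checkpoint externally
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        return self.bad_epochs >= self.patience
```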

Am I missing something in the fine-tuning process? Do you have any advice for me?

Thank you!

hahunavth · Oct 10 '24 03:10