vadnet

I have some problems with this project

Open JunGenius opened this issue 5 years ago • 2 comments

Hello, author! First of all, thank you very much for this project; it gave me the ideas I built on. I have some questions:

  1. I have noticed that the neural network makes one classification decision per second of audio, but a single second can contain both speech and noise, for example 30% noise and 70% voice. How does it distinguish them?
  2. If an utterance lasts 1.2 seconds, the final 0.2 seconds of speech may be classified as noise, which leaves the speech segment incomplete. How can this be solved?
  3. I want to shorten the classification window to, say, 500 ms or 250 ms. Should I cut the training speech and noise into 500 ms or 250 ms files and retrain a new model? Would that lower the recognition rate?

I am looking forward to your answer, thank you again.

JunGenius avatar Jun 25 '19 16:06 JunGenius

  1. No, a decision is made per frame (e.g. one second). But you can do two things: train your network on a shorter window size (see e.g. #7), and increase the overlap between frames, e.g. make a prediction every 0.1 s, then apply some post-processing to the sequence of decisions.
  2. Again, I suggest increasing the overlap between frames.
  3. See #7
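
To make the overlap and post-processing idea concrete, here is a minimal sketch in plain Python. It is not vadnet code: the function names `sliding_windows`, `majority_smooth` and `add_hangover`, the 16 kHz sample rate, and the choice of a majority vote plus a hangover are all assumptions made for illustration. The decisions (1 = speech, 0 = noise) would come from whatever classifier is run on each window.

```python
from typing import List, Sequence, Tuple


def sliding_windows(num_samples: int, window: int, hop: int) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping windows.

    A hop smaller than the window gives overlapping frames, so a
    decision is produced every `hop` samples instead of every `window`.
    """
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be positive")
    # The range is empty when the signal is shorter than one window.
    return [(s, s + window) for s in range(0, num_samples - window + 1, hop)]


def majority_smooth(decisions: Sequence[int], width: int) -> List[int]:
    """Replace each decision by the majority vote of its neighbourhood.

    `width` is the (odd) number of decisions in the centred neighbourhood;
    it is truncated at both ends of the sequence. Ties count as noise.
    """
    half = width // 2
    out = []
    for i in range(len(decisions)):
        votes = decisions[max(0, i - half): i + half + 1]
        out.append(1 if 2 * sum(votes) > len(votes) else 0)
    return out


def add_hangover(decisions: Sequence[int], hang: int) -> List[int]:
    """Keep 'speech' active for `hang` extra decisions after speech ends.

    This is one simple way to avoid cutting off the tail of an utterance.
    """
    out = []
    remaining = 0
    for d in decisions:
        if d == 1:
            remaining = hang
            out.append(1)
        elif remaining > 0:
            remaining -= 1
            out.append(1)
        else:
            out.append(0)
    return out


if __name__ == "__main__":
    sample_rate = 16000  # assumed sample rate, not taken from the project
    # 2 s of audio, 1 s window, 0.1 s hop
    windows = sliding_windows(2 * sample_rate, sample_rate, sample_rate // 10)
    print(len(windows), "windows")

    raw = [0, 1, 0, 1, 1, 1, 0, 1, 1, 0, 0, 0]  # made-up per-window decisions
    print(add_hangover(majority_smooth(raw, 3), 2))
```

With a 0.1 s hop, each instant of audio is covered by several windows, so an isolated wrong decision can be outvoted by its neighbours, and the hangover keeps the tail of an utterance attached to the speech segment. The smoothing width and hangover length are tuning parameters that depend on the data.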

frankenjoe avatar Jul 02 '19 06:07 frankenjoe

OK, thank you very much.

JunGenius avatar Jul 17 '19 16:07 JunGenius