vadnet
I have some problems with this project
Hello, author! First of all, thank you very much for the ideas this project gave me. I have some questions:
- I have noticed that the neural network makes a classification decision for each 1 second of audio, but a single second can contain both speech and noise, for example 30% noise and 70% voice. How can they be distinguished?
- If a voice segment lasts 1.2 seconds, the final 0.2 seconds of speech may be classified as noise, resulting in incomplete speech segments. How can this problem be solved?
- I want to shorten the classification window, for example to 500 ms or 250 ms. Should I split the training speech and noise into 500 ms or 250 ms files and then retrain a new model? Will that reduce the recognition rate?
I am looking forward to your answer, thank you again.
- No, a decision is made per frame (e.g. one second). But you can do two things: train your network on a shorter window size (see e.g. #7), and increase the overlap between frames, e.g. make a prediction every 0.1 s and apply some post-processing to the sequence of decisions afterwards.
- Again, I suggest increasing the overlap between frames.
- See #7
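To make the overlap and post-processing suggestion concrete, here is a rough sketch in plain Python. It is not taken from vadnet: the function names (`sliding_windows`, `majority_smooth`, `to_segments`), the majority-vote filter, and the choice to let each decision cover one hop are all assumptions made for illustration, and the classifier itself is left out.

```python
from typing import List, Sequence, Tuple


def sliding_windows(num_samples: int, sample_rate: int,
                    window_s: float = 1.0, hop_s: float = 0.1) -> List[Tuple[int, int]]:
    """Return (start, end) sample indices of overlapping analysis windows."""
    win = int(round(window_s * sample_rate))
    hop = int(round(hop_s * sample_rate))
    if num_samples < win:
        return []  # not enough audio for a single full window
    return [(s, s + win) for s in range(0, num_samples - win + 1, hop)]


def majority_smooth(labels: Sequence[int], width: int = 5) -> List[int]:
    """Replace each 0/1 decision by the majority vote in a centered window.

    This removes isolated flips (a single 'noise' inside speech, or the reverse).
    Ties count as noise (0).
    """
    half = width // 2
    smoothed = []
    for i in range(len(labels)):
        votes = labels[max(0, i - half):min(len(labels), i + half + 1)]
        smoothed.append(1 if 2 * sum(votes) > len(votes) else 0)
    return smoothed


def to_segments(labels: Sequence[int], hop_s: float = 0.1) -> List[Tuple[float, float]]:
    """Merge runs of voice decisions (1) into (start_s, end_s) segments.

    Assumption: decision i is treated as covering [i * hop_s, (i + 1) * hop_s).
    """
    segments = []
    start = None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i
        elif label == 0 and start is not None:
            segments.append((start * hop_s, i * hop_s))
            start = None
    if start is not None:
        segments.append((start * hop_s, len(labels) * hop_s))
    return segments
```

In use, you would run the network once per window returned by `sliding_windows`, collect one 0/1 decision per window, then pass that sequence through `majority_smooth` and `to_segments`. With a 0.1 s hop, segment boundaries are resolved to roughly 0.1 s instead of 1 s, which is what addresses the 1.2 s example above.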
OK, thank you very much.