Wav2Keyword

Inference

Open codeghees opened this issue 3 years ago • 15 comments

Hi! Great work with this. I think I was able to reproduce your results. @qute012 Two questions: what is the best way to run inference with the trained model? Do you have any sample code? Secondly, I got an error when fine-tuning a model trained on Google Speech Commands on my Urdu dataset, at cfg = convert_namespace_to_omegaconf(state_dict['args']): a KeyError, 'args' not found. What am I doing wrong? I was passing the .pth model, and I checked that the model was being loaded.

Any help would be appreciated.
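For context, this is roughly the failing call (a sketch; the checkpoint path is a placeholder, and I load the checkpoint with torch.load):

import torch
from fairseq.dataclass.utils import convert_namespace_to_omegaconf

state_dict = torch.load("checkpoint.pth", map_location="cpu")
print(list(state_dict.keys()))  # no 'args' key in this checkpoint
cfg = convert_namespace_to_omegaconf(state_dict["args"])  # raises KeyError: 'args'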

codeghees avatar Jul 07 '21 08:07 codeghees

The test accuracy with 10 samples per keyword is over 94 percent. Sounds too good to be true.

codeghees avatar Jul 07 '21 08:07 codeghees

Hello, @codeghees. Could you please share the requirements.txt or conda environment.yml for the environment you used while reproducing the results? I tried to reproduce the results on the Google Speech Commands v2 dataset and ran into the same errors.

BeardyMan37 avatar Jul 08 '21 11:07 BeardyMan37

Hi~ @codeghees @BeardyMan37

Thanks for your interest in this project. Honestly, I can't afford to maintain it right now, and I also can't access the server at the moment ;( If I have time, I'd like to extend this project to support inference. In the meantime, you can reproduce it yourselves by referring to the hyperparameters and the model architecture.

Sorry 😐

dobby-seo avatar Jul 08 '21 11:07 dobby-seo

Can you point me in the right direction for inference?

codeghees avatar Jul 08 '21 12:07 codeghees

I can build it myself.

@BeardyMan37 I used Google Colab.

codeghees avatar Jul 08 '21 12:07 codeghees

@codeghees

  1. Extract the loudest section. This is the most important step for accuracy, because the model takes only a 1-second raw audio clip, so you should check that the extracted signal actually contains voice.
def extract_loudest_section(self, wav, win_len=30):
    # Slide a 1-second window (16000 samples at 16 kHz) across the signal
    # in steps of win_len samples and keep the window with the largest
    # total absolute amplitude.
    wav_len = len(wav)
    temp = abs(wav)

    st, et = 0, 0
    max_dec = 0

    for ws in range(0, wav_len, win_len):
        cur_dec = temp[ws:ws + 16000].sum()
        if cur_dec >= max_dec:
            max_dec = cur_dec
            st, et = ws, ws + 16000
        if ws + 16000 > wav_len:
            break

    return wav[st:et]
  2. Post-process (from fairseq). You don't need to normalize the raw audio yourself. I don't think this step has much effect; I just added it to match the Wav2Vec 2.0 pipeline. I'm not sure, but it should be fine to remove this function.
def postprocess(self, feats, curr_sample_rate):
    # Requires torch and torch.nn.functional as F.
    # Collapse multi-channel audio to mono by averaging over channels.
    if feats.dim() == 2:
        feats = feats.mean(-1)

    if curr_sample_rate != self.sample_rate:
        raise Exception(f"sample rate: {curr_sample_rate}, need {self.sample_rate}")

    assert feats.dim() == 1, feats.dim()

    # Optional per-utterance layer norm, as in the Wav2Vec 2.0 pipeline.
    if self.normalize:
        with torch.no_grad():
            feats = F.layer_norm(feats, feats.shape)
    return feats
  3. Make a single batch to feed to the model.

  4. Predict the class from the argmax of the model output (a minimal sketch covering steps 1-4 follows below).
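Putting the steps together, a minimal sketch. The model call signature, the dataset object holding the two helpers above, and the CLASSES list are assumptions for illustration, not the exact Wav2Keyword API:

import soundfile as sf
import torch

# Hypothetical label list; the order must match the one used at training time.
CLASSES = ["yes", "no", "up", "down", "left", "right", "on", "off", "stop", "go"]

def predict_keyword(model, dataset, wav_path):
    # Step 1: load 16 kHz mono audio and crop the loudest 1-second window.
    wav, sr = sf.read(wav_path)
    wav = dataset.extract_loudest_section(wav)

    # Step 2: fairseq-style post-processing (mono check + optional layer norm).
    feats = dataset.postprocess(torch.from_numpy(wav).float(), sr)

    # Step 3: add a batch dimension -> shape [1, 16000] for a full 1-second clip.
    batch = feats.unsqueeze(0)

    # Step 4: forward pass, then argmax over the class logits.
    model.eval()
    with torch.no_grad():
        logits = model(batch)
    return CLASSES[logits.argmax(dim=-1).item()]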

dobby-seo avatar Jul 08 '21 12:07 dobby-seo

Also, how do we know which index represents which class, i.e. 0 for "UP"? Is it the position of the item in the index array?

codeghees avatar Jul 08 '21 12:07 codeghees

@codeghees

Yes, right! Just like any other simple classification method :D

dobby-seo avatar Jul 08 '21 12:07 dobby-seo

Oh, I meant: how do we know the mapping? Does it come from the CLASSES array?

Thanks!

codeghees avatar Jul 08 '21 12:07 codeghees

Yes. If you can reproduce the training environment, could you open a PR for others?
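For example (assuming the hypothetical CLASSES list from the sketch above):

# The predicted index is just the position in CLASSES.
pred_idx = int(logits.argmax(dim=-1))  # e.g. 2
keyword = CLASSES[pred_idx]            # "up", since CLASSES[2] == "up"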

dobby-seo avatar Jul 08 '21 12:07 dobby-seo

I will go back and check. I just opened Colab and followed the instructions. What is the exact error, @BeardyMan37?

codeghees avatar Jul 08 '21 12:07 codeghees

Managed to resolve it. @codeghees

BeardyMan37 avatar Jul 08 '21 12:07 BeardyMan37

@qute012 Attaching both the requirements.txt and the environment.yml file for your reference.

BeardyMan37 avatar Jul 08 '21 12:07 BeardyMan37

Hello @codeghees. I encountered the same error while trying to fine-tune a Hugging Face wav2vec model with fairseq. Have you found a way to convert a Hugging Face model (.bin) to a fairseq checkpoint (.pt)?

alirezafarashah avatar Mar 02 '22 09:03 alirezafarashah

@codeghees Can you please guide me or share the link to your Colab file? I want to reproduce this result and apply the same strategy to the Urdu language.

salmaShahid avatar Apr 16 '23 18:04 salmaShahid