
Unable to load Wav2Vec 2.0 models - wav2vec2_vox_960h_new.pt

Open vade opened this issue 2 years ago • 5 comments

🐛 Bug

Hello.

Firstly, thank you for sharing all of the work, results, and code. It's no small task.

I am attempting to load wav2vec2_vox_960h_new.pt but am getting the following error:

TypeError: object of type 'NoneType' has no len()

after calling

model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(['wav2vec2_vox_960h_new.pt'])

To Reproduce

install PyTorch for CUDA 11.6 via the website docs:

conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge

install dev fairseq:

git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./

in a Python notebook or wherever:

import torch
import fairseq

print(torch.__version__)
print(fairseq.__version__)
# I see 
# 1.12.1
# 0.12.2

use_cuda = torch.cuda.is_available()

print(use_cuda)
# True for me

# load model

model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(['wav2vec2_vox_960h_new.pt'])

I am then greeted with the following error:

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(['wav2vec2_vox_960h_new.pt'])

File ~/miniconda3/envs/pyav-wav2vec/lib/python3.9/site-packages/fairseq/checkpoint_utils.py:473, in load_model_ensemble_and_task(filenames, arg_overrides, task, strict, suffix, num_shards, state)
    471 argspec = inspect.getfullargspec(task.build_model)
    472 if "from_checkpoint" in argspec.args:
--> 473     model = task.build_model(cfg.model, from_checkpoint=True)
    474 else:
    475     model = task.build_model(cfg.model)

File ~/miniconda3/envs/pyav-wav2vec/lib/python3.9/site-packages/fairseq/tasks/audio_pretraining.py:197, in AudioPretrainingTask.build_model(self, model_cfg, from_checkpoint)
    196 def build_model(self, model_cfg: FairseqDataclass, from_checkpoint=False):
--> 197     model = super().build_model(model_cfg, from_checkpoint)
    199     actualized_cfg = getattr(model, "cfg", None)
    200     if actualized_cfg is not None:
    201         # if "w2v_args" in actualized_cfg:

File ~/miniconda3/envs/pyav-wav2vec/lib/python3.9/site-packages/fairseq/tasks/fairseq_task.py:338, in FairseqTask.build_model(self, cfg, from_checkpoint)
    326 """
    327 Build the :class:`~fairseq.models.BaseFairseqModel` instance for this
    328 task.
   (...)
    334     a :class:`~fairseq.models.BaseFairseqModel` instance
    335 """
    336 from fairseq import models, quantization_utils
--> 338 model = models.build_model(cfg, self, from_checkpoint)
    339 model = quantization_utils.quantize_model_scalar(model, cfg)
    340 return model

File ~/miniconda3/envs/pyav-wav2vec/lib/python3.9/site-packages/fairseq/models/__init__.py:106, in build_model(cfg, task, from_checkpoint)
     98             ARCH_CONFIG_REGISTRY[model_type](cfg)
    100 assert model is not None, (
    101     f"Could not infer model type from {cfg}. "
    102     "Available models: {}".format(MODEL_DATACLASS_REGISTRY.keys())
    103     + f" Requested model type: {model_type}"
    104 )
--> 106 return model.build_model(cfg, task)

File ~/miniconda3/envs/pyav-wav2vec/lib/python3.9/site-packages/fairseq/models/wav2vec/wav2vec2_asr.py:208, in Wav2VecCtc.build_model(cls, cfg, task)
    205 @classmethod
    206 def build_model(cls, cfg: Wav2Vec2CtcConfig, task: FairseqTask):
    207     """Build a new model instance."""
--> 208     w2v_encoder = Wav2VecEncoder(cfg, len(task.target_dictionary))
    209     return cls(cfg, w2v_encoder)

TypeError: object of type 'NoneType' has no len()

Code sample

See above

Expected behavior

A properly loaded model.

Environment

  • fairseq Version 0.12.2

  • PyTorch Version 1.12.1

  • OS (e.g., Linux): Linux frank-exchange-of-views 5.15.0-43-generic #46~20.04.1-Ubuntu SMP Thu Jul 14 15:20:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

  • How you installed fairseq (pip, source): pip install --editable ./

  • Build command you used (if compiling from source):

  • Python version: 3.8.10

  • CUDA/cuDNN version: 11.6 / 510.85.02

  • GPU models and configuration: 2x 3090

  • Any other relevant information:

It seems almost all wav2vec2 models don't load properly. I've tried a variety of calls and looked through the git history. Documentation for properly loading these models is sorely lacking.

I understand HuggingFace Transformers may be the preferred way to use these models these days, but it seems very odd to me that there's such a variety of model-loading methods, quirks, and special sauce, none of which seems properly documented, reproducible, or available.

Is there a resource that I perhaps have missed that properly documents how to use these models?

Thank you in advance

vade avatar Sep 01 '22 21:09 vade

I have the same question 😭

KrystalBling avatar Sep 03 '22 15:09 KrystalBling

Same problem here

iulianaciobanitei avatar Sep 12 '22 08:09 iulianaciobanitei

This is because the code was refactored. wav2vec2_vox_960h_new.pt was originally trained with the audio_pretraining task, but in the latest fairseq version the audio_finetuning task is needed to load it. If you try to use audio_pretraining to load a fine-tuned model, the error occurs because the pre-training (contrastive) task has no target dictionary, so the CTC projection head cannot be built.

So follow the steps below.

First, download the model:

mkdir -p /tmp_w2v2
cd /tmp_w2v2
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec2_vox_960h_new.pt

Now we try to load it with the default audio_pretraining task:

import torch
import fairseq

model_path='/tmp_w2v2/wav2vec2_vox_960h_new.pt'
model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([model_path])

We get the familiar error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/workspace/fairseq/fairseq/checkpoint_utils.py", line 486, in load_model_ensemble_and_task
    model = task.build_model(cfg.model, from_checkpoint=True)
  File "/workspace/fairseq/fairseq/tasks/audio_pretraining.py", line 218, in build_model
    model = super().build_model(model_cfg, from_checkpoint)
  File "/workspace/fairseq/fairseq/tasks/fairseq_task.py", line 340, in build_model
    model = models.build_model(cfg, self, from_checkpoint)
  File "/workspace/fairseq/fairseq/models/__init__.py", line 111, in build_model
    return model.build_model(cfg, task)
  File "/workspace/fairseq/fairseq/models/wav2vec/wav2vec2_asr.py", line 242, in build_model
    w2v_encoder = Wav2VecEncoder(cfg, len(task.target_dictionary))
TypeError: object of type 'NoneType' has no len()
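
The NoneType here is the missing target_dictionary: the audio_pretraining task builds no label dictionary, so len(task.target_dictionary) fails when the CTC head tries to size its output. One way to confirm which task name the checkpoint actually stores is a rough check like the one below (the checkpoint layout varies across fairseq versions, so both the legacy 'args' Namespace and the newer 'cfg' config are handled):

import torch

state = torch.load('/tmp_w2v2/wav2vec2_vox_960h_new.pt', map_location='cpu')
if state.get('cfg') is not None:
    print(state['cfg']['task']['_name'])  # task name recorded in the config
else:
    print(state['args'].task)             # task name recorded in the legacy argparse Namespace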

You need to load the model with overrides like below (before running this code, you need the dictionary for wav2vec2_vox_960h_new.pt; the dict file should look like the listing further down).

import os
import torch
import fairseq

model_path='/tmp_w2v2/wav2vec2_vox_960h_new.pt'
path = os.path.dirname(model_path)  # directory that also holds dict.ltr.txt

# override the task to audio_finetuning and point "data" at the directory containing dict.ltr.txt
overrides = {
    "task": 'audio_finetuning',
    "data": path,
}
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    fairseq.utils.split_paths(model_path, separator="\\"),
    arg_overrides=overrides,
    strict=True,
)
model = models[0]
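
If the override works, the loaded task should be the fine-tuning task with a 32-entry letter dictionary. A quick sanity check (assuming the dict.ltr.txt described next is already sitting next to the checkpoint):

print(type(task).__name__)           # expect something like AudioFinetuningTask
print(len(task.target_dictionary))   # 32: 4 special symbols + the 28 letters in dict.ltr.txt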

You need dict.ltr.txt for wav2vec2_vox_960h_new.pt, placed in the "data" directory from the overrides above. You can get dict.ltr.txt easily if you follow the wav2vec2 training guide:

(py38) root@557bec2a5c9d:/tmp_w2v2# pwd && ls
/tmp_w2v2
dict.ltr.txt  wav2vec2_vox_960h_new.pt

dict.ltr.txt looks like this:

| 94802
E 51860
T 38431
A 33152
O 31495
N 28855
I 28794
H 27187
S 26071
R 23546
D 18289
L 16308
U 12400
M 10685
W 10317
C 9844
F 9062
G 8924
Y 8226
P 6890
B 6339
V 3936
K 3456
' 1023
X 636
J 598
Q 437
Z 213

Or you can use my preprocessed dict, dict.ltr.txt.
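
If you would rather create the file yourself, here is a minimal sketch that writes the listing above next to the checkpoint (the symbol order matters; the counts should not matter for inference):

# write dict.ltr.txt into the "data" directory used in the overrides above
symbols = "| E T A O N I H S R D L U M W C F G Y P B V K ' X J Q Z".split(" ")
counts = [94802, 51860, 38431, 33152, 31495, 28855, 28794, 27187, 26071, 23546,
          18289, 16308, 12400, 10685, 10317, 9844, 9062, 8924, 8226, 6890,
          6339, 3936, 3456, 1023, 636, 598, 437, 213]
with open('/tmp_w2v2/dict.ltr.txt', 'w') as f:
    for sym, cnt in zip(symbols, counts):
        f.write(f"{sym} {cnt}\n")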

Now you have successfully loaded the model. Try running inference with a toy net input:

use_fp16 = cfg.common.fp16
use_cuda = torch.cuda.is_available()

if use_cuda: model.cuda()
if use_fp16: model.half()
model.eval()

toy_net_input = {
    "source" : torch.FloatTensor(1,150000),
    "padding_mask" : None
}

def apply_half(t):
    if t.dtype is torch.float32:
        return t.to(dtype=torch.half)
    return t

if use_fp16:
    toy_net_input = fairseq.utils.apply_to_sample(apply_half, toy_net_input)
if use_cuda:
    toy_net_input = fairseq.utils.move_to_cuda(toy_net_input)

toy_net_output = model(**toy_net_input)
>>> toy_net_output['encoder_out'].size()
torch.Size([468, 1, 32])
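
The 468 frames come from the wav2vec2 feature extractor's 320-sample stride (150000 / 320 ≈ 468), and the 32-way output dimension is the letter dictionary (4 special symbols + 28 letters). To turn emissions from real audio into text without flashlight, a minimal greedy CTC decode sketch looks like this, assuming index 0 is the CTC blank as in fairseq's default setup (with the uninitialized toy input the transcript is meaningless; a full Viterbi decoder example appears later in this thread):

emissions = toy_net_output['encoder_out'].float().cpu()  # [T, B, V] logits
pred = emissions.argmax(dim=-1)[:, 0]                    # best symbol per frame for batch item 0
pred = torch.unique_consecutive(pred)                    # collapse repeated frames
pred = pred[pred != 0]                                   # drop the assumed blank index
letters = task.target_dictionary.string(pred)            # space-separated letters, '|' marks word boundaries
print(letters.replace(' ', '').replace('|', ' ').strip())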

SeunghyunSEO avatar Sep 15 '22 19:09 SeunghyunSEO

Thank you @SeunghyunSEO - will try that shortly. Much obliged.

vade avatar Sep 21 '22 16:09 vade

Hello, I ran into the same problem. After following the famous issue #2651 and debugging the recognize.py script, I came up with an implementation that actually worked for me without any errors. Here is the script:

# run ASR inference using a wav2vec2 ASR model and a specified decoder on a single audio file.
# used for wav2vec2 ASR checkpoints that, when loaded, have an 'args' key but no 'cfg' key.

import torch
import soundfile as sf
from argparse import Namespace
import torch.nn.functional as F
from fairseq.data import Dictionary
from fairseq.data.data_utils import post_process
from examples.speech_recognition.w2l_decoder import W2lViterbiDecoder
from fairseq.models.wav2vec.wav2vec2_asr import Wav2VecCtc, Wav2Vec2CtcConfig


def get_config_dict(args):
    if isinstance(args, Namespace):
        # unpack Namespace into base dict obj
        args = vars(args)
    fields = Wav2Vec2CtcConfig.__dataclass_fields__
    # create dict for attributes of Wav2Vec2CtcConfig with vals taken from the same key in args, if they exist
    fields_dict = {}
    # this means Wav2Vec2CtcConfig obj fields will be overwritten with vals from args, otherwise they will be default
    for field in fields.keys():
        if field in args:
            fields_dict[field] = args[field]

    return fields_dict


def get_feature(filepath):
    def postprocess(feats, sample_rate):
        if feats.dim() == 2:
            # stereo / multi-channel audio: average the channels down to mono
            feats = feats.mean(-1)

        assert feats.dim() == 1, feats.dim()

        with torch.no_grad():
            feats = F.layer_norm(feats, feats.shape)
        return feats

    wav, sample_rate = sf.read(filepath)
    feats = torch.from_numpy(wav).float()
    feats = postprocess(feats, sample_rate)
    feats = feats.cuda()

    return feats


if __name__ == "__main__":
    model_path = "/path/to/wav2vec2_vox_960h_new.pt"
    target_dict = Dictionary.load('/path/to/corresponding/dict.ltr.txt')

    w2v = torch.load(model_path)

    args_dict = get_config_dict(w2v['args'])
    w2v_config_obj = Wav2Vec2CtcConfig(**args_dict)

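    # Wav2VecCtc.build_model only needs an object whose .target_dictionary supports len(),
    # so a bare Namespace wrapping the symbol list stands in for a full FairseqTask here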
    dummy_target_dict = {'target_dictionary' : target_dict.symbols}
    dummy_target_dict = Namespace(**dummy_target_dict)

    model = Wav2VecCtc.build_model(w2v_config_obj, dummy_target_dict)
    model.load_state_dict(w2v["model"], strict=True)
    model = model.cuda()
    model.eval()

    sample, input = dict(), dict()
    WAV_PATH = '/path/to/speech.wav'

    # define additional decoder args
    decoder_args = Namespace(**{'nbest': 1})
    generator = W2lViterbiDecoder(decoder_args, target_dict)

    feature = get_feature(WAV_PATH)
    input["source"] = feature.unsqueeze(0)

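    # all-False mask: every sample in "source" is real audio, nothing is padded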
    padding_mask = torch.BoolTensor(input["source"].size(1)).fill_(False).unsqueeze(0)

    input["padding_mask"] = padding_mask
    sample["net_input"] = input

    models = list()
    models.append(model)

    with torch.no_grad():
        hypo = generator.generate(models, sample, prefix_tokens=None)

    hyp_pieces = target_dict.string(hypo[0][0]["tokens"].int().cpu())

    res = post_process(hyp_pieces, 'letter')
    print(res)  # final transcript

Versions that I used:

  • torch version 1.12.1+cu113
  • fairseq version 1.0.0a0+35cc605
  • flashlight version 1.0.0
  • CUDA version 11.4
  • Cuda compilation tools, release 11.1, V11.1.105

This was run inside a docker container.

abarcovschi avatar Oct 25 '22 09:10 abarcovschi