fairseq
Unable to load Wav2Vec 2.0 models - wav2vec2_vox_960h_new.pt
🐛 Bug
Hello.
Firstly, thank you for sharing all of the work, results, and code. It's no small task.
I am attempting to load wav2vec2_vox_960h_new.pt but am getting the following error:
TypeError: object of type 'NoneType' has no len()
after calling
model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(['wav2vec2_vox_960h_new.pt'])
To Reproduce
Install PyTorch for CUDA 11.6 per the website docs:
conda install pytorch torchvision torchaudio cudatoolkit=11.6 -c pytorch -c conda-forge
Install the dev version of fairseq:
git clone https://github.com/pytorch/fairseq
cd fairseq
pip install --editable ./
In a Python notebook or wherever:
import torch
import fairseq
print(torch.__version__)
print(fairseq.__version__)
# I see
# 1.12.1
# 0.12.2
use_cuda = torch.cuda.is_available()
print(use_cuda)
# True for me
# load model
model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(['wav2vec2_vox_960h_new.pt'])
I am then greeted with the following error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
Input In [5], in <cell line: 1>()
----> 1 model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(['wav2vec2_vox_960h_new.pt'])
File ~/miniconda3/envs/pyav-wav2vec/lib/python3.9/site-packages/fairseq/checkpoint_utils.py:473, in load_model_ensemble_and_task(filenames, arg_overrides, task, strict, suffix, num_shards, state)
471 argspec = inspect.getfullargspec(task.build_model)
472 if "from_checkpoint" in argspec.args:
--> 473 model = task.build_model(cfg.model, from_checkpoint=True)
474 else:
475 model = task.build_model(cfg.model)
File ~/miniconda3/envs/pyav-wav2vec/lib/python3.9/site-packages/fairseq/tasks/audio_pretraining.py:197, in AudioPretrainingTask.build_model(self, model_cfg, from_checkpoint)
196 def build_model(self, model_cfg: FairseqDataclass, from_checkpoint=False):
--> 197 model = super().build_model(model_cfg, from_checkpoint)
199 actualized_cfg = getattr(model, "cfg", None)
200 if actualized_cfg is not None:
201 # if "w2v_args" in actualized_cfg:
File ~/miniconda3/envs/pyav-wav2vec/lib/python3.9/site-packages/fairseq/tasks/fairseq_task.py:338, in FairseqTask.build_model(self, cfg, from_checkpoint)
326 """
327 Build the :class:`~fairseq.models.BaseFairseqModel` instance for this
328 task.
(...)
334 a :class:`~fairseq.models.BaseFairseqModel` instance
335 """
336 from fairseq import models, quantization_utils
--> 338 model = models.build_model(cfg, self, from_checkpoint)
339 model = quantization_utils.quantize_model_scalar(model, cfg)
340 return model
File ~/miniconda3/envs/pyav-wav2vec/lib/python3.9/site-packages/fairseq/models/__init__.py:106, in build_model(cfg, task, from_checkpoint)
98 ARCH_CONFIG_REGISTRY[model_type](cfg)
100 assert model is not None, (
101 f"Could not infer model type from {cfg}. "
102 "Available models: {}".format(MODEL_DATACLASS_REGISTRY.keys())
103 + f" Requested model type: {model_type}"
104 )
--> 106 return model.build_model(cfg, task)
File ~/miniconda3/envs/pyav-wav2vec/lib/python3.9/site-packages/fairseq/models/wav2vec/wav2vec2_asr.py:208, in Wav2VecCtc.build_model(cls, cfg, task)
205 @classmethod
206 def build_model(cls, cfg: Wav2Vec2CtcConfig, task: FairseqTask):
207 """Build a new model instance."""
--> 208 w2v_encoder = Wav2VecEncoder(cfg, len(task.target_dictionary))
209 return cls(cfg, w2v_encoder)
TypeError: object of type 'NoneType' has no len()
Code sample
See above
Expected behavior
A properly loaded model.
Environment
- fairseq Version: 0.12.2
- PyTorch Version: 1.12.1
- OS (e.g., Linux): Linux frank-exchange-of-views 5.15.0-43-generic #46~20.04.1-Ubuntu SMP Thu Jul 14 15:20:17 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux
- How you installed fairseq (pip, source): pip install --editable ./
- Build command you used (if compiling from source):
- Python version: 3.8.10
- CUDA/cuDNN version: 11.6 / 510.85.02
- GPU models and configuration: 2x 3090
- Any other relevant information:
It seems almost all wav2vec2 models don't load properly. I've tried a variety of calls and dug through the repo, and documentation for properly loading these models is sorely lacking.
I understand HuggingFace Transformers may be the preferred way to use these models these days, but it seems very odd to me that there is such a variety of model-loading methods, quirks, and special sauce, none of which seems properly documented, reproducible, or available.
Is there a resource that I have perhaps missed that properly documents how to use these models?
Thank you in advance
i have the same question 😭
Same problem here
This is because the code was refactored at some point.
wav2vec2_vox_960h_new.pt was originally trained with the audio_pretraining task, but in the latest fairseq version the audio_finetuning task is needed to load the model.
If you try to use audio_pretraining to load a fine-tuned model, the error occurs because the contrastive pretraining task has no target dictionary, so the CTC projection head cannot be built.
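If you want to confirm this for your own checkpoint, you can peek at it before handing it to fairseq. A minimal sketch, assuming only the usual fairseq checkpoint layout (older checkpoints store an 'args' Namespace, newer ones a 'cfg' config):

import torch

state = torch.load('/tmp_w2v2/wav2vec2_vox_960h_new.pt', map_location='cpu')
print(state.keys())  # typically 'args' or 'cfg', plus 'model', 'optimizer_history', ...
if state.get('cfg') is not None:
    print(state['cfg']['task'])   # task config stored in newer checkpoints
else:
    print(state['args'])          # flat argparse Namespace in older checkpoints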
So follow the steps below.
Firstly, download the model:
mkdir -p /tmp_w2v2
cd /tmp_w2v2
wget https://dl.fbaipublicfiles.com/fairseq/wav2vec/wav2vec2_vox_960h_new.pt
Now, if we try to load it with the audio_pretraining task:
import torch
import fairseq
model_path='/tmp_w2v2/wav2vec2_vox_960h_new.pt'
model, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task([model_path])
we get the familiar error:
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/workspace/fairseq/fairseq/checkpoint_utils.py", line 486, in load_model_ensemble_and_task
model = task.build_model(cfg.model, from_checkpoint=True)
File "/workspace/fairseq/fairseq/tasks/audio_pretraining.py", line 218, in build_model
model = super().build_model(model_cfg, from_checkpoint)
File "/workspace/fairseq/fairseq/tasks/fairseq_task.py", line 340, in build_model
model = models.build_model(cfg, self, from_checkpoint)
File "/workspace/fairseq/fairseq/models/__init__.py", line 111, in build_model
return model.build_model(cfg, task)
File "/workspace/fairseq/fairseq/models/wav2vec/wav2vec2_asr.py", line 242, in build_model
w2v_encoder = Wav2VecEncoder(cfg, len(task.target_dictionary))
TypeError: object of type 'NoneType' has no len()
You need to load the model with overrides, like below. (Before using this code, you need the dictionary for wav2vec2_vox_960h_new.pt; the dict file should look like the one shown further down.)
import os
import torch
import fairseq

model_path = '/tmp_w2v2/wav2vec2_vox_960h_new.pt'
path, checkpoint = os.path.split(model_path)

# override the checkpoint's task with audio_finetuning, and point "data"
# at the directory that contains dict.ltr.txt
overrides = {
    "task": 'audio_finetuning',
    "data": path,
}

# note: `checkpoint` is just the filename, so run this from the directory
# that contains the .pt file (here /tmp_w2v2)
models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    fairseq.utils.split_paths(checkpoint, separator="\\"),
    arg_overrides=overrides,
    strict=True,
)
model = models[0]
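For what it's worth, load_model_ensemble_and_task also accepts the full checkpoint path directly, so the split_paths helper can be skipped. A sketch of what should be an equivalent call, still assuming dict.ltr.txt sits next to the .pt file:

models, cfg, task = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    [model_path],
    arg_overrides={"task": "audio_finetuning", "data": path},
    strict=True,
)
model = models[0]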
You need dict.ltr.txt for wav2vec2_vox_960h_new.pt. You can get dict.ltr.txt easily if you follow the wav2vec2 training guide:
(py38) root@557bec2a5c9d:/tmp_w2v2# pwd && ls
/tmp_w2v2
dict.ltr.txt wav2vec2_vox_960h_new.pt
| 94802
E 51860
T 38431
A 33152
O 31495
N 28855
I 28794
H 27187
S 26071
R 23546
D 18289
L 16308
U 12400
M 10685
W 10317
C 9844
F 9062
G 8924
Y 8226
P 6890
B 6339
V 3936
K 3456
' 1023
X 636
J 598
Q 437
Z 213
Or you can use my preprocessed dict, dict.ltr.txt.
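If you would rather build dict.ltr.txt yourself, the format above is just one symbol per line followed by its corpus frequency, with '|' as the word boundary. A minimal sketch, assuming a hypothetical transcripts.txt with one uppercase utterance per line:

from collections import Counter

counts = Counter()
with open('transcripts.txt') as f:        # hypothetical transcript file
    for line in f:
        # fairseq's letter-level targets use '|' as the word separator
        counts.update(line.strip().upper().replace(' ', '|'))

with open('dict.ltr.txt', 'w') as f:
    for symbol, count in counts.most_common():
        f.write(f"{symbol} {count}\n")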
Now you have successfully loaded the model. Try running inference with a toy net input:
use_fp16 = cfg.common.fp16
use_cuda = torch.cuda.is_available()
if use_cuda: model.cuda()
if use_fp16: model.half()
model.eval()

toy_net_input = {
    "source": torch.FloatTensor(1, 150000),
    "padding_mask": None,
}

def apply_half(t):
    if t.dtype is torch.float32:
        return t.to(dtype=torch.half)
    return t

if use_fp16:
    toy_net_input = fairseq.utils.apply_to_sample(apply_half, toy_net_input)
if use_cuda:
    toy_net_input = fairseq.utils.move_to_cuda(toy_net_input)

toy_net_output = model(**toy_net_input)
>>> toy_net_output['encoder_out'].size()
torch.Size([468, 1, 32])
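To turn that toy output into text you can do a rough greedy (argmax) CTC decode. A sketch assuming the task and dictionary loaded above; fairseq's example decoders treat the dictionary's bos index as the CTC blank unless a dedicated <ctc_blank> symbol exists, so that is what gets dropped here:

from fairseq.data.data_utils import post_process

target_dict = task.target_dictionary
logits = toy_net_output['encoder_out']        # (time, batch, vocab) emissions
pred = logits.argmax(dim=-1)[:, 0]            # best symbol per frame
pred = torch.unique_consecutive(pred)         # collapse CTC repeats
pred = pred[pred != target_dict.bos()]        # drop the blank symbol
text = post_process(target_dict.string(pred.int().cpu()), 'letter')
print(text)  # the toy input is uninitialized noise, so expect gibberish

For real audio you would normally use one of the decoders in examples/speech_recognition instead, as in the script further down.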
Thank you @SeunghyunSEO - will try that shortly. Much obliged.
Hello, I ran into the same problem. After following the famous issue #2651 and debugging the recognize.py script, I came up with an implementation that actually worked for me without any errors. Here is the script below:
# run ASR inference using a wav2vec2 ASR model and a specified decoder on a single audio file.
# used for wav2vec2 ASR checkpoints that, when loaded, have an 'args' key but no 'cfg' key.
import torch
import soundfile as sf
from argparse import Namespace
import torch.nn.functional as F
from fairseq.data import Dictionary
from fairseq.data.data_utils import post_process
from examples.speech_recognition.w2l_decoder import W2lViterbiDecoder
from fairseq.models.wav2vec.wav2vec2_asr import Wav2VecCtc, Wav2Vec2CtcConfig
def get_config_dict(args):
    if isinstance(args, Namespace):
        # unpack Namespace into base dict obj
        args = vars(args)
    fields = Wav2Vec2CtcConfig.__dataclass_fields__
    # create dict for attributes of Wav2Vec2CtcConfig with vals taken from the same key in args, if they exist
    fields_dict = {}
    # this means Wav2Vec2CtcConfig obj fields will be overwritten with vals from args, otherwise they will be default
    for field in fields.keys():
        if field in args:
            fields_dict[field] = args[field]
    return fields_dict
def get_feature(filepath):
    def postprocess(feats, sample_rate):
        if feats.dim() == 2:
            # stereo -> mono
            feats = feats.mean(-1)
        assert feats.dim() == 1, feats.dim()
        with torch.no_grad():
            feats = F.layer_norm(feats, feats.shape)
        return feats

    wav, sample_rate = sf.read(filepath)
    feats = torch.from_numpy(wav).float()
    feats = postprocess(feats, sample_rate)
    feats = feats.cuda()
    return feats
if __name__ == "__main__":
    model_path = "/path/to/wav2vec2_vox_960h_new.pt"
    target_dict = Dictionary.load('/path/to/corresponding/dict.ltr.txt')

    w2v = torch.load(model_path)
    args_dict = get_config_dict(w2v['args'])
    w2v_config_obj = Wav2Vec2CtcConfig(**args_dict)

    dummy_target_dict = {'target_dictionary': target_dict.symbols}
    dummy_target_dict = Namespace(**dummy_target_dict)

    model = Wav2VecCtc.build_model(w2v_config_obj, dummy_target_dict)
    model.load_state_dict(w2v["model"], strict=True)
    model = model.cuda()
    model.eval()

    sample, input = dict(), dict()
    WAV_PATH = '/path/to/speech.wav'

    # define additional decoder args
    decoder_args = Namespace(**{'nbest': 1})
    generator = W2lViterbiDecoder(decoder_args, target_dict)

    feature = get_feature(WAV_PATH)
    input["source"] = feature.unsqueeze(0)
    padding_mask = torch.BoolTensor(input["source"].size(1)).fill_(False).unsqueeze(0)
    input["padding_mask"] = padding_mask
    sample["net_input"] = input

    models = list()
    models.append(model)

    with torch.no_grad():
        hypo = generator.generate(models, sample, prefix_tokens=None)

    hyp_pieces = target_dict.string(hypo[0][0]["tokens"].int().cpu())
    res = post_process(hyp_pieces, 'letter')
    print(res)  # final transcription
Versions that I used:
- torch version 1.12.1+cu113
- fairseq version 1.0.0a0+35cc605
- flashlight version 1.0.0
- CUDA version 11.4
- Cuda compilation tools, release 11.1, V11.1.105
This was run inside a docker container.