AttributeError: 'Wav2Vec2Processor' object has no attribute 'sampling_rate'

Open arabcoders opened this issue 11 months ago • 10 comments

Hello, I have a simple project testing out whisperX. The test script:

import json
import logging
import whisperx

model_opts = {
    "whisper_arch": "large-v2",
    "device": "cuda",
    "compute_type": "float16",
    "download_root": "/home/user/.config/whisper-models",
    "language": "ja"
}

trans_opts = {
    "temperatures": [
        0.0,
        0.2,
        0.4,
        0.6000000000000001,
        0.8,
        1.0
    ],
    "best_of": 5,
    "beam_size": 5,
    "patience": 2,
    "initial_prompt": None,
    "condition_on_previous_text": True,
    "compression_ratio_threshold": 2.4,
    "log_prob_threshold": -1.0,
    "no_speech_threshold": 0.6,
    "word_timestamps": False,
    "prepend_punctuations": "\"'“¿([{-",
    "append_punctuations": "\"'.。,,!!??::”)]}、",
    "max_new_tokens": None,
    "clip_timestamps": None,
    "hallucination_silence_threshold": None
}

filename = '/mnt/media/test.mkv'

model = whisperx.load_model(**model_opts, asr_options=trans_opts)
audio = whisperx.load_audio(filename)

results = model.transcribe(audio, batch_size=16)

device = 'cuda'

# 2. Align whisper output
model_a, metadata = whisperx.load_align_model(
    language_code=results["language"],
    device=device,
)

results = whisperx.align(results["segments"], model_a, metadata, audio, device, return_char_alignments=False)

logging.debug(json.dumps(results, indent=2, ensure_ascii=False))

leads to

/home/user/test/.venv/lib/python3.11/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
torchvision is not available - cannot save figures
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.2.0.post0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../.cache/torch/whisperx-vad-segmentation.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.2.1+cu121. Bad things might happen unless you revert torch to 1.x.
Some weights of the model checkpoint at jonatasgrosman/wav2vec2-large-xlsr-53-japanese were not used when initializing Wav2Vec2ForCTC: ['wav2vec2.encoder.pos_conv_embed.conv.weight_g', 'wav2vec2.encoder.pos_conv_embed.conv.weight_v']
- This IS expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing Wav2Vec2ForCTC from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
Some weights of Wav2Vec2ForCTC were not initialized from the model checkpoint at jonatasgrosman/wav2vec2-large-xlsr-53-japanese and are newly initialized: ['wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original0', 'wav2vec2.encoder.pos_conv_embed.conv.parametrizations.weight.original1']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
Traceback (most recent call last):
  File "/home/user/test/test.py", line 54, in <module>
    results = whisperx.align(results["segments"], model_a, metadata, audio, device, return_char_alignments=False)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/user/test/.venv/lib/python3.11/site-packages/whisperx/alignment.py", line 232, in align
    inputs = processor(waveform_segment.squeeze(), sampling_rate=processor.sampling_rate, return_tensors="pt").to(device)
                                                                 ^^^^^^^^^^^^^^^^^^^^^^^
AttributeError: 'Wav2Vec2Processor' object has no attribute 'sampling_rate'

I am unable to get it working at all. Testing just faster-whisper works fine, so it seems the problem is with the Wav2Vec model.

arabcoders avatar Feb 26 '24 18:02 arabcoders

I have the same issue. Did anyone get around this? It might be caused by a breaking change in transformers; I'll try downgrading transformers.

frodo821 avatar Mar 03 '24 07:03 frodo821

I finally solved this error by rewriting alignment.py like this:

-                     inputs = processor(waveform_segment.squeeze(), sampling_rate=processor.sampling_rate, return_tensors="pt").to(device)
+                     inputs = processor(waveform_segment.squeeze(), sampling_rate=processor.feature_extractor.sampling_rate, return_tensors="pt").to(device)

frodo821 avatar Mar 03 '24 08:03 frodo821
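The reason this works: in recent transformers versions, `Wav2Vec2Processor` is a thin wrapper that delegates to a feature extractor and a tokenizer, and the sampling rate is stored on the feature extractor, not on the processor itself. A minimal sketch with stand-in classes (not the real transformers types) to illustrate the attribute lookup:

```python
# Stand-ins modeling the structure of transformers' Wav2Vec2Processor:
# the processor wraps a feature extractor, which owns the sampling rate.

class FakeFeatureExtractor:
    sampling_rate = 16000  # wav2vec2 models expect 16 kHz audio


class FakeProcessor:
    def __init__(self):
        self.feature_extractor = FakeFeatureExtractor()


processor = FakeProcessor()

# processor.sampling_rate raises AttributeError, exactly as in the issue:
assert not hasattr(processor, "sampling_rate")

# ...but the nested attribute resolves fine, which is why the fix works:
print(processor.feature_extractor.sampling_rate)  # 16000
```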

I finally solved this error by rewriting alignment.py like this:

-                     inputs = processor(waveform_segment.squeeze(), sampling_rate=processor.sampling_rate, return_tensors="pt").to(device)
+                     inputs = processor(waveform_segment.squeeze(), sampling_rate=processor.feature_extractor.sampling_rate, return_tensors="pt").to(device)

Thanks, I've made a small patch file that makes it backwards compatible:

--- .venv/lib/python3.11/site-packages/whisperx/alignment.py	2024-03-03 17:22:05.042130573 +0300
+++ .venv/lib/python3.11/site-packages/whisperx/alignment.py	2024-03-03 17:25:20.760972944 +0300
@@ -229,7 +229,13 @@
                 emissions, _ = model(waveform_segment.to(device), lengths=lengths)
             elif model_type == "huggingface":
                 if preprocess:
-                    inputs = processor(waveform_segment.squeeze(), sampling_rate=processor.sampling_rate, return_tensors="pt").to(device)
+                    sample_rate = None
+                    if 'sampling_rate' in processor.__dict__:
+                        sample_rate = processor.sampling_rate
+                    if 'feature_extractor' in processor.__dict__ and 'sampling_rate' in processor.feature_extractor.__dict__:
+                        sample_rate = processor.feature_extractor.sampling_rate
+
+                    inputs = processor(waveform_segment.squeeze(), sampling_rate=sample_rate, return_tensors="pt").to(device)
                     emissions = model(**inputs).logits
                 else:
                     emissions = model(waveform_segment.to(device)).logits

arabcoders avatar Mar 03 '24 16:03 arabcoders
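A slightly more idiomatic version of the same fallback uses `getattr` instead of probing `__dict__` (a sketch, not whisperX code; the `resolve_sampling_rate` name and the 16 kHz default are assumptions, the default matching what wav2vec2 alignment models typically expect):

```python
def resolve_sampling_rate(processor, default=16000):
    """Return the processor's sampling rate across transformers versions.

    Older transformers exposed processor.sampling_rate directly; newer
    versions keep it on processor.feature_extractor. Falls back to
    `default` if neither attribute is present.
    """
    sr = getattr(processor, "sampling_rate", None)
    if sr is None:
        fe = getattr(processor, "feature_extractor", None)
        sr = getattr(fe, "sampling_rate", None) if fe is not None else None
    return sr if sr is not None else default


# Demo with stand-in objects mimicking old and new processor layouts:
class OldProcessor:
    sampling_rate = 16000


class NewFeatureExtractor:
    sampling_rate = 16000


class NewProcessor:
    feature_extractor = NewFeatureExtractor()


print(resolve_sampling_rate(OldProcessor()))  # 16000
print(resolve_sampling_rate(NewProcessor()))  # 16000
print(resolve_sampling_rate(object()))        # 16000 (default)
```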

How did you solve it? I tried to find the code you mentioned, but it doesn't exist.

alfahadgm avatar Mar 12 '24 12:03 alfahadgm

I also don't see the code referenced above.

melanie-rosenberg avatar Mar 15 '24 15:03 melanie-rosenberg

@alfahadgm @melanie-rosenberg, I am unsure why, but this fix was intended for v3.1.2, which seems to have been removed from the repo for some reason.

Maybe @m-bain can shed some light on why.

arabcoders avatar Mar 15 '24 16:03 arabcoders

Thank you @arabcoders -- applying the patch worked while using v3.1.2.

melanie-rosenberg avatar Mar 15 '24 17:03 melanie-rosenberg

FYI @alfahadgm running this also worked: `pip install -U git+https://github.com/m-bain/whisperX.git@78dcfaab51005aa703ee21375f81ed31bc248560`

melanie-rosenberg avatar Mar 15 '24 17:03 melanie-rosenberg
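For reproducible installs, the same commit pin can live in a requirements file instead of an ad-hoc command (a sketch; the hash is the one quoted above):

```
# requirements.txt — pin whisperX to the commit containing the fix
whisperx @ git+https://github.com/m-bain/whisperX.git@78dcfaab51005aa703ee21375f81ed31bc248560
```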

Here's some info about the PyPI release vs. this repo in case anyone else is confused like I was: according to https://github.com/m-bain/whisperX/issues/700#issuecomment-1957790696, the PyPI releases are created by someone other than the maintainer of this repo. The above patch works on top of PR https://github.com/m-bain/whisperX/pull/625.

HHousen avatar Mar 25 '24 06:03 HHousen

@HHousen Any chance you could submit a PR to get that change merged?

eschmidbauer avatar May 09 '24 13:05 eschmidbauer