
transcription in logs file is empty

Open PiotrEsse opened this issue 1 year ago • 3 comments

Hi, thank you for your work, but I am having issues. There is no error, but after running your example I get an almost empty file in logs. The file contains only the following line:

zach (206.8 : 206.8) :

In the terminal there are no errors:

(speechlib39) piotr@Legion7:~/speechlib/examples$ python3 transcribe.py
/home/piotr/anaconda3/envs/speechlib39/lib/python3.9/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
/home/piotr/anaconda3/envs/speechlib39/lib/python3.9/site-packages/torch_audiomentations/utils/io.py:27: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
torchvision is not available - cannot save figures
obama_zach.wav is already in WAV format.
obama_zach.wav is already a mono audio file.
The file already has 16-bit samples.
config.yaml: 100%|██████████| 500/500 [00:00<00:00, 292kB/s]
pytorch_model.bin: 100%|██████████| 17.7M/17.7M [00:00<00:00, 19.4MB/s]
config.yaml: 100%|██████████| 318/318 [00:00<00:00, 36.2kB/s]
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.1.3. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.2.0+cu121. Bad things might happen unless you revert torch to 1.x.
running diarization...
diarization done. Time taken: 17 seconds.
running speaker recognition...
speaker recognition done. Time taken: 4 seconds.
running transcription...
config.json: 100%|██████████| 2.26k/2.26k [00:00<00:00, 660kB/s]
vocabulary.txt: 100%|██████████| 460k/460k [00:00<00:00, 1.02MB/s]
tokenizer.json: 100%|██████████| 2.20M/2.20M [00:00<00:00, 3.03MB/s]
model.bin: 100%|██████████| 1.53G/1.53G [00:58<00:00, 26.0MB/s]
Cannot check for SPDIF
transcription done. Time taken: 140 seconds.
(speechlib39) piotr@Legion7:~/speechlib/examples$ ls
README.md  audio_cache  logs  obama1.mp3  obama1.wav  obama_zach.wav  preprocess.py  pretrained_models  segments  temp  transcribe.py  voices
(speechlib39) piotr@Legion7:~/speechlib/examples$ python3 transcribe.py
/home/piotr/anaconda3/envs/speechlib39/lib/python3.9/site-packages/pyannote/audio/core/io.py:43: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
/home/piotr/anaconda3/envs/speechlib39/lib/python3.9/site-packages/torch_audiomentations/utils/io.py:27: UserWarning: torchaudio._backend.set_audio_backend has been deprecated. With dispatcher enabled, this function is no-op. You can remove the function call.
  torchaudio.set_audio_backend("soundfile")
torchvision is not available - cannot save figures
obama_zach.wav is already in WAV format.
obama_zach.wav is already a mono audio file.
The file already has 16-bit samples.
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.1.3. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint ../../.cache/torch/pyannote/models--pyannote--segmentation/snapshots/c4c8ceafcbb3a7a280c2d357aee9fbc9b0be7f9b/pytorch_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.1.1. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.2.0+cu121. Bad things might happen unless you revert torch to 1.x.
running diarization...
diarization done. Time taken: 14 seconds.
running speaker recognition...
speaker recognition done. Time taken: 4 seconds.
running transcription...
Cannot check for SPDIF
transcription done. Time taken: 82 seconds.

Content of the file: [image attachment]

I have Python 3.9 in a clean conda env. Whisper works flawlessly.

PiotrEsse avatar Jan 31 '24 18:01 PiotrEsse

  1. Did you run the same example from this repo? If not, post your code.
  2. What model size did you use?
  3. Did you input the path to the obama_zach file correctly?
  4. Can you run this in a normal Python environment instead of conda and tell me if the error persists?

NavodPeiris avatar Feb 01 '24 16:02 NavodPeiris

Ad 1. Yes, I ran the same example without any changes. I use diarize.py: ~/speechlib/examples$ python3 transcribe.py

obama_zach_143156_en.txt

Ad 2. I use medium.

Ad 3. Yes, it processes the file. It takes time: 79 seconds, to be precise.

Ad 4. Sure, I'll have to prepare a clean WSL VM.

PiotrEsse avatar Feb 02 '24 13:02 PiotrEsse

This can happen for a number of reasons because of an insane try/except block in this function.

It literally says:

try:
    trans = transcribe(file, language, modelSize, quantization)  
    
    # return -> [[start time, end time, transcript], [start time, end time, transcript], ..]
    texts.append([segment[0], segment[1], trans])
except:
    pass
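The effect of that bare `except` can be reproduced in isolation. The sketch below uses a hypothetical `transcribe_stub` standing in for the real transcription call; any underlying failure silently turns into an empty result:

```python
import traceback

def transcribe_stub(file, language, model_size, quantization):
    # hypothetical stand-in for the real transcription call; here it fails
    # the way faster-whisper can on hardware without float16 support
    raise ValueError("Requested float16 compute type, but the target device "
                     "or backend do not support efficient float16 computation.")

def transcribe_segment(segment, file, language, model_size, quantization):
    texts = []
    try:
        trans = transcribe_stub(file, language, model_size, quantization)
        texts.append([segment[0], segment[1], trans])
    except Exception:
        # print the traceback instead of silently swallowing the error;
        # a bare `except: pass` here is what produces the empty log file
        traceback.print_exc()
    return texts
```

With the original `except: pass`, `texts` stays empty and the log file ends up containing only the speaker/timestamp header.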

I removed this via a monkeypatch, and it revealed the actual issue:

ValueError: Requested float16 compute type, but the target device or backend do not support efficient float16 computation.

This is a common issue with faster-whisper and is discussed here: https://github.com/SYSTRAN/faster-whisper/issues/42. There may be a different error in your case.
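For the float16 case specifically, faster-whisper's `WhisperModel` accepts a `compute_type` argument, so one workaround is to pick a type the hardware can actually handle. This is a sketch with a hypothetical helper name, not speechlib's own API:

```python
def pick_compute_type(device: str, cuda_supports_fp16: bool = True) -> str:
    # hypothetical helper: choose a compute type faster-whisper can use.
    # float16 is only efficient on GPUs that support it; int8 is the usual
    # CPU fallback, float32 the safe choice on older GPUs.
    if device == "cuda":
        return "float16" if cuda_supports_fp16 else "float32"
    return "int8"

# usage sketch (requires faster-whisper to be installed):
# from faster_whisper import WhisperModel
# model = WhisperModel("medium", device="cpu",
#                      compute_type=pick_compute_type("cpu"))
```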

elia-morrison avatar Apr 09 '24 09:04 elia-morrison

I'm having the same problem, and it could be partially solved with

https://github.com/NavodPeiris/speechlib/issues/37

In the meantime, I'll try to create a branch in my fork that doesn't use faster-whisper.

tomich avatar May 30 '24 13:05 tomich

I am getting an empty file at the end when I use the Sinhala language. I know the codebase provides a different model for Sinhala than normal Whisper. Can you please help me with this?

Abhishek-cmd13 avatar Jul 16 '24 07:07 Abhishek-cmd13