Improve error handling: No active speech found in audio
openlrc version: 1.5.2
When trying to transcribe a video that has no human voice, the run fails with `RuntimeError: stack expects a non-empty TensorList`.
I found the following text in the log:
[2024-09-19 22:48:52] INFO [Producer_0] Audio length: /home/user00/gitspace/video_tools/.data/no-speech/preprocessed/no-speech_preprocessed.wav: 00:25:14,243
No active speech found in audio
Is it possible for openlrc to handle this situation and end the transcription task early? Generating an empty subtitle file and returning its path as usual might be a reasonable way to deal with it.
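For now I work around it on the caller side by catching the error and writing an empty subtitle myself. A minimal sketch; the `LRCer` constructor arguments mirror the preview below but the parameter names are from memory, the `run()` arguments come from the traceback, and the output-path convention is an assumption:

```python
from pathlib import Path

from openlrc import LRCer

def transcribe_or_empty(audio_path: str) -> Path:
    """Transcribe with openlrc; on silent audio, emit an empty subtitle instead."""
    lrcer = LRCer(whisper_model='tiny', compute_type='int8')  # parameter names assumed
    subtitle = Path(audio_path).with_suffix('.srt')  # assumed output naming
    try:
        lrcer.run(audio_path, skip_trans=True, clear_temp=True)
    except RuntimeError as e:
        # faster-whisper raises this when VAD leaves no segments to stack
        if 'non-empty TensorList' not in str(e):
            raise
        subtitle.write_text('', encoding='utf-8')  # no speech -> empty subtitle
    return subtitle
```

Full log: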
2024-09-19 22:48:16.532 | INFO | video_tools.transcribe.base_transcriber:preview:93 - preview transcribe task:
TranscribeMetadata(
│ params=TranscribeParams(model='tiny', device='cpu', compute_type='int8'),
│ audios=[
│ │ AudioMetadata(path=PosixPath('/home/user00/gitspace/video_tools/.data/no-speech/no-speech.mp4'), hash='6e8b9718e3f6c6f60be6c25f766e3da885995f557d541989a341896feff6d505', subtitle=None, error=None)
│ ]
)
2024-09-19 22:48:16.622 | INFO | video_tools.transcribe.base_transcriber:preview:95 - total audios num: 1
Do you want to continue? [y/N]: y
Lightning automatically upgraded your loaded checkpoint from v1.5.4 to v2.4.0. To apply the upgrade to your files permanently, run `python -m pytorch_lightning.utilities.upgrade_checkpoint .venv/lib/python3.11/site-packages/faster_whisper/assets/pyannote_vad_model.bin`
Model was trained with pyannote.audio 0.0.1, yours is 3.3.2. Bad things might happen unless you revert pyannote.audio to 0.x.
Model was trained with torch 1.10.0+cu102, yours is 2.2.2+cu121. Bad things might happen unless you revert torch to 1.x.
[2024-09-19 22:48:18] INFO [MainThread] File /home/user00/gitspace/video_tools/.data/no-speech/no-speech.mp4: Audio sample rate: 44100
[2024-09-19 22:48:19] INFO [MainThread] Loudness normalizing...
[2024-09-19 22:48:19] INFO [MainThread] Normalizing file no-speech.wav (1 of 1)
[2024-09-19 22:48:19] INFO [MainThread] Running first pass loudnorm filter for stream 0
[2024-09-19 22:48:48] INFO [MainThread] Running second pass for /home/user00/gitspace/video_tools/.data/no-speech/no-speech.wav
[2024-09-19 22:48:52] INFO [MainThread] Normalized file written to /home/user00/gitspace/video_tools/.data/no-speech/preprocessed/no-speech_ln.wav
[2024-09-19 22:48:52] INFO [MainThread] Preprocessed audio saved to /home/user00/gitspace/video_tools/.data/no-speech/preprocessed/no-speech_preprocessed.wav
[2024-09-19 22:48:52] INFO [MainThread] Working on 1 audio files: [PosixPath('/home/user00/gitspace/video_tools/.data/no-speech/preprocessed/no-speech_preprocessed.wav')]
[2024-09-19 22:48:52] INFO [MainThread] Start Transcription (Producer) and Translation (Consumer) process
[2024-09-19 22:48:52] INFO [Producer_0] Start Transcription process
[2024-09-19 22:48:52] INFO [Producer_0] Audio length: /home/user00/gitspace/video_tools/.data/no-speech/preprocessed/no-speech_preprocessed.wav: 00:25:14,243
No active speech found in audio
[2024-09-19 22:49:24] INFO [Producer_0] Detected language: en (0.58) in first 30s of audio...
[2024-09-19 22:49:24] INFO [Producer_0] Transcription process Elapsed: 31.53s
[2024-09-19 22:49:24] INFO [MainThread] Transcription (Producer) and Translation (Consumer) process Elapsed: 31.53s
Traceback (most recent call last):
File "/home/user00/gitspace/video_tools/video_tools/main.py", line 6, in <module>
fire.Fire(OpenLRCTranscriber)
File "/home/user00/gitspace/video_tools/.venv/lib/python3.11/site-packages/fire/core.py", line 143, in Fire
component_trace = _Fire(component, args, parsed_flag_args, context, name)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user00/gitspace/video_tools/.venv/lib/python3.11/site-packages/fire/core.py", line 477, in _Fire
component, remaining_args = _CallAndUpdateTrace(
^^^^^^^^^^^^^^^^^^^^
File "/home/user00/gitspace/video_tools/.venv/lib/python3.11/site-packages/fire/core.py", line 693, in _CallAndUpdateTrace
component = fn(*varargs, **kwargs)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/user00/gitspace/video_tools/video_tools/transcribe/base_transcriber.py", line 125, in run
return self._transcribe()
^^^^^^^^^^^^^^^^^^
File "/home/user00/gitspace/video_tools/video_tools/transcribe/openlrc_transcriber.py", line 13, in _transcribe
return self._lrcer.run(self._audios, skip_trans=True, clear_temp=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user00/gitspace/video_tools/.venv/lib/python3.11/site-packages/openlrc/openlrc.py", line 370, in run
producer.result()
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 456, in result
return self.__get_result()
^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/concurrent/futures/_base.py", line 401, in __get_result
raise self._exception
File "/usr/lib/python3.11/concurrent/futures/thread.py", line 58, in run
result = self.fn(*self.args, **self.kwargs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user00/gitspace/video_tools/.venv/lib/python3.11/site-packages/openlrc/openlrc.py", line 122, in produce_transcriptions
segments, info = self.transcriber.transcribe(audio_path, language=src_lang)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user00/gitspace/video_tools/.venv/lib/python3.11/site-packages/openlrc/transcribe.py", line 81, in transcribe
seg_gen, info = self.whisper_model.transcribe(str(audio_path), language=language, **self.asr_options)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user00/gitspace/video_tools/.venv/lib/python3.11/site-packages/faster_whisper/transcribe.py", line 523, in transcribe
features = torch.stack(
^^^^^^^^^^^^
RuntimeError: stack expects a non-empty TensorList
There is an existing PR for Faster-Whisper that implements early stopping for non-voice audio: https://github.com/SYSTRAN/faster-whisper/pull/1014. Until it's merged, there seems to be no straightforward way to stop early without adding an extra VAD pass, which is computationally intensive and unnecessary for most users.
As a workaround, you could run voice activity detection with pyannote on your local machine before sending the audio to openlrc, along the lines of the sketch below.
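A minimal sketch with pyannote.audio 3.x; the model choice, hyperparameter values, and file path are illustrative, and a Hugging Face access token is required to download the pretrained model:

```python
from pyannote.audio import Model
from pyannote.audio.pipelines import VoiceActivityDetection

# Load a pretrained segmentation model and build a VAD pipeline from it.
model = Model.from_pretrained('pyannote/segmentation-3.0', use_auth_token='HF_TOKEN')
pipeline = VoiceActivityDetection(segmentation=model)
pipeline.instantiate({'min_duration_on': 0.0, 'min_duration_off': 0.0})

# The pipeline returns an Annotation; an empty timeline means no active speech.
speech_regions = pipeline('no-speech_preprocessed.wav').get_timeline().support()
if not speech_regions:
    print('No active speech found, skipping openlrc for this file')
```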
This should be fixed with the latest version of Faster-Whisper in v1.6.0. Please reopen if the issue persists.
Thank you for following up on this issue and releasing version 1.6.0.
In faster-whisper==1.1.0 (used by openlrc==1.6.0), `VadOptions` has a member named `onset`, not `threshold`:
# https://github.com/SYSTRAN/faster-whisper/blob/v1.1.0/faster_whisper/vad.py#L37
class VadOptions:
    onset: float = 0.5
This causes the following error:
File "/home/user00/gitspace/video_tools/.venv/lib/python3.11/site-packages/openlrc/openlrc.py", line 122, in produce_transcriptions
segments, info = self.transcriber.transcribe(audio_path, language=src_lang)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user00/gitspace/video_tools/.venv/lib/python3.11/site-packages/openlrc/transcribe.py", line 86, in transcribe
seg_gen, info = self.whisper_model.transcribe(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/user00/gitspace/video_tools/.venv/lib/python3.11/site-packages/faster_whisper/transcribe.py", line 404, in transcribe
vad_parameters = VadOptions(
^^^^^^^^^^^
TypeError: VadOptions.__init__() got an unexpected keyword argument 'threshold'
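Until openlrc adapts to the rename, one local option is to translate the old key before it reaches faster-whisper. Purely illustrative; the helper name is mine, and it assumes you can intercept the vad_parameters dict that openlrc builds:

```python
def adapt_vad_options(vad_parameters: dict) -> dict:
    """Map the pre-1.1.0 'threshold' key to the faster-whisper 1.1.0 'onset' key."""
    vad_parameters = dict(vad_parameters)  # don't mutate the caller's dict
    if 'threshold' in vad_parameters and 'onset' not in vad_parameters:
        vad_parameters['onset'] = vad_parameters.pop('threshold')
    return vad_parameters
```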
Alternatively, pinning faster-whisper to the following commit solves the problem:
faster-whisper = { url = "https://github.com/SYSTRAN/faster-whisper/archive/8327d8cc647266ed66f6cd878cf97eccface7351.tar.gz" }
Thanks! I've updated this dependency.