Transcribe on GPU

Open take0x opened this issue 1 year ago • 5 comments

Ideally, log_mel_spectrogram() should use model.device when transcribing.

take0x avatar Sep 09 '24 10:09 take0x

Benchmarks?

ExtReMLapin avatar Sep 20 '24 04:09 ExtReMLapin

I am not sure that this is a valuable change.

While it is not a robust benchmark, I ran an experiment on my local machine: 10 runs of log_mel_spectrogram() on a 177-minute audio file:

CPU:
  mean: 0.948
  std_dev: 0.208

GPU:
  mean: 2.67
  std_dev: 1.20

Machine specs:

  • CPU: i5-13500HX
  • GPU: GeForce RTX 4050 Laptop GPU

Note: The audio takes ~30s to load and ~330s to transcribe, so the difference of one or two seconds seems largely moot regardless.
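For anyone who wants to repeat a comparison like this, here is a minimal timing sketch. The `benchmark` helper is hypothetical and reports seconds; the thread does not say how the figures above were gathered or in what unit, so treat this as one possible approach, not the method used here.

```python
import statistics
import time
from typing import Callable, Dict


def benchmark(fn: Callable[[], object], repeats: int = 10) -> Dict[str, float]:
    """Call fn() `repeats` times; return mean and sample std dev in seconds."""
    if repeats < 2:
        raise ValueError("repeats must be at least 2 to compute a standard deviation")
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return {"mean": statistics.mean(timings), "std_dev": statistics.stdev(timings)}


if __name__ == "__main__":
    # Stand-in workload so the sketch runs without an audio file or a GPU.
    result = benchmark(lambda: sum(i * i for i in range(100_000)))
    print(f"mean: {result['mean']:.3g}  std_dev: {result['std_dev']:.3g}")
```

To time the real computation, the callable could be something like `lambda: whisper.log_mel_spectrogram(audio, device="cuda")`. CUDA work is launched asynchronously, so a GPU measurement should also call `torch.cuda.synchronize()` inside the timed callable; otherwise the timer may stop before the work finishes.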

kittsil avatar Sep 21 '24 20:09 kittsil

What is important is that the device specified in load_model() should be used when transcribing, rather than the actual benchmark result.

take0x avatar Sep 22 '24 13:09 take0x

@take0x, I was using my GPU to transcribe.

> What is important is that the device specified in load_model() should be used when transcribing, rather than the actual benchmark result.

The device specified is used to transcribe. The log_mel_spectrogram() computation, which is a preprocessing step and NOT part of the NN model, defaults to using the CPU.

I think most consumers of the code would say "the fastest device available should be used to create the mel spectrogram." Given the nature of the computation, a CPU is almost always going to be the faster device (and should therefore be the default), regardless of which device the NN (a very different computation) runs on.

You're more likely to get a PR approved if it adds an optional mel_spectrogram_device parameter that allows that computation to run on a specific device, but even then... I'm not sure this has much value compared to the noise of adding another parameter.
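A rough sketch of what that optional parameter could look like. `compute_mel` and `mel_fn` are hypothetical names for illustration only; nothing here is existing whisper code, and the CPU default simply follows the reasoning above.

```python
from typing import Any, Callable, Optional


def compute_mel(
    audio: Any,
    mel_fn: Callable[[Any, Any], Any],
    mel_spectrogram_device: Optional[Any] = None,
) -> Any:
    """Run mel_fn(audio, device) on an explicitly chosen device.

    Assumption: when no device is given, the CPU is used, so existing
    callers see no change in behaviour.
    """
    device = mel_spectrogram_device if mel_spectrogram_device is not None else "cpu"
    return mel_fn(audio, device)


if __name__ == "__main__":
    # Fake spectrogram function so the sketch runs without whisper or torch.
    def fake_mel(audio: Any, device: Any) -> str:
        return f"mel({audio}) on {device}"

    print(compute_mel("clip.wav", fake_mel))
    print(compute_mel("clip.wav", fake_mel, mel_spectrogram_device="cuda"))
```

With whisper installed, `mel_fn` could be `lambda a, d: whisper.log_mel_spectrogram(a, device=d)`, and passing `model.device` as `mel_spectrogram_device` would give the behaviour requested at the top of this issue while leaving the default untouched.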

kittsil avatar Sep 22 '24 17:09 kittsil

@kittsil Thank you for your advice.

In my case, when transcribing large amounts of audio data, there have been cases where the process crashed on the CPU but could be processed normally on the GPU. I think it would be useful to be able to transcribe using devices other than a CPU.

I'll try adding the mel_spectrogram_device parameter based on your advice.

take0x avatar Sep 22 '24 22:09 take0x