CTranslate2 Whisper encodings differ in batch vs serial

Open ExarchD opened this issue 5 months ago • 0 comments

I've been examining the encoder output of Whisper, and the results differ depending on whether the same input is passed in as a batch or one item at a time.

I wrote a test script to check this: the same features, shape [80, 3000], are encoded 5 times individually and then once as a single [5, 80, 3000] tensor. The encoder outputs are not identical. Does anyone know why?

import transformers
import ctranslate2
import torchaudio
import numpy as np
import torch

tmp_dir = 'tmp/'
device = 'cuda'

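audio_path = "test.wav"
audio, sr = torchaudio.load(audio_path)
# Keep the first channel and resample to the 16 kHz rate passed to the processor below.
audio = torchaudio.functional.resample(audio[0], sr, 16000)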

model_name = "openai/whisper-large-v2"
converter = ctranslate2.converters.TransformersConverter(model_name)
output_dir = converter.convert(tmp_dir)


model = ctranslate2.models.Whisper(output_dir, device=device)
processor = transformers.WhisperProcessor.from_pretrained(model_name)
inputs = processor(audio, sampling_rate=16000)
features = inputs.input_features[0]
features = np.expand_dims(features, 0)

batch_size = 5
# Repeat the same [80, 3000] feature matrix batch_size times.
comb_feats = [features[0]] * batch_size

# Encode all copies in one call as a [batch_size, 80, 3000] batch.
feats = np.stack(comb_feats)
feats = ctranslate2.StorageView.from_array(feats)
batch_encoded = model.encode(feats)
batch_encoded = torch.as_tensor(batch_encoded, device=device)

# Encode the same features one at a time (batch size 1) for comparison.
serial_encoded = []
for _ in range(batch_size):
    feats = ctranslate2.StorageView.from_array(features)
    enc = torch.as_tensor(model.encode(feats), device=device)[0]
    serial_encoded.append(enc)

serial_encoded = torch.stack(serial_encoded)

print(batch_encoded.shape)
print(serial_encoded.shape)
print((batch_encoded==serial_encoded).all())
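
The final check above uses exact equality, which fails on even a one-bit difference. A tolerance-based comparison can show how large the mismatch actually is. The sketch below is only an illustration: `compare_encodings` is a hypothetical helper (not part of CTranslate2), the tolerances are arbitrary assumptions, and the arrays are synthetic stand-ins rather than real encoder output. It does not establish the cause of the mismatch; it only measures its size.

```python
import numpy as np


def compare_encodings(batch_out, serial_out, rtol=1e-3, atol=1e-5):
    """Summarize how far apart two encoder outputs are.

    Hypothetical helper for illustration; rtol/atol are assumed values.
    """
    a = np.asarray(batch_out, dtype=np.float64)
    b = np.asarray(serial_out, dtype=np.float64)
    if a.shape != b.shape:
        raise ValueError(f"shape mismatch: {a.shape} vs {b.shape}")
    abs_diff = np.abs(a - b)
    return {
        "identical": bool(np.array_equal(a, b)),
        "max_abs_diff": float(abs_diff.max()),
        "mean_abs_diff": float(abs_diff.mean()),
        "allclose": bool(np.allclose(a, b, rtol=rtol, atol=atol)),
    }


# Synthetic stand-in data: a reference array and a slightly perturbed copy.
rng = np.random.default_rng(0)
reference = rng.standard_normal((5, 4, 8)).astype(np.float32)
perturbed = reference + np.float32(1e-6)

report = compare_encodings(perturbed, reference)
print(report)
```

With the tensors from the script, this could be called as compare_encodings(batch_encoded.cpu().numpy(), serial_encoded.cpu().numpy()). A small max_abs_diff with allclose set to True would point toward floating-point rounding differences between the two code paths; a large value would suggest something else is going on.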

Jan 19 '24 22:01 ExarchD
