Error in DataCollatorSpeechSeq2SeqWithPadding (Unit 5)
In Unit 5 of the audio course, the following code is used:
```python
import torch
from dataclasses import dataclass
from typing import Any, Dict, List, Union


@dataclass
class DataCollatorSpeechSeq2SeqWithPadding:
    processor: Any

    def __call__(
        self, features: List[Dict[str, Union[List[int], torch.Tensor]]]
    ) -> Dict[str, torch.Tensor]:
        # split inputs and labels since they have to be of different lengths
        # and need different padding methods
        # first treat the audio inputs by simply returning torch tensors
        input_features = [
            {"input_features": feature["input_features"][0]} for feature in features
        ]
        batch = self.processor.feature_extractor.pad(input_features, return_tensors="pt")

        # get the tokenized label sequences
        label_features = [{"input_ids": feature["labels"]} for feature in features]
        # pad the labels to max length
        labels_batch = self.processor.tokenizer.pad(label_features, return_tensors="pt")

        # replace padding with -100 to ignore loss correctly
        labels = labels_batch["input_ids"].masked_fill(
            labels_batch.attention_mask.ne(1), -100
        )

        # if bos token is appended in previous tokenization step,
        # cut bos token here as it's appended later anyways
        if (labels[:, 0] == self.processor.tokenizer.bos_token_id).all().cpu().item():
            labels = labels[:, 1:]

        batch["labels"] = labels
        return batch
```
However, according to the following issue, `bos_token_id` shouldn't be used (@ArthurZucker). In my opinion, it should be replaced with `self.processor.tokenizer.convert_tokens_to_ids("<|startoftranscript|>")` or with `model.config.decoder_start_token_id` (see the sketch below). What do you think?
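For concreteness, here is a minimal sketch of the replacement I have in mind for the final check in the collator above (the intermediate variable name is just for illustration):

```python
# compare against the actual decoder start token instead of bos_token_id,
# which Whisper sets to "<|endoftext|>" rather than "<|startoftranscript|>"
decoder_start_token_id = self.processor.tokenizer.convert_tokens_to_ids(
    "<|startoftranscript|>"
)
if (labels[:, 0] == decoder_start_token_id).all().cpu().item():
    labels = labels[:, 1:]
```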
Note that if this is true, there would be a similar error in @sanchit-gandhi's fine-tuning tutorial too.
Thanks for your attention.
Regards, Tony
If `self.processor.tokenizer.bos_token_id` is correctly set (it should not be used, in the sense that it is not used: if `forced_decoder_ids` is set, it will be taken instead of this one), then I don't really see a problem.
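For reference, this is the kind of generation-time setup I mean, sketched with illustrative names:

```python
# at generation time the decoder prompt comes from forced_decoder_ids,
# not from bos_token_id (model, processor and input_features are illustrative)
forced_decoder_ids = processor.get_decoder_prompt_ids(language="en", task="transcribe")
predicted_ids = model.generate(input_features, forced_decoder_ids=forced_decoder_ids)
```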
If you look at the issue I mentioned, it seems that you stated the opposite. Here's what you wrote for context:

> I'll update the documentation to make it less confusing. The token used to store the `"<|startoftranscript|>"` token is `decoder_start_token_id`. The `bos_token` is pretty much unused, which is why it was set to the same as `eos_token`.
I might be wrong, but `forced_decoder_ids` is only used with `generate`, not with `forward`. And I checked the `forward` source code: with the current collator, `batch["labels"]` would always start with `<|startoftranscript|>`. However, a redundant `<|startoftranscript|>` would also be prepended when calling `forward` with the `labels` argument and with `decoder_input_ids` set to `None` (see source code and the sketch below). Let me know if you think I'm wrong.
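To make the redundancy concrete, here is a minimal sketch of what I believe `forward` does when only `labels` are passed (token ids are for the multilingual checkpoint and only for illustration):

```python
import torch
from transformers.models.whisper.modeling_whisper import shift_tokens_right

# toy labels that already start with <|startoftranscript|> (id 50258), followed by
# <|en|>, <|transcribe|>, <|notimestamps|>, some text ids and <|endoftext|>
labels = torch.tensor([[50258, 50259, 50359, 50363, 123, 456, 50257]])

# this mirrors what forward does when decoder_input_ids is None: shift the labels
# right and prepend decoder_start_token_id (also 50258)
decoder_input_ids = shift_tokens_right(
    labels, pad_token_id=50257, decoder_start_token_id=50258
)
print(decoder_input_ids)  # starts with 50258, 50258 -> a redundant <|startoftranscript|>
```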
cc @sanchit-gandhi