Dataset features disappear after initializing Trainer
System Info
- `transformers` version: 4.4.2
- Platform: Linux-5.19.0-41-generic-x86_64-with-glibc2.35
- Python version: 3.11.3
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
@sanchit-gandhi @sgugger
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```python
from typing import Any, Dict, Union

import torch
from torch import nn
from torch.cuda.amp import autocast
from transformers import Trainer


class CTCTrainer(Trainer):
    def _prepare_inputs(self, inputs: Dict[str, Union[torch.Tensor, Any]]) -> Dict[str, Union[torch.Tensor, Any]]:
        # Move every tensor in the batch to the target device (and dtype, under DeepSpeed).
        for k, v in inputs.items():
            if isinstance(v, torch.Tensor):
                kwargs = dict(device=self.args.device)
                if self.deepspeed and inputs[k].dtype != torch.int64:
                    kwargs.update(dict(dtype=self.args.hf_deepspeed_config.dtype()))
                inputs[k] = v.to(**kwargs)
            if k == 'labels':  # labels are a list of tensors, not a tensor, so handle them one by one
                for i in range(len(inputs[k])):
                    kwargs = dict(device=self.args.device)
                    if self.deepspeed and inputs[k][i].dtype != torch.int64:
                        kwargs.update(dict(dtype=self.args.hf_deepspeed_config.dtype()))
                    inputs[k][i] = inputs[k][i].to(**kwargs)

        if self.args.past_index >= 0 and self._past is not None:
            inputs["mems"] = self._past

        return inputs

    def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]]) -> torch.Tensor:
        """
        Perform a training step on a batch of inputs.

        Subclass and override to inject custom behavior.

        Args:
            model (:obj:`nn.Module`):
                The model to train.
            inputs (:obj:`Dict[str, Union[torch.Tensor, Any]]`):
                The inputs and targets of the model.

                The dictionary will be unpacked before being fed to the model. Most models expect the targets under the
                argument :obj:`labels`. Check your model's documentation for all accepted arguments.

        Return:
            :obj:`torch.Tensor`: The tensor with training loss on this batch.
        """
        model.train()
        inputs = self._prepare_inputs(inputs)

        if self.use_amp:
            with autocast():
                loss = self.compute_loss(model, inputs)
        else:
            loss = self.compute_loss(model, inputs)

        if self.args.n_gpu > 1:
            loss = loss.mean()  # average on multi-GPU parallel training

        if self.args.gradient_accumulation_steps > 1:
            loss = loss / self.args.gradient_accumulation_steps

        if self.use_amp:
            self.scaler.scale(loss).backward()
        elif self.use_apex:
            # `amp` here is NVIDIA apex, only needed when the apex fp16 backend is used
            with amp.scale_loss(loss, self.optimizer) as scaled_loss:
                scaled_loss.backward()
        elif self.deepspeed:
            self.deepspeed.backward(loss)
        else:
            loss.backward()

        return loss.detach()
```
Expected behavior
I am trying to run the code from https://github.com/TideDancer/interspeech21_emotion and have tried my best to recreate the environment.
I am using `datasets==1.4.2` and `transformers==4.4.2`.
I manually print out the dataset at each stage to debug. The dataset contains the following features:
```
Dataset({
    features: ['emotion', 'file', 'input_values', 'sampling_rate', 'speech', 'text'],
    num_rows: 507
})
```
The dataset loses all of its features after the Trainer is initialized. My test1 works fine, but my test2 raises an error.
```python
print(val_dataset[0]['file'])
print('my test1----------------------------------')

val_dataset_original = val_dataset

trainer = CTCTrainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=processor.feature_extractor,
)

print(val_dataset_original[0]['file'])
print('my test2----------------------------------')
```
It then raises `KeyError: 'file'`. When I print the dataset again, it turns out only 'input_values' is left.
If this is difficult to reproduce, is there a way I can deep copy the dataset? I need the 'file' information to write out the results.
I have tried `val_dataset_copy = val_dataset`, but both variables are affected by the initialization of the Trainer, since a plain assignment only copies the reference.
Yes, the Trainer removes any inputs not accepted by your model, or your model wouldn't be able to do a forward pass. You can disable this (at your own risk) by setting `remove_unused_columns` to `False` in your `TrainingArguments`.
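For illustration, a minimal sketch of that setting (the `output_dir` here is hypothetical):

```python
from transformers import TrainingArguments

# Keep all dataset columns by disabling the automatic column pruning.
# At your own risk: columns the model does not accept will now reach your
# data collator, and the forward pass may fail if they are passed through.
training_args = TrainingArguments(
    output_dir="./results",  # hypothetical output directory
    remove_unused_columns=False,
)
```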
@sgugger Thank you for replying.
I am facing this problem during model testing, in the `do_predict` part.
I still need the feature information from the dataset as it was before the Trainer is initialized, for instance the file name.
So based on your answer above, I am thinking of deep copying the dataset, so I can loop through the identical dataset by index to get the information I need while feeding `val_dataset` to the Trainer.
I am new to Hugging Face, may I know what's the conventional way to do so?
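For illustration, a minimal sketch of one approach, assuming the column names from this dataset: materialise the columns you need as plain Python lists before creating the Trainer, instead of deep copying the whole dataset.

```python
# Column access returns a plain Python list, which is unaffected when the
# Trainer later drops columns from the dataset object itself.
files = val_dataset['file']
durations = [len(x) / 16000 for x in val_dataset['input_values']]  # seconds at 16 kHz
```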
```python
trainer = CTCTrainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=processor.feature_extractor,
)
# print(val_dataset_original[0]['file'])
# print('my test2----------------------------------')

if last_checkpoint is not None:
    checkpoint = last_checkpoint
elif model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path):
    checkpoint = model_args.model_name_or_path
else:
    checkpoint = None

if training_args.do_train:
    trainer.train(resume_from_checkpoint=checkpoint)
    trainer.save_model()

if training_args.do_predict:
    logger.info('******* Predict ********')
    data_collator.audio_only = True
    predictions, labels, metrics = trainer.predict(val_dataset, metric_key_prefix="predict")
    logits_ctc, logits_cls = predictions
    pred_ids = np.argmax(logits_cls, axis=-1)
    pred_probs = F.softmax(torch.from_numpy(logits_cls).float(), dim=-1)
    print(val_dataset)
    # the `with` block closes the file automatically, so no explicit close is needed
    with open(data_args.output_file, 'w') as f:
        for i in range(len(pred_ids)):
            f.write(val_dataset[i]['file'].split("/")[-1] + " " + str(len(val_dataset[i]['input_values']) / 16000) + " ")
            pred = pred_ids[i]
            f.write(str(pred) + ' ')
            for j in range(4):
                f.write(' ' + str(pred_probs[i][j].item()))
            f.write('\n')
```
Hey @lxrswdd - I see the `_prepare_inputs` method that you've overridden in the `Trainer` class is purely there to get your dataset into the right format for the model.
What you're probably better off doing here is pre-processing your dataset ahead of time, transforming the raw audio values into normalised model input values using an appropriate feature extractor. You can do this quite straightforwardly with 🤗 Datasets' `.map` method.
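For instance, a minimal sketch of such a `.map` call, assuming a `Wav2Vec2FeatureExtractor` and that each example stores its raw waveform under the `speech` column (as in this dataset):

```python
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")

def prepare_example(batch):
    # Normalise the raw audio into model-ready input values.
    batch["input_values"] = feature_extractor(
        batch["speech"], sampling_rate=16000
    ).input_values[0]
    return batch

val_dataset = val_dataset.map(prepare_example)
```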
Once you have your pre-processed input values, you can collate them into batches by defining an appropriate data collator (a sketch follows below). We have several end-to-end examples that perform the pre-processing and collation steps for you: all you need to do is switch the dataset ID to your dataset on the Hub. See examples/speech-recognition for details.
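For reference, a condensed sketch of the padding collator used in those examples and in the blog post linked below, assuming a `Wav2Vec2Processor` and a `labels` column produced by its tokenizer:

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union

import torch
from transformers import Wav2Vec2Processor

@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        # Audio inputs and token labels need different padding strategies,
        # so pad them separately.
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]

        batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(label_features, padding=self.padding, return_tensors="pt")

        # Replace padding with -100 so it is ignored by the CTC loss.
        batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        return batch
```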
Likewise, you can follow this excellent blog post for fine-tuning a CTC system with the 🤗 Trainer API: https://huggingface.co/blog/fine-tune-wav2vec2-english
The only real engineering work you'll have to do if you follow these guides is getting your dataset in the right format, for which you can follow this page: https://huggingface.co/docs/datasets/audio_dataset
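As a starting point, a minimal sketch following that guide (note it relies on the `Audio` feature from a newer 🤗 Datasets release than the 1.4.2 used above, and the file paths here are hypothetical):

```python
from datasets import Audio, Dataset

dataset = Dataset.from_dict({"audio": ["path/to/file_1.wav", "path/to/file_2.wav"]})
# Decode and resample each file to 16 kHz on access.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
```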
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.