Dataset features disappear after initializing Trainer
System Info
- `transformers` version: 4.4.2
- Platform: Linux-5.19.0-41-generic-x86_64-with-glibc2.35
- Python version: 3.11.3
- PyTorch version (GPU?): 2.0.1+cu117 (True)
- Tensorflow version (GPU?): not installed (NA)
- Using GPU in script?:
- Using distributed or parallel set-up in script?:
Who can help?
@sanchit-gandhi @sgugger
Information
- [ ] The official example scripts
- [ ] My own modified scripts
Tasks
- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)
Reproduction
```python
from typing import Any, Dict, Union

import torch
from torch import nn
from torch.cuda.amp import autocast
from transformers import Trainer


class CTCTrainer(Trainer):
    def _prepare_inputs(self, inputs: Dict[str, Union[torch.Tensor, Any]]) -> Dict[str, Union[torch.Tensor, Any]]:
        # Move every tensor in the batch to the target device (and dtype, under DeepSpeed).
        for k, v in inputs.items():
            if isinstance(v, torch.Tensor):
                kwargs = dict(device=self.args.device)
                if self.deepspeed and inputs[k].dtype != torch.int64:
                    kwargs.update(dict(dtype=self.args.hf_deepspeed_config.dtype()))
                inputs[k] = v.to(**kwargs)
            if k == 'labels':  # labels are a list of tensors, not a tensor, so handle them one by one
                for i in range(len(inputs[k])):
                    kwargs = dict(device=self.args.device)
                    if self.deepspeed and inputs[k][i].dtype != torch.int64:
                        kwargs.update(dict(dtype=self.args.hf_deepspeed_config.dtype()))
                    inputs[k][i] = inputs[k][i].to(**kwargs)

        if self.args.past_index >= 0 and self._past is not None:
            inputs["mems"] = self._past

        return inputs

    def training_step(self, model: nn.Module, inputs: Dict[str, Union[torch.Tensor, Any]]) -> torch.Tensor:
        """
        Perform a training step on a batch of inputs.

        Subclass and override to inject custom behavior.

        Args:
            model (:obj:`nn.Module`):
                The model to train.
            inputs (:obj:`Dict[str, Union[torch.Tensor, Any]]`):
                The inputs and targets of the model.

                The dictionary will be unpacked before being fed to the model. Most models expect the targets under the
                argument :obj:`labels`. Check your model's documentation for all accepted arguments.

        Return:
            :obj:`torch.Tensor`: The tensor with training loss on this batch.
        """
        model.train()
        inputs = self._prepare_inputs(inputs)

        if self.use_amp:
            with autocast():
                loss = self.compute_loss(model, inputs)
        else:
            loss = self.compute_loss(model, inputs)

        if self.args.n_gpu > 1:
            loss = loss.mean()  # average on multi-GPU parallel training

        if self.args.gradient_accumulation_steps > 1:
            loss = loss / self.args.gradient_accumulation_steps

        if self.use_amp:
            self.scaler.scale(loss).backward()
        elif self.use_apex:
            # `amp` here is NVIDIA apex, only needed when the apex fp16 backend is used
            with amp.scale_loss(loss, self.optimizer) as scaled_loss:
                scaled_loss.backward()
        elif self.deepspeed:
            self.deepspeed.backward(loss)
        else:
            loss.backward()

        return loss.detach()
```
Expected behavior
I am trying to run the code from https://github.com/TideDancer/interspeech21_emotion and have tried my best to recreate the environment.
I am using `datasets==1.4.2` and `transformers==4.4.2`.
I manually print out the dataset at each stage to debug. The dataset contains the following features:
```
Dataset({
    features: ['emotion', 'file', 'input_values', 'sampling_rate', 'speech', 'text'],
    num_rows: 507
})
```
The dataset loses all of its features after the Trainer is initialized. My test1 works fine, but my test2 raises an error.
```python
print(val_dataset[0]['file'])
print('my test1----------------------------------')

val_dataset_original = val_dataset

trainer = CTCTrainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=processor.feature_extractor,
)

print(val_dataset_original[0]['file'])
print('my test2----------------------------------')
```
It then raises `KeyError: 'file'`. When I print the dataset again, it turns out only 'input_values' is left.
If this is difficult to reproduce, is there a way I can deep copy the dataset? I need the 'file' information to write out the results.
I have tried `val_dataset_copy = val_dataset`, but both variables are affected by the initialization of the Trainer, since a plain assignment only copies the reference.
Yes, the Trainer removes any inputs not accepted by your model, or your model wouldn't be able to do a forward pass. You can disable this (at your own risk) by setting `remove_unused_columns` to `False` in your `TrainingArguments`.
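For illustration, a minimal sketch of that setting (the `output_dir` here is hypothetical):

```python
from transformers import TrainingArguments

# Keep all dataset columns by disabling the automatic column pruning.
# At your own risk: columns the model does not accept will now reach your
# data collator, and the forward pass may fail if they are passed through.
training_args = TrainingArguments(
    output_dir="./results",  # hypothetical output directory
    remove_unused_columns=False,
)
```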
@sgugger Thank you for replying.
I am facing this problem during model testing, in the `do_predict` part.
I still need the feature information from the dataset as it was before the Trainer is initialized, for instance the file name.
So based on your answer above, I am thinking of deep copying the dataset, so I can loop through the identical dataset by index to get the information I need while feeding `val_dataset` to the Trainer.
I am new to Hugging Face, may I know what's the conventional way to do so?
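For illustration, a minimal sketch of one approach, assuming the column names from this dataset: materialise the columns you need as plain Python lists before creating the Trainer, instead of deep copying the whole dataset.

```python
# Column access returns a plain Python list, which is unaffected when the
# Trainer later drops columns from the dataset object itself.
files = val_dataset['file']
durations = [len(x) / 16000 for x in val_dataset['input_values']]  # seconds at 16 kHz
```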
```python
trainer = CTCTrainer(
    model=model,
    data_collator=data_collator,
    args=training_args,
    compute_metrics=compute_metrics,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    tokenizer=processor.feature_extractor,
)
# print(val_dataset_original[0]['file'])
# print('my test2----------------------------------')

if last_checkpoint is not None:
    checkpoint = last_checkpoint
elif model_args.model_name_or_path is not None and os.path.isdir(model_args.model_name_or_path):
    checkpoint = model_args.model_name_or_path
else:
    checkpoint = None

if training_args.do_train:
    trainer.train(resume_from_checkpoint=checkpoint)
    trainer.save_model()

if training_args.do_predict:
    logger.info('******* Predict ********')
    data_collator.audio_only = True
    predictions, labels, metrics = trainer.predict(val_dataset, metric_key_prefix="predict")
    logits_ctc, logits_cls = predictions
    pred_ids = np.argmax(logits_cls, axis=-1)
    pred_probs = F.softmax(torch.from_numpy(logits_cls).float(), dim=-1)
    print(val_dataset)
    # the `with` block closes the file automatically, so no explicit close is needed
    with open(data_args.output_file, 'w') as f:
        for i in range(len(pred_ids)):
            f.write(val_dataset[i]['file'].split("/")[-1] + " " + str(len(val_dataset[i]['input_values']) / 16000) + " ")
            pred = pred_ids[i]
            f.write(str(pred) + ' ')
            for j in range(4):
                f.write(' ' + str(pred_probs[i][j].item()))
            f.write('\n')
```
Hey @lxrswdd - I see the `_prepare_inputs` method that you've overridden in the `Trainer` class is purely there to get your dataset into the right format for the model.
What you're probably better off doing here is pre-processing your dataset ahead of time, transforming the raw audio values into normalised model input values using an appropriate feature extractor. You can do this quite straightforwardly with 🤗 Datasets' `.map` method.
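For instance, a minimal sketch of such a `.map` call, assuming a `Wav2Vec2FeatureExtractor` and that each example stores its raw waveform under the `speech` column (as in this dataset):

```python
from transformers import Wav2Vec2FeatureExtractor

feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("facebook/wav2vec2-base")

def prepare_example(batch):
    # Normalise the raw audio into model-ready input values.
    batch["input_values"] = feature_extractor(
        batch["speech"], sampling_rate=16000
    ).input_values[0]
    return batch

val_dataset = val_dataset.map(prepare_example)
```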
Once you have your pre-processed input values, you can collate them into batches by defining an appropriate data collator (a sketch follows below). We have several end-to-end examples that perform the pre-processing and collation steps for you: all you need to do is switch the dataset ID to your dataset on the Hub. See examples/speech-recognition for details.
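For reference, a condensed sketch of the padding collator used in those examples and in the blog post linked below, assuming a `Wav2Vec2Processor` and a `labels` column produced by its tokenizer:

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Union

import torch
from transformers import Wav2Vec2Processor

@dataclass
class DataCollatorCTCWithPadding:
    processor: Wav2Vec2Processor
    padding: Union[bool, str] = True

    def __call__(self, features: List[Dict[str, Any]]) -> Dict[str, torch.Tensor]:
        # Audio inputs and token labels need different padding strategies,
        # so pad them separately.
        input_features = [{"input_values": f["input_values"]} for f in features]
        label_features = [{"input_ids": f["labels"]} for f in features]

        batch = self.processor.pad(input_features, padding=self.padding, return_tensors="pt")
        with self.processor.as_target_processor():
            labels_batch = self.processor.pad(label_features, padding=self.padding, return_tensors="pt")

        # Replace padding with -100 so it is ignored by the CTC loss.
        batch["labels"] = labels_batch["input_ids"].masked_fill(labels_batch.attention_mask.ne(1), -100)
        return batch
```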
Likewise, you can follow this excellent blog post for fine-tuning a CTC system with the 🤗 Trainer API: https://huggingface.co/blog/fine-tune-wav2vec2-english
The only real engineering work you'll have to do if you follow these guides is getting your dataset in the right format, for which you can follow this page: https://huggingface.co/docs/datasets/audio_dataset
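As a starting point, a minimal sketch following that guide (note it relies on the `Audio` feature from a newer 🤗 Datasets release than the 1.4.2 used above, and the file paths here are hypothetical):

```python
from datasets import Audio, Dataset

dataset = Dataset.from_dict({"audio": ["path/to/file_1.wav", "path/to/file_2.wav"]})
# Decode and resample each file to 16 kHz on access.
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
```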
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.