
Possibility to access initial indices of the data during training

Open franz101 opened this issue 3 years ago • 6 comments

Feature request

Scenario 1: for a specific logging task, we need the indices of the data, so we save them in our dataset. We therefore want to disable remove_unused_columns before logging our data together with the indices.

Motivation

Right now this needs to be set in the trainer arguments beforehand. What is the best practice for logging indices of the dataset during the forward step?
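
For illustration, this is what "set beforehand" looks like today (a minimal sketch; model and train_ds are placeholders):

from transformers import Trainer, TrainingArguments

# Keep every dataset column (e.g. a saved "idx" column) in the batches instead of
# letting the Trainer drop the columns the model's forward() does not accept.
args = TrainingArguments(
    output_dir="out",
    remove_unused_columns=False,
)

trainer = Trainer(
    model=model,             # placeholder: any PreTrainedModel
    args=args,
    train_dataset=train_ds,  # placeholder: a dataset that carries an "idx" column
)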

Your contribution

I can submit an integration of a logger with a more data-centric focus than just logging training performance metrics.

franz101 avatar Sep 16 '22 02:09 franz101

Hi there! The dataset attributes of the Trainer are never modified, so they always retain all their columns. You therefore don't need to change the training arguments (which is something the Trainer is not allowed to do, by the way, otherwise its logs would not be accurate and we couldn't reproduce the same results easily).
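
For illustration (model, args and train_ds are placeholders), the dataset you pass in can still be indexed by row at any point:

trainer = Trainer(model=model, args=args, train_dataset=train_ds)

print(trainer.train_dataset.column_names)  # still contains e.g. "text", "idx", ...
print(trainer.train_dataset[0])            # the original row, with all its columns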

sgugger avatar Sep 16 '22 11:09 sgugger

Thanks Sylvain @sgugger for your quick reply.

To clarify the issue when writing a callback integration for data monitoring: for some monitoring, the embeddings and the original row indices (to identify the text) need to be logged during on_step_end. An example implementation in PyTorch (see the bottom of the forward function):

def forward(self, x, attention_mask, idxs):
    """Model forward function."""
    embedding = self.feature_extractor(
        input_ids=x, attention_mask=attention_mask
    ).last_hidden_state[:, 0]

    emb = self.pre_step(embedding)
    emb = self.relu(emb)
    emb = self.dropout(emb)
    logits = self.classifier(emb)

    # The logging function that is moved to a callback.
    logging_function(embs=embedding, logits=logits, indices=idxs)

    return logits

Looking at the Trainer, if remove_unused_columns is enabled, the extra columns are dropped from the training dataset when the dataloader is built: https://github.com/huggingface/transformers/blob/16242e1bf07450c5dc39fe64fbc810c877455519/src/transformers/trainer.py#L844
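
For reference, a simplified sketch of what that step effectively does (not the actual Trainer code): columns that don't match a parameter of the model's forward signature are dropped.

import inspect

def drop_unused_columns(dataset, model):
    # Simplified illustration only: keep the columns whose names match parameters
    # of model.forward (plus label columns), drop everything else ("text", "idx", ...).
    accepted = set(inspect.signature(model.forward).parameters) | {"label", "labels"}
    ignored = [name for name in dataset.column_names if name not in accepted]
    return dataset.remove_columns(ignored)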

So the feature I'm trying to describe is to give integrations these logging capabilities while requiring the least amount of parameter changes from the user. I couldn't find many discussions about preserving or accessing the initial indices during the forward step, except this one: https://discuss.pytorch.org/t/how-does-one-obtain-indicies-from-a-dataloader/16847/7

Capturing the embeddings or logits is straightforward with the register_forward_hook API, though.
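
A minimal sketch of that hook approach (model.feature_extractor is an assumption based on the snippet above):

captured = {}

def save_cls_embedding(module, inputs, output):
    # Keep the [CLS] embedding of the current batch so a callback can log it later.
    captured["embedding"] = output.last_hidden_state[:, 0].detach()

handle = model.feature_extractor.register_forward_hook(save_cls_embedding)
# ... training runs; a callback reads captured["embedding"] in on_step_end ...
handle.remove()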

franz101 avatar Sep 16 '22 16:09 franz101

I am very confused as to how letting the callback change the training arguments would help in this instance. By the time you arrive at the model, the dataloader has been built. So extra args have been removed (or not) and changing the training arguments won't do anything.

sgugger avatar Sep 16 '22 16:09 sgugger

Yes, I see. The initial thought was that the user would only need to populate the report_to flag. But as you said, in terms of access and ordering, adding or accessing indices of the data is not possible at the callback level without modifying the dataset itself beforehand. It's necessary to set remove_unused_columns to False and use the data collator fn to deal with the indices. Am I correct? I hope this clears up some of the confusion :D
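
A sketch of what I mean by handling the indices in the collator (the class name and the "idx" column are just illustrations):

from transformers import DataCollatorWithPadding

class IndexAwareCollator:
    """Pops non-model columns before padding and stashes the row indices for logging."""

    def __init__(self, tokenizer):
        self.inner = DataCollatorWithPadding(tokenizer)
        self.last_indices = None  # a logging callback can read this after each step

    def __call__(self, features):
        self.last_indices = [f.pop("idx") for f in features]
        for f in features:
            f.pop("text", None)  # drop raw strings the model cannot consume
        return self.inner(features)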

franz101 avatar Sep 16 '22 16:09 franz101

Most likely the model itself would need to deal with them, from what you shared. Normally data collators just collate what they get (as long as it's of a "collatable" type). As you also pointed out, you can use a forward hook without needing to rewrite the model class.

sgugger avatar Sep 16 '22 16:09 sgugger

Though if I modify the dataset to add the indices:

row_len = len(ds["train"])
ds["train"] = ds["train"].add_column("idx", list(range(row_len)))

the unmodified forward function will throw an error, because the input is then: ['text', 'label', 'idx', 'input_ids', 'attention_mask']

instead of

['label', 'input_ids', 'attention_mask']

So, to get back to your reply, I will double-check whether I can pass a further parameter to the forward function via a forward hook.
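
One possible sketch, assuming a PyTorch version where forward pre-hooks support keyword arguments: the hook captures and removes "idx" before the unmodified forward() runs.

captured_indices = {}

def pop_idx(module, args, kwargs):
    # Grab the row indices for logging and strip them from the kwargs,
    # so the unmodified forward() never sees the unexpected "idx" argument.
    if "idx" in kwargs:
        captured_indices["last"] = kwargs.pop("idx")
    return args, kwargs

# Assumption: with_kwargs=True is only available in recent PyTorch releases.
handle = model.register_forward_pre_hook(pop_idx, with_kwargs=True)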

franz101 avatar Sep 16 '22 16:09 franz101

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Oct 16 '22 15:10 github-actions[bot]