transformers
Return attention_mask in FeatureExtractionPipeline output
Feature request
Return attention_mask as one output of the FeatureExtractionPipeline so that padding token embeddings can be ignored.
Who can help? @Narsil
Motivation
When using the FeatureExtractionPipeline to generate sentence embeddings, the pipeline tokenizes the raw input sentence internally. The output of the pipeline is a tensor of shape [1, seq_len, hidden_dim]; if the input is padded, seq_len equals the tokenizer's max_length or the longest sequence in the batch.
However, when mean pooling the individual token embeddings to obtain a sentence embedding, one may want to use the attention_mask to ignore the padding token embeddings (see the mean pooling example below). But FeatureExtractionPipeline does not return attention_mask as part of its output.
```python
import torch

# Mean pooling - take the attention mask into account for correct averaging
def mean_pooling(model_output, attention_mask):
    token_embeddings = model_output[0]  # first element of model_output contains all token embeddings
    input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
    sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
    sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
    return sum_embeddings / sum_mask
```
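For concreteness, here is a minimal sketch of the current behaviour described above; the model name is only an illustrative assumption, any encoder model behaves the same way.

```python
from transformers import pipeline

# illustrative model choice (assumption, not tied to this issue)
extractor = pipeline("feature-extraction", model="sentence-transformers/all-MiniLM-L6-v2")

features = extractor("This is a test sentence.", return_tensors=True)
# features has shape [1, seq_len, hidden_dim]; the attention_mask used
# internally for padding is not part of the returned output, so the
# mean_pooling function above cannot be applied without re-tokenizing.
```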
Your contribution
I can submit a pull request for this if it sounds good to you!
This doesn't seem like a use-case for the pipeline though. Since you want access to the processed inputs, you should just use the tokenizer and the model directly.
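A minimal sketch of the suggested tokenizer-plus-model approach, reusing the mean_pooling function above; the model name and sentences are only illustrative assumptions.

```python
import torch
from transformers import AutoModel, AutoTokenizer

# illustrative model choice (assumption); any encoder model works the same way
model_name = "sentence-transformers/all-MiniLM-L6-v2"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModel.from_pretrained(model_name)

sentences = ["This is an example sentence.", "Each sentence is converted."]
encoded = tokenizer(sentences, padding=True, truncation=True, return_tensors="pt")

with torch.no_grad():
    model_output = model(**encoded)

# encoded["attention_mask"] is available here, so padding embeddings can be ignored
sentence_embeddings = mean_pooling(model_output, encoded["attention_mask"])
```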
Your comment makes sense. Since my goal mostly aligns with the pipeline's existing functionality, I think I will subclass FeatureExtractionPipeline and make small modifications to achieve it. Feel free to close the issue. Thank you!
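One possible shape for such a subclass, assuming the pipeline's preprocess/_forward/postprocess structure; this is an untested sketch for illustration, not the change actually made.

```python
from transformers import FeatureExtractionPipeline

class FeatureExtractionWithMaskPipeline(FeatureExtractionPipeline):
    # keep the embeddings together with the attention_mask produced during preprocessing
    def _forward(self, model_inputs):
        model_outputs = self.model(**model_inputs)
        return {
            "embeddings": model_outputs[0],
            "attention_mask": model_inputs["attention_mask"],
        }

    def postprocess(self, model_outputs, return_tensors=False):
        if return_tensors:
            return model_outputs["embeddings"], model_outputs["attention_mask"]
        return model_outputs["embeddings"].tolist(), model_outputs["attention_mask"].tolist()
```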
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
Please note that issues that do not follow the contributing guidelines are likely to be ignored.