
Support for OrderedConstraints, TemplateConstraints and LiteralConstraints in force_words_ids

ruanchaves opened this issue 2 years ago

Feature request

As raised by @sijunhe in this blog post, the force_words_ids argument of the model.generate() method needs to be modified to support OrderedConstraints and TemplateConstraints.

In addition, there is a need for a LiteralConstraints subclass. This would enable generating exactly the same list of tokens given in the force_words_ids argument, which would in turn allow for the calculation of sentence perplexity across all language models in the library by making use of the attribute implemented in this PR.
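For reference, the current force_words_ids only guarantees that the given phrases appear somewhere in the output; a minimal sketch of its present usage (assuming gpt2, not the proposed constraint subclasses):

# Sketch of present force_words_ids usage (assuming gpt2): the forced phrase
# is guaranteed to appear, but its position and the surrounding tokens are
# still chosen by constrained beam search.
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

force_words_ids = tokenizer(["awesome library"], add_special_tokens=False).input_ids

inputs = tokenizer("Transformers is an", return_tensors="pt")
outputs = model.generate(
    **inputs,
    force_words_ids=force_words_ids,
    num_beams=4,          # constrained generation requires beam search
    max_new_tokens=20,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))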

Motivation

Currently, there is no standard way of calculating sentence perplexity, and implementing it requires a lot of boilerplate code that may not always work as intended. Third-party libraries such as lm-scorer, which implemented this functionality, are no longer maintained and do not support all language models in the library.

Your contribution

I would be interested in working on this PR, as I'm the maintainer of a third-party library (hashformers) that performs sentence perplexity calculations with the Transformers library.

ruanchaves avatar Jan 20 '23 15:01 ruanchaves

cc @gante

sgugger avatar Jan 20 '23 15:01 sgugger

Hi @ruanchaves 👋

I'm not sure whether I understand the issue you described above. Our generation methods return the sequence log probabilities, from which you can compute the sequence perplexity. What would be missing for your use case?
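As a rough sketch of that route (assuming gpt2 and greedy decoding), compute_transition_scores turns the stored generation scores into per-token log probabilities:

# Rough sketch: per-token log probabilities of a *generated* sequence
# (assuming gpt2 and greedy decoding).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tokenizer("I like this", return_tensors="pt")
outputs = model.generate(
    **inputs,
    max_new_tokens=5,
    return_dict_in_generate=True,
    output_scores=True,
)
# normalize_logits=True turns the raw scores into log probabilities.
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)
perplexity = torch.exp(-transition_scores.mean())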

Regarding force_words_ids, I'm reluctant to add more features there -- it has low usage and a high maintenance cost. I might reconsider my position here if I see more demand for further functionality :)

gante avatar Jan 24 '23 12:01 gante

Olá @gante !

I'm not sure whether I understand the issue you described above. Our generation methods return the sequence log probabilities, from which you can compute the sequence perplexity.

True, but I want the sequence log probabilities for a predefined sequence. I already have a sequence of tokens and I want the model to calculate its perplexity. I don't want the perplexity of a sequence generated through beam search or greedy search.

When lm_scorer was conceived, there was no straightforward way to do this with transformers:

# Return token probabilities (provide log=True to return log probabilities)
scorer.tokens_score("I like this package.")
# => (scores, ids, tokens)
# scores = [0.018321, 0.0066431, 0.080633, 0.00060745, 0.27772, 0.0036381]
# ids    = [40,       588,       428,      5301,       13,      50256]
# tokens = ["I",      "Ġlike",   "Ġthis",  "Ġpackage", ".",     "<|endoftext|>"]

Is this still the case? I hope you can point me in the right direction if new features were added since lm_scorer was released.

Regarding force_words_ids, I'm reluctant to add more features there -- it has low usage and a high maintenance cost. I might reconsider my position here if I see more demand for further functionality :)

I get it, but being able to calculate the perplexity of a predefined sequence sounds like an essential feature to me, regardless of where it is implemented.
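For concreteness, the boilerplate I currently rely on for a predefined sequence is roughly the following sketch (assuming gpt2; the model's loss is the mean token-level negative log-likelihood of the labels, so its exponential is the sentence perplexity):

# Rough sketch: scoring a *predefined* sequence with a plain forward pass
# (assuming gpt2). The returned loss is the mean negative log-likelihood of
# the labels, so exp(loss) is the sentence perplexity.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")

input_ids = tokenizer("I like this package.", return_tensors="pt").input_ids
with torch.no_grad():
    outputs = model(input_ids, labels=input_ids)
perplexity = torch.exp(outputs.loss)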

ruanchaves avatar Jan 24 '23 17:01 ruanchaves

Hey @ruanchaves 👋

Yeah, we lack an easy interface to compute the logits of existing sentences, and that's something I'd really like to add ASAP! I'm planning to add it within the next month, but if you'd like to give me a hand you'd be more than welcome 🙌

The planned interface is

log_scores = model.compute_token_scores(tokens, normalize_logits)

where tokens is the tokenized input (so it can be used across modalities) and normalize_logits is an optional boolean (defaulting to True) that controls whether the model logits are renormalized.
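A rough sketch of what such a helper could do internally for a causal LM (hypothetical, since the method does not exist in the library yet):

# Hypothetical sketch of what compute_token_scores could do internally for a
# causal LM (not an existing library method): shift the labels by one
# position and gather the per-token log probabilities.
import torch
import torch.nn.functional as F

def compute_token_scores(model, tokens, normalize_logits=True):
    with torch.no_grad():
        logits = model(tokens).logits[:, :-1, :]   # predictions for tokens[:, 1:]
    if normalize_logits:
        logits = F.log_softmax(logits, dim=-1)
    labels = tokens[:, 1:].unsqueeze(-1)
    return logits.gather(-1, labels).squeeze(-1)   # shape (batch, seq_len - 1)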

gante avatar Jan 24 '23 19:01 gante

@gante ,

Yeah, we lack an easy interface to compute the logits of existing sentences, and that's something I'd really like to add ASAP! I'm planning to add it within the next month, but if you'd like to give me a hand you'd be more than welcome 🙌

Good! This would close the issue for me, as it's what I'm actually looking for. I'll be watching your PRs to see if I can contribute somehow.

Suggestion: consider adding the compute_token_scores method to masked language models as well. This was implemented a few years ago in awslabs/mlm-scoring, but, just like lm-scorer, it is no longer maintained.
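For masked language models the usual substitute for perplexity is a pseudo-log-likelihood, obtained by masking one position at a time; a rough sketch of that approach (assuming bert-base-uncased, and only illustrative of what mlm-scoring does):

# Rough sketch of pseudo-log-likelihood scoring for a masked LM (assuming
# bert-base-uncased): mask one position at a time and score the true token.
import torch
import torch.nn.functional as F
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

input_ids = tokenizer("I like this package.", return_tensors="pt").input_ids
log_probs = []
for i in range(1, input_ids.shape[1] - 1):         # skip [CLS] and [SEP]
    masked = input_ids.clone()
    masked[0, i] = tokenizer.mask_token_id
    with torch.no_grad():
        logits = model(masked).logits
    log_probs.append(F.log_softmax(logits[0, i], dim=-1)[input_ids[0, i]])
pseudo_perplexity = torch.exp(-torch.stack(log_probs).mean())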

ruanchaves avatar Jan 25 '23 11:01 ruanchaves

Have there been updates on the implementations of OrderedConstraints and TemplateConstraints? I find myself needing both.

ahmed-moubtahij avatar Jan 04 '24 18:01 ahmed-moubtahij

Hi @Ayenem 👋

No developments; our team doesn't have the bandwidth to expand Constraints at the moment :)

gante avatar Jan 10 '24 15:01 gante

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

github-actions[bot] avatar Feb 04 '24 08:02 github-actions[bot]