smashed
Add PrefixSuffix multiseq mappers to prepend/append tokens
Hi! With the introduction of Smashed, munging datasets of long documents is going to be a lot more fun :)
This draft PR is simply to explore the idea below. It does the following:
- It adds a new `CustomTokensSequencePaddingMapper`, with corresponding classes for `type_ids` and `attention_mask`, that allows wrapping the sentences with custom ids or strings.
- It abstracts away the `SequencePaddingMapper` to do the general job of adding prefix/suffix tokens depending on the sentence number.
In (1), we might want to prepend strings because text-to-text models like T5 expect inputs to have task prefixes. And we might want to wrap sentences with custom special token ids (e.g., special tokens added via the tokenizer) to indicate the type of sentences in the dataset column.
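To make the two use cases concrete, here is a small illustration (the function names below are just for this sketch, not actual smashed classes): prepending a string task prefix for T5-style inputs, and bounding a `token_ids` sequence with custom special ids.

```python
def prepend_task_prefix(text: str, prefix: str = "summarize: ") -> str:
    # T5-style models expect inputs like "summarize: <document>";
    # the prefix value here is only an example.
    return prefix + text


def wrap_with_special_ids(token_ids, start_id: int, end_id: int) -> list:
    # Bound a sequence with custom special token ids (e.g. ids of
    # tokenizer-added special tokens) to mark the sentence type.
    return [start_id] + list(token_ids) + [end_id]


print(prepend_task_prefix("the quick brown fox ..."))
print(wrap_with_special_ids([5, 6, 7], start_id=0, end_id=1))
```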
Because of (2), subclasses no longer have to implement the `transform` function; they only define what prefix/suffix tokens to add.
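The inheritance structure I have in mind looks roughly like this (a minimal sketch with illustrative names, not the actual smashed API): the base mapper implements `transform` once, and subclasses only declare the tokens to prepend/append.

```python
class PrefixSuffixMapper:
    """Base class: wraps each sequence with prefix/suffix tokens."""

    def get_prefix(self, seq_index: int) -> list:
        # Default: nothing to prepend; subclasses override.
        return []

    def get_suffix(self, seq_index: int) -> list:
        # Default: nothing to append; subclasses override.
        return []

    def transform(self, sequences: list) -> list:
        # The shared logic lives here once, so subclasses need not
        # re-implement transform. The sequence index is passed through
        # so prefix/suffix can depend on the sentence number.
        return [
            self.get_prefix(i) + seq + self.get_suffix(i)
            for i, seq in enumerate(sequences)
        ]


class CustomTokensMapper(PrefixSuffixMapper):
    """Subclass: only defines which tokens bound each sequence."""

    def __init__(self, bos_id: int, eos_id: int):
        self.bos_id = bos_id
        self.eos_id = eos_id

    def get_prefix(self, seq_index: int) -> list:
        return [self.bos_id]

    def get_suffix(self, seq_index: int) -> list:
        return [self.eos_id]


mapper = CustomTokensMapper(bos_id=0, eos_id=1)
print(mapper.transform([[5, 6], [7]]))  # [[0, 5, 6, 1], [0, 7, 1]]
```

This is the trade-off mentioned below: the wrapping logic is written once in the base class, at the cost of one extra level of inheritance.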
On the pro side, it reduces code duplication (especially considering new CustomPadding classes) and unifies the classes. But on the con side, we now have one more level of inheritance...
If some variation of this proposal fits, I can add docs and tests.
@soldni