Issue with span_corruption preprocessors
Hi, I'm trying to pretrain ByT5 on a custom corpus of short texts, but I'm stuck on the data pipeline (the code is below). When I decode the outputs, the inputs and targets each contain text merged from different examples, and both appear noised.
import functools

import seqio
import t5.data

# Byte-level vocabulary for both features.
DEFAULT_OUTPUT_FEATURES = {
    "inputs": seqio.Feature(vocabulary=seqio.ByteVocabulary(), add_eos=True),
    "targets": seqio.Feature(vocabulary=seqio.ByteVocabulary(), add_eos=True),
}
MEAN_NOISE_SPAN_LENGTH = 5
SEQUENCE_LENGTH = {"inputs": 128, "targets": 128}

seqio.TaskRegistry.add(
    name="nelma_byt5",
    source=seqio.TextLineDataSource(split_to_filepattern={
        "train": "/disk1/projekti/mondodb_lm/test.tsv",
    }),
    preprocessors=[
        # Each TSV line holds a text column and a class column.
        functools.partial(
            t5.data.preprocessors.parse_tsv,
            field_names=['text', 'class'],
            field_delim='\t',
        ),
        # Keep only the text column, as "targets"; "inputs" is left empty here.
        functools.partial(
            seqio.preprocessors.rekey,
            key_map={"inputs": None, "targets": "text"},
        ),
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        # Builds the noised "inputs" and "targets" from the tokenized text.
        functools.partial(
            t5.data.preprocessors.span_corruption,
            mean_noise_span_length=MEAN_NOISE_SPAN_LENGTH,
        ),
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=DEFAULT_OUTPUT_FEATURES,
    metric_fns=[],
)
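For reference when reading the decoded output: as far as the t5 source suggests, `span_corruption` concatenates tokenized examples and re-splits them into chunks before noising, and it writes sentinels into both features (inputs keep the non-noise text, targets keep the noise spans). The toy sketch below illustrates only that sentinel layout, on characters with a hand-picked mask. The names `replace_spans` and `SENTINELS`, the `<X>`/`<Y>` strings, and the mask are hypothetical; this is not the t5 implementation.

```python
from typing import List

# Hypothetical stand-ins for the sentinel ids a real vocabulary would supply.
SENTINELS = ["<X>", "<Y>", "<Z>"]


def replace_spans(text: str, mask: List[bool]) -> str:
    """Replace each run of masked characters with the next sentinel."""
    assert len(text) == len(mask)
    out = []
    span = 0           # index of the next sentinel to emit
    prev_masked = False
    for ch, masked in zip(text, mask):
        if masked:
            if not prev_masked:
                # First character of a masked run: emit one sentinel.
                out.append(SENTINELS[span])
                span += 1
        else:
            out.append(ch)
        prev_masked = masked
    return "".join(out)


text = "Thank you for inviting me"
# Hand-picked noise mask covering "for inviting"; it starts with non-noise.
noise_mask = [False] * 10 + [True] * 12 + [False] * 3

# Inputs: noise spans become sentinels.
inputs = replace_spans(text, noise_mask)
# Targets: the complement, i.e. non-noise spans become sentinels.
targets = replace_spans(text, [not m for m in noise_mask])

print(inputs)
print(targets)
```

In this layout both features contain sentinels, so seeing "noise" in the decoded inputs and in the decoded targets does not by itself indicate a broken pipeline.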