Issue with span_corruption preprocessors

Open mondonomo opened this issue 3 years ago • 0 comments

Hi, I'm trying to pretrain ByT5 on a custom corpus of short texts, but I'm stuck on the data pipeline (code below). When I decode the outputs, both the inputs and the targets contain text merged from different examples, and both are noised.

import functools

import seqio
import t5.data.preprocessors

DEFAULT_OUTPUT_FEATURES = {
    "inputs": seqio.Feature(vocabulary=seqio.ByteVocabulary(), add_eos=True),
    "targets": seqio.Feature(vocabulary=seqio.ByteVocabulary(), add_eos=True),
}

MEAN_NOISE_SPAN_LENGTH = 5
SEQUENCE_LENGTH = {"inputs": 128, "targets": 128}


seqio.TaskRegistry.add(
    name="nelma_byt5",
    source=seqio.TextLineDataSource(
        split_to_filepattern={
            "train": "/disk1/projekti/mondodb_lm/test.tsv",
        }
    ),
    preprocessors=[
        # Split each TSV line into its 'text' and 'class' columns.
        functools.partial(
            t5.data.preprocessors.parse_tsv,
            field_names=["text", "class"],
            field_delim="\t",
        ),
        # Use the raw text as "targets"; span_corruption derives "inputs" from it.
        functools.partial(
            seqio.preprocessors.rekey,
            key_map={"inputs": None, "targets": "text"},
        ),
        seqio.preprocessors.tokenize,
        seqio.CacheDatasetPlaceholder(),
        functools.partial(
            t5.data.preprocessors.span_corruption,
            mean_noise_span_length=MEAN_NOISE_SPAN_LENGTH,
        ),
        seqio.preprocessors.append_eos_after_trim,
    ],
    output_features=DEFAULT_OUTPUT_FEATURES,
    metric_fns=[],
)
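
To make the symptom concrete, here is a toy sketch in plain Python (no seqio or t5 dependency) of what I think may be happening. It assumes that span_corruption concatenates short examples into one token stream and splits that stream into fixed-length chunks before noising; I believe this is controlled by a `merge_examples_to_reduce_padding` argument, but that should be checked against the installed t5 version. The helper names `concat_and_split` and `corrupt` and the `<0>` sentinel strings are hypothetical, and the target format is simplified compared to the real preprocessor.

```python
from typing import List, Sequence, Tuple


def concat_and_split(examples: Sequence[str], max_len: int) -> List[str]:
    """Join all examples into one stream, then cut it into max_len chunks."""
    stream = "".join(examples)
    return [stream[i:i + max_len] for i in range(0, len(stream), max_len)]


def corrupt(text: str, spans: Sequence[Tuple[int, int]]) -> Tuple[str, str]:
    """Replace each [start, end) span with a sentinel in the inputs and
    move the removed characters, prefixed by that sentinel, to the targets."""
    inputs, targets, pos = [], [], 0
    for n, (start, end) in enumerate(spans):
        sentinel = f"<{n}>"  # hypothetical sentinel marker
        inputs.append(text[pos:start] + sentinel)
        targets.append(sentinel + text[start:end])
        pos = end
    inputs.append(text[pos:])
    return "".join(inputs), "".join(targets)


# Three short "examples" end up sharing chunks once they are concatenated.
chunks = concat_and_split(["ab", "cde", "fg"], max_len=4)
print(chunks)
# The first chunk mixes characters from the first two examples.
print(corrupt(chunks[0], [(1, 3)]))
```

In this toy version a single chunk contains pieces of several source lines, and both the inputs and the targets carry sentinel markers. That matches what I see when decoding, so the output may be expected behaviour rather than a bug in my pipeline.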

May 15 '22 15:05 mondonomo