transfer-learning-conv-ai

Batchwise padding dataset

Open mrghofrani opened this issue 3 years ago • 0 comments

Hello, I'm pretty new to PyTorch, so sorry if this question is too simple. Because of memory limits, I can't pad my whole dataset at once. What is the simplest way to move the pad_dataset function into the training process? In other words, how can I pad the data one batch at a time? For ease of reference, I've included pad_dataset below. Thanks.

def pad_dataset(dataset, padding=0):
    """ Pad the dataset. This could be optimized by defining a Dataset class and padding at the batch level, but this is simpler. """
    max_l = max(len(x) for x in dataset["input_ids"])
    for name in PADDED_INPUTS:
        dataset[name] = [x + [padding if name != "lm_labels" else -100] * (max_l - len(x)) for x in dataset[name]]
    return dataset
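
For reference, the batch-level padding mentioned in the docstring could look roughly like the sketch below. It is a minimal illustration, not the repository's code: the name pad_batch is hypothetical, the value of PADDED_INPUTS is assumed, and each example is assumed to be a dict of flat token lists (a layout with several candidates per example would need an extra dimension).

```python
# Assumption: the names of the fields that need padding; adjust to your setup.
PADDED_INPUTS = ["input_ids", "lm_labels", "token_type_ids"]


def pad_batch(batch, padding=0):
    """Pad one batch (a list of per-example dicts) to that batch's longest input.

    Hypothetical helper: same logic as pad_dataset, applied per batch
    instead of to the whole dataset.
    """
    max_l = max(len(example["input_ids"]) for example in batch)
    padded = {}
    for name in PADDED_INPUTS:
        # lm_labels are padded with -100, as in pad_dataset.
        fill = -100 if name == "lm_labels" else padding
        padded[name] = [
            example[name] + [fill] * (max_l - len(example[name]))
            for example in batch
        ]
    return padded


# Toy batch with two examples of different lengths.
batch = [
    {"input_ids": [5, 6, 7], "lm_labels": [-100, 6, 7], "token_type_ids": [1, 1, 2]},
    {"input_ids": [8], "lm_labels": [8], "token_type_ids": [1]},
]
padded = pad_batch(batch)
```

A function of this shape can be passed as the collate_fn argument of torch.utils.data.DataLoader, converting each padded list with torch.tensor before returning it, so that only one batch is padded in memory at a time.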

mrghofrani, Jul 11 '22 18:07