minimal-text-diffusion

Training on a large dataset is not working.

Open mainpyp opened this issue 2 years ago • 3 comments

Hey there, 😊 I want to train the DLM with a larger dataset (~17,000,000 sentences), and this completely blows up my memory even though I have 200 GB available. As far as I can tell, the whole training set is tokenized up front, which causes the problem. Is there already a solution for this, or are you aware of the problem? This line causes the problem.
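
For context, the pattern that blows up looks roughly like this (an illustrative sketch, not the repo's exact code): every sentence is tokenized when the dataset is built, so all ~17M tokenized sentences sit in memory at once.

    from torch.utils.data import Dataset

    # Illustrative sketch only, not the repo's exact code.
    class EagerTextDataset(Dataset):
        def __init__(self, texts, tokenizer):
            self.tokenizer = tokenizer
            # this single pass materializes input_ids for the whole corpus in RAM
            self.input_ids = [tokenizer(t)["input_ids"] for t in texts]

        def __len__(self):
            return len(self.input_ids)

        def __getitem__(self, i):
            return {"input_ids": self.input_ids[i]}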

mainpyp · Mar 02 '23, 14:03

Move the tokenisation to __getitem__

thorinf · May 11 '23, 09:05

This runs the tokenizer in getitem, saving a bunch of memory. Then the length cropping should also work, so that it can assign to tokens[i, :length] correctly if you have long sentences. There is a way to load the text in the getitem which saves even more memory, but it may not be necessary; let me know if you need it.

    def __getitem__(self, i):
        # tokenize lazily, one example at a time, instead of storing every tokenized sentence up front
        encoded_input = self.tokenizer(self.text[i])
        input_ids = encoded_input["input_ids"]
        out_dict = {
            "input_ids": input_ids,
            # "input_ids": self.input_ids[i],
            # "attention_mask": [1] * len(self.input_ids[i]),
        }
        if hasattr(self, "labels"):
            out_dict["label"] = self.labels[i]
        return out_dict

    @staticmethod
    def collate_pad(batch, cutoff: int):
        max_token_len = 0
        num_elems = len(batch)
        # batch[i] is the out_dict returned by __getitem__ above

        for i in range(num_elems):
            max_token_len = max(max_token_len, len(batch[i]["input_ids"]))

        max_token_len = min(cutoff, max_token_len)

        tokens = torch.zeros(num_elems, max_token_len).long()
        tokens_mask = torch.zeros(num_elems, max_token_len).long()

        has_labels = False
        if "label" in batch[0]:
            labels = torch.zeros(num_elems).long()
            has_labels = True

        for i in range(num_elems):
            toks = batch[i]["input_ids"]
            # crop to the cutoff before assigning, so the slice and the tensor always line up
            length = min(len(toks), max_token_len)
            tokens[i, :length] = torch.LongTensor(toks[:length])
            tokens_mask[i, :length] = 1
            if has_labels:
                labels[i] = batch[i]["label"]

        # TODO: the first return None is just for backward compatibility -- can be removed
        if has_labels:
            return None, {"input_ids": tokens, "attention_mask": tokens_mask, "labels": labels}
        else:
            return None, {"input_ids": tokens, "attention_mask": tokens_mask}

thorinf · May 11 '23, 21:05

Can you tell me if you get this error when dealing with long sequences: “RuntimeError: The size of tensor a (512) must match the size of tensor b (549) at non-singleton dimension 1”?
I changed my getitem code to yours and it didn't work. If you have also encountered this problem, please reply to me.
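
If it helps, my current reading of the message (an illustrative sketch, not code from this repo) is that a 549-token sequence reaches a tensor that only has room for 512 positions, so an element-wise op cannot broadcast along the sequence dimension:

    import torch

    # Illustrative guess at the cause, not the repo's code: a batch containing a
    # 549-token sequence meets a tensor sized for 512 positions (for example,
    # positional embeddings), so broadcasting fails at dimension 1.
    hidden = torch.zeros(2, 549, 128)   # (batch, seq_len, hidden)
    pos_emb = torch.zeros(1, 512, 128)  # only 512 positions available
    out = hidden + pos_emb              # RuntimeError: sizes 549 and 512 must match
                                        # at non-singleton dimension 1

So I assume the sequences still need to be truncated to 512 somewhere, either via the tokenizer's max length or the collate cutoff.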

Thank you very much.

JarvanW5 · May 13 '23, 14:05