minimal-text-diffusion

Training on a large dataset is not working.

Open mainpyp opened this issue 2 years ago • 3 comments

Hey there, 😊 I want to train the DLM with a larger dataset (~17,000,000 sentences), and this completely blows up my memory even though I have 200 GB available. As far as I can tell, the whole training set is tokenized up front, which causes the problem. Is there already a solution for this, or are you aware of the problem? This line causes the problem.
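
For context, the pattern that blows up looks roughly like this (an illustrative sketch, not the repo's exact code): every sentence is tokenized when the dataset is built, so all ~17M tokenized sentences sit in memory at once.

    from torch.utils.data import Dataset

    # Illustrative sketch only, not the repo's exact code.
    class EagerTextDataset(Dataset):
        def __init__(self, texts, tokenizer):
            self.tokenizer = tokenizer
            # this single pass materializes input_ids for the whole corpus in RAM
            self.input_ids = [tokenizer(t)["input_ids"] for t in texts]

        def __len__(self):
            return len(self.input_ids)

        def __getitem__(self, i):
            return {"input_ids": self.input_ids[i]}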

mainpyp · Mar 02 '23, 14:03

Move the tokenisation to __getitem__

thorinf · May 11 '23, 09:05

This runs the tokenizer in getitem, saving a bunch of memory. Then the length cropping should also work, so that it can assign to tokens[i, :length] correctly if you have long sentences. There is a way to load the text in the getitem which saves even more memory, but it may not be necessary; let me know if you need it.

    def __getitem__(self, i):
        # tokenize lazily, one example at a time, instead of storing every tokenized sentence up front
        encoded_input = self.tokenizer(self.text[i])
        input_ids = encoded_input["input_ids"]
        out_dict = {
            "input_ids": input_ids,
            # "input_ids": self.input_ids[i],
            # "attention_mask": [1] * len(self.input_ids[i]),
        }
        if hasattr(self, "labels"):
            out_dict["label"] = self.labels[i]
        return out_dict

    @staticmethod
    def collate_pad(batch, cutoff: int):
        max_token_len = 0
        num_elems = len(batch)
        # batch[i] is the out_dict returned by __getitem__ above

        for i in range(num_elems):
            max_token_len = max(max_token_len, len(batch[i]["input_ids"]))

        max_token_len = min(cutoff, max_token_len)

        tokens = torch.zeros(num_elems, max_token_len).long()
        tokens_mask = torch.zeros(num_elems, max_token_len).long()

        has_labels = False
        if "label" in batch[0]:
            labels = torch.zeros(num_elems).long()
            has_labels = True

        for i in range(num_elems):
            toks = batch[i]["input_ids"]
            # crop to the cutoff before assigning, so the slice and the tensor always line up
            length = min(len(toks), max_token_len)
            tokens[i, :length] = torch.LongTensor(toks[:length])
            tokens_mask[i, :length] = 1
            if has_labels:
                labels[i] = batch[i]["label"]

        # TODO: the first return None is just for backward compatibility -- can be removed
        if has_labels:
            return None, {"input_ids": tokens, "attention_mask": tokens_mask, "labels": labels}
        else:
            return None, {"input_ids": tokens, "attention_mask": tokens_mask}

thorinf · May 11 '23, 21:05

Can you tell me if you get this error when dealing with long sequences: “RuntimeError: The size of tensor a (512) must match the size of tensor b (549) at non-singleton dimension 1”?
I changed my getitem code to yours and it didn't work. If you have also encountered this problem, please reply to me.
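
If it helps, my current reading of the message (an illustrative sketch, not code from this repo) is that a 549-token sequence reaches a tensor that only has room for 512 positions, so an element-wise op cannot broadcast along the sequence dimension:

    import torch

    # Illustrative guess at the cause, not the repo's code: a batch containing a
    # 549-token sequence meets a tensor sized for 512 positions (for example,
    # positional embeddings), so broadcasting fails at dimension 1.
    hidden = torch.zeros(2, 549, 128)   # (batch, seq_len, hidden)
    pos_emb = torch.zeros(1, 512, 128)  # only 512 positions available
    out = hidden + pos_emb              # RuntimeError: sizes 549 and 512 must match
                                        # at non-singleton dimension 1

So I assume the sequences still need to be truncated to 512 somewhere, either via the tokenizer's max length or the collate cutoff.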

Thank you very much.

JarvanW5 · May 13 '23, 14:05