minimal-text-diffusion
Training on a large dataset is not working.
Hey there, 😊 I want to train the DLM on a larger dataset of ~17 million sentences, and this completely blows up my memory even though I have 200 GB available. As far as I can tell, the whole training set is tokenized up front, which causes the problem. Is there already a solution for this, or are you aware of the problem? This line causes the problem.
Move the tokenisation to __getitem__
This runs the tokenizer inside __getitem__, which saves a lot of memory. The length cropping should then also work, so that tokens[i, :length] can be assigned correctly when you have long sentences. There is also a way to load the text itself lazily in __getitem__, which saves even more memory (see the sketch after the snippet below), but it may not be necessary; let me know if you need it.
def __getitem__(self, i):
    # Tokenize lazily, one example at a time, instead of up front.
    encoded_input = self.tokenizer(self.text[i])
    input_ids = encoded_input["input_ids"]
    out_dict = {
        "input_ids": input_ids,
        # "input_ids": self.input_ids[i],
        # "attention_mask": [1] * len(self.input_ids[i]),
    }
    if hasattr(self, "labels"):
        out_dict["label"] = self.labels[i]
    return out_dict
@staticmethod
def collate_pad(batch, cutoff: int):
    max_token_len = 0
    num_elems = len(batch)
    # batch[0] -> __getitem__[0] --> returns a tuple (embeddings, out_dict)
    for i in range(num_elems):
        max_token_len = max(max_token_len, len(batch[i]["input_ids"]))
    max_token_len = min(cutoff, max_token_len)
    tokens = torch.zeros(num_elems, max_token_len).long()
    tokens_mask = torch.zeros(num_elems, max_token_len).long()
    has_labels = False
    if "label" in batch[0]:
        labels = torch.zeros(num_elems).long()
        has_labels = True
    for i in range(num_elems):
        toks = batch[i]["input_ids"]
        length = len(toks)
        tokens[i, :length] = torch.LongTensor(toks if length <= max_token_len else toks[:max_token_len])
        tokens_mask[i, :length] = 1
        if has_labels:
            labels[i] = batch[i]["label"]
    # TODO: the first return None is just for backward compatibility -- can be removed
    if has_labels:
        return None, {"input_ids": tokens, "attention_mask": tokens_mask, "labels": labels}
    else:
        return None, {"input_ids": tokens, "attention_mask": tokens_mask}
Can you tell me if you get this error when dealing with long sequences: “RuntimeError: The size of tensor a (512) must match the size of tensor b (549) at non-singleton dimension 1”?
I changed my __getitem__ code to yours and it still didn't work. If you have also run into this problem, please reply to me.
Thank you very much.
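A 512-vs-549 mismatch at dimension 1 usually means a tokenized sequence longer than the model's maximum input length (512 positions for many BERT-style encoders) made it into the model, with 549 being the length of one long example. Since the __getitem__ above tokenizes without truncation, one possible fix is to truncate at tokenization time. This is only a sketch, assuming a Hugging Face tokenizer and a 512-token model limit; the cutoff passed to collate_pad should also stay at or below that limit:

def __getitem__(self, i):
    # Truncate at tokenization time so no sequence exceeds the model's
    # maximum length. 512 is an assumption here -- use your model's actual limit.
    encoded_input = self.tokenizer(self.text[i], truncation=True, max_length=512)
    out_dict = {"input_ids": encoded_input["input_ids"]}
    if hasattr(self, "labels"):
        out_dict["label"] = self.labels[i]
    return out_dict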