CodeParrot: Iterable Dataset Questions
Hello @lvwerra,
I have a few questions about the iterable dataset class in the Code Parrot blog post.
- How is `num_of_sequences` chosen?
- If I am using a `seq_length` of 2048, should `self.input_characters = 2048 * 3.6 * 2048`?
- Does `num_of_sequences` need to match `seq_length`?
- If I am using a larger dataset such as the Pile (~850 GB), should `num_of_sequences` and `chars_per_token` be higher?
- Is there an exact way to calculate the parameter for `num_of_sequences`, similar to how `chars_per_token` was estimated?
- Would setting `drop_last=True` in the data loader help to prevent the grey discarded tokens shown in the example chart?
Thank you for the help,
Enrico
Example code:
```python
import logging

import torch
from datasets import load_dataset
from torch.utils.data import DataLoader, IterableDataset
from transformers import GPT2Tokenizer

logger = logging.getLogger(__name__)

tokenizer = GPT2Tokenizer(vocab_file='/token/vocab.json', merges_file='/token/merges.txt')


class ConstantLengthDataset(IterableDataset):
    def __init__(
        self, tokenizer, dataset, infinite=False, seq_length=1024, num_of_sequences=1024, chars_per_token=3.6
    ):
        self.tokenizer = tokenizer
        self.concat_token_id = tokenizer.bos_token_id
        self.dataset = dataset
        self.seq_length = seq_length
        self.input_characters = seq_length * chars_per_token * num_of_sequences
        self.epoch = 0
        self.infinite = infinite

    def __iter__(self):
        iterator = iter(self.dataset)
        more_examples = True
        while more_examples:
            # Fill a buffer with raw text until we have roughly enough characters.
            buffer, buffer_len = [], 0
            while True:
                if buffer_len >= self.input_characters:
                    break
                try:
                    buffer.append(next(iterator)["text"])
                    buffer_len += len(buffer[-1])
                except StopIteration:
                    if self.infinite:
                        iterator = iter(self.dataset)
                        self.epoch += 1
                        logger.info(f"Dataset epoch: {self.epoch}")
                    else:
                        more_examples = False
                        break
            # Tokenize the buffer and concatenate all documents with the separator token.
            tokenized_inputs = self.tokenizer(buffer, truncation=False)["input_ids"]
            all_token_ids = []
            for tokenized_input in tokenized_inputs:
                all_token_ids.extend(tokenized_input + [self.concat_token_id])
            # Slice the concatenated stream into constant-length sequences;
            # an incomplete final slice is dropped.
            for i in range(0, len(all_token_ids), self.seq_length):
                input_ids = all_token_ids[i : i + self.seq_length]
                if len(input_ids) == self.seq_length:
                    yield torch.tensor(input_ids)


train_data = load_dataset("the_pile", 'enron_emails', split="train", streaming=True)
# train_data = train_data.shuffle(buffer_size=args.shuffle_buffer, seed=args.seed)
train_dataset = ConstantLengthDataset(tokenizer, train_data, seq_length=2048)
train_dataloader = DataLoader(train_dataset, drop_last=True, batch_size=16)
```
cc @lvwerra, but some of these questions might go better in the forum :parrot:
Hi @conceptofmind,
1: `num_of_sequences` (the number of sequences preprocessed at a time) is an estimate of how many sequences of length `seq_length` (the length of the sequence fed into the model) we want to concatenate at a time. The smaller `num_of_sequences`, the higher the fraction of leftover tokens at the end that we throw away during preprocessing (the grey ones in the graphic). In the worst case the last chunk is 1023 tokens long, and we have to throw it away:

- if `num_of_sequences=10`, in the worst case we lose 10% of the data
- if `num_of_sequences=100`, in the worst case we lose 1% of the data
- if `num_of_sequences=1000`, in the worst case we lose 0.1% of the data

The last option seemed acceptable to us, but you can increase or decrease it to your needs (the quick check below spells out the arithmetic).
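A back-of-the-envelope sketch of those worst-case fractions, assuming `seq_length=1024` as above:

```python
# Worst case: the final chunk is seq_length - 1 tokens long and is discarded
# (assumed seq_length of 1024, matching the blog post's default).
seq_length = 1024
for num_of_sequences in (10, 100, 1000):
    total_tokens = seq_length * num_of_sequences
    max_discarded = seq_length - 1  # longest possible incomplete final chunk
    print(f"num_of_sequences={num_of_sequences}: up to {max_discarded / total_tokens:.1%} lost")
```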
2 & 3: No, they are independent, so you can increase one without changing the other.
4 & 5: the only parameter that depends a bit on your setup is `chars_per_token`: it describes how many characters are usually encoded in a single token. We computed it by tokenizing a bunch of texts and comparing the character length to the token length of the texts.
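A minimal sketch of that measurement (not the original code; it assumes the streaming dataset yields a `"text"` field and uses an arbitrary sample of 500 documents):

```python
from itertools import islice

def estimate_chars_per_token(tokenizer, dataset, n_examples=500, text_field="text"):
    """Heuristic estimate of chars_per_token: tokenize a sample of documents
    and compare the total character count to the total token count."""
    total_characters, total_tokens = 0, 0
    for example in islice(iter(dataset), n_examples):
        text = example[text_field]
        total_characters += len(text)
        total_tokens += len(tokenizer(text)["input_ids"])
    return total_characters / total_tokens

# e.g. estimate_chars_per_token(tokenizer, train_data) on the streaming dataset above
```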
6: I don't think so, as `drop_last` concerns the input batches and not the text samples. If you have 800 GB of text it would only affect the very last batch (if there are not enough samples left to fill a full batch at the end).
Note that this is just a heuristic: we estimate how many characters we roughly need to gather (`self.input_characters`) to end up with approximately `num_of_sequences` sequences, but in practice we sometimes get a few more or fewer.
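For concreteness, with the defaults from the snippet above (`seq_length=1024`, `chars_per_token=3.6`, `num_of_sequences=1024`), the character target per chunk works out to roughly 3.8M:

```python
seq_length, chars_per_token, num_of_sequences = 1024, 3.6, 1024
input_characters = seq_length * chars_per_token * num_of_sequences
print(input_characters)  # 3774873.6 -> buffer roughly 3.8M characters per chunk
```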
Hope this helps!
Hi @lvwerra ,
I greatly appreciate you taking the time to answer my questions. I found the CodeParrot blog post very insightful for causal language modeling. It is one of the best examples and explanations of streaming datasets with a constant sequence length.
In the blog post's `ConstantLengthDataset`, `tokenizer.bos_token_id` is used as the concatenation token in `self.concat_token_id = tokenizer.bos_token_id`. Should it be an `eos` token instead, or does using a `bos` token make no difference in training?
Additionally, I just purchased your textbook, Natural Language Processing with Transformers, and the iterable dataset code there is structured a bit differently:
```python
all_token_ids = []
tokenized_inputs = self.tokenizer(buffer, truncation=False)
for tokenized_input in tokenized_inputs["input_ids"]:
    for tokenized_input in tokenized_inputs:
        all_token_ids.extend(tokenized_input + [self.concat_token_id])
for i in range(0, len(all_token_ids), self.seq_length):
    input_ids = all_token_ids[i : i + self.seq_length]
    if len(input_ids) == self.seq_length:
        yield torch.tensor(input_ids)
```
Is this iterable dataset more up to date than the one in the CodeParrot blog post? Should there be a second for loop, `for tokenized_input in tokenized_inputs:`, after `for tokenized_input in tokenized_inputs["input_ids"]:`? Or should it only be `for tokenized_input in tokenized_inputs["input_ids"]:`?
If it is more appropriate to directly email you or message separately about this, I can edit and close this issue.
Again, I greatly appreciate your help.
Thank you,
Enrico