CodeParrot: Iterable Dataset Questions
Hello @lvwerra,
I have a few questions about the iterable dataset class in the Code Parrot blog post.
- How is `num_of_sequences` chosen?
- If I am using a `seq_length` of 2048, should `self.input_characters = 2048 * 3.6 * 2048`?
- Does `num_of_sequences` need to match `seq_length`?
- If I am using a larger dataset such as the Pile (~850 GB), should `num_of_sequences` and `chars_per_token` be higher?
- Is there an exact way to calculate the parameter for `num_of_sequences`, similar to how `chars_per_token` was estimated?
- Would setting `drop_last=True` in the data loader help to prevent the grey discarded tokens shown in the example chart?
Thank you for the help,
Enrico
Example code:
```python
import logging

import torch
from datasets import load_dataset
from torch.utils.data import DataLoader, IterableDataset
from transformers import GPT2Tokenizer

logger = logging.getLogger(__name__)

tokenizer = GPT2Tokenizer(vocab_file='/token/vocab.json', merges_file='/token/merges.txt')


class ConstantLengthDataset(IterableDataset):
    def __init__(
        self, tokenizer, dataset, infinite=False, seq_length=1024, num_of_sequences=1024, chars_per_token=3.6
    ):
        self.tokenizer = tokenizer
        self.concat_token_id = tokenizer.bos_token_id
        self.dataset = dataset
        self.seq_length = seq_length
        self.input_characters = seq_length * chars_per_token * num_of_sequences
        self.epoch = 0
        self.infinite = infinite

    def __iter__(self):
        iterator = iter(self.dataset)
        more_examples = True
        while more_examples:
            # Fill a buffer with raw text until we have roughly enough characters.
            buffer, buffer_len = [], 0
            while True:
                if buffer_len >= self.input_characters:
                    break
                try:
                    buffer.append(next(iterator)["text"])
                    buffer_len += len(buffer[-1])
                except StopIteration:
                    if self.infinite:
                        iterator = iter(self.dataset)
                        self.epoch += 1
                        logger.info(f"Dataset epoch: {self.epoch}")
                    else:
                        more_examples = False
                        break
            # Tokenize the buffer and concatenate all documents with the separator token.
            tokenized_inputs = self.tokenizer(buffer, truncation=False)["input_ids"]
            all_token_ids = []
            for tokenized_input in tokenized_inputs:
                all_token_ids.extend(tokenized_input + [self.concat_token_id])
            # Slice the concatenated stream into constant-length sequences;
            # an incomplete final slice is dropped.
            for i in range(0, len(all_token_ids), self.seq_length):
                input_ids = all_token_ids[i : i + self.seq_length]
                if len(input_ids) == self.seq_length:
                    yield torch.tensor(input_ids)


train_data = load_dataset("the_pile", 'enron_emails', split="train", streaming=True)
# train_data = train_data.shuffle(buffer_size=args.shuffle_buffer, seed=args.seed)
train_dataset = ConstantLengthDataset(tokenizer, train_data, seq_length=2048)
train_dataloader = DataLoader(train_dataset, drop_last=True, batch_size=16)
```
cc @lvwerra, but some of these questions might go better in the forum :parrot:
Hi @conceptofmind,
1: `num_of_sequences` (the number of sequences preprocessed at a time) is an estimate of how many sequences of length `seq_length` (the length of the sequence fed into the model) we want to concatenate at a time. The smaller `num_of_sequences`, the higher the fraction of leftover tokens at the end that we throw away during preprocessing (the grey ones in the graphic). In the worst case the last chunk is 1023 tokens long, and we have to throw it away:

- if `num_of_sequences=10`, in the worst case we lose 10% of the data
- if `num_of_sequences=100`, in the worst case we lose 1% of the data
- if `num_of_sequences=1000`, in the worst case we lose 0.1% of the data

The last option seemed acceptable to us, but you can increase or decrease it to your needs (the quick check below spells out the arithmetic).
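A back-of-the-envelope sketch of those worst-case fractions, assuming `seq_length=1024` as above:

```python
# Worst case: the final chunk is seq_length - 1 tokens long and is discarded
# (assumed seq_length of 1024, matching the blog post's default).
seq_length = 1024
for num_of_sequences in (10, 100, 1000):
    total_tokens = seq_length * num_of_sequences
    max_discarded = seq_length - 1  # longest possible incomplete final chunk
    print(f"num_of_sequences={num_of_sequences}: up to {max_discarded / total_tokens:.1%} lost")
```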
2 & 3: No, they are independent, so you can increase one without changing the other.
4 & 5: the only parameter that depends a bit on your setup is `chars_per_token`: it describes how many characters are usually encoded in a single token. We computed it by tokenizing a bunch of texts and comparing the character length to the token length of the texts.
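A minimal sketch of that measurement (not the original code; it assumes the streaming dataset yields a `"text"` field and uses an arbitrary sample of 500 documents):

```python
from itertools import islice

def estimate_chars_per_token(tokenizer, dataset, n_examples=500, text_field="text"):
    """Heuristic estimate of chars_per_token: tokenize a sample of documents
    and compare the total character count to the total token count."""
    total_characters, total_tokens = 0, 0
    for example in islice(iter(dataset), n_examples):
        text = example[text_field]
        total_characters += len(text)
        total_tokens += len(tokenizer(text)["input_ids"])
    return total_characters / total_tokens

# e.g. estimate_chars_per_token(tokenizer, train_data) on the streaming dataset above
```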
6: I don't think so, as `drop_last` concerns the input batches and not the text samples. If you have 800 GB of text it would only affect the very last batch (if there are not enough samples left to fill a full batch at the end).
Note that this is just a heuristic: we estimate how many characters we roughly need to gather (`self.input_characters`) to end up with approximately `num_of_sequences` sequences, but in practice we sometimes get a few more or fewer.
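For concreteness, with the defaults from the snippet above (`seq_length=1024`, `chars_per_token=3.6`, `num_of_sequences=1024`), the character target per chunk works out to roughly 3.8M:

```python
seq_length, chars_per_token, num_of_sequences = 1024, 3.6, 1024
input_characters = seq_length * chars_per_token * num_of_sequences
print(input_characters)  # 3774873.6 -> buffer roughly 3.8M characters per chunk
```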
Hope this helps!
Hi @lvwerra ,
I greatly appreciate you taking the time to answer my questions. I found the CodeParrot blog post very insightful for causal language modeling. It is one of the best examples and explanations of streaming datasets with a constant sequence length.
In the blog post's `ConstantLengthDataset`, `tokenizer.bos_token_id` is used as the concatenation token in `self.concat_token_id = tokenizer.bos_token_id`. Should it be an `eos` token instead, or does using a `bos` token make no difference in training?
Additionally, I just purchased your textbook, Natural Language Processing with Transformers, and the iterable dataset code there is structured a bit differently:
```python
all_token_ids = []
tokenized_inputs = self.tokenizer(buffer, truncation=False)
for tokenized_input in tokenized_inputs["input_ids"]:
    for tokenized_input in tokenized_inputs:
        all_token_ids.extend(tokenized_input + [self.concat_token_id])
for i in range(0, len(all_token_ids), self.seq_length):
    input_ids = all_token_ids[i : i + self.seq_length]
    if len(input_ids) == self.seq_length:
        yield torch.tensor(input_ids)
```
Is this iterable dataset more up to date than the one in the CodeParrot blog post? Should there be a second for loop, `for tokenized_input in tokenized_inputs:`, after `for tokenized_input in tokenized_inputs["input_ids"]:`? Or should it only be `for tokenized_input in tokenized_inputs["input_ids"]:`?
If it is more appropriate to directly email you or message separately about this, I can edit and close this issue.
Again, I greatly appreciate your help.
Thank you,
Enrico