LLMs-from-scratch
Inconsistencies between the code in the book and the notebooks (2.6 Data sampling with a sliding window)
Hi @rasbt,
I noticed that in the book you provide the following code, with the function name create_dataloader and the argument stride = max_length + 1 to avoid overlap in the data, even for the targets:

```python
dataloader = create_dataloader(raw_text, batch_size=8, max_length=4,
                               stride=5)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)
```
But in the cell of the Jupyter notebook with the main code (cell [43]) and the Jupyter notebook with only the dataloader (cell [2]), you use a function named create_dataloader_v1 and the argument stride = max_length.
Could you please tell me whether I understand correctly that we need to use stride = max_length + 1 to avoid overfitting? Does the overlap in the targets (when stride = max_length) seriously increase the risk of overfitting?
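For concreteness, here is a minimal plain-Python sketch of the sliding window (the token IDs and the helper function are made up for illustration; the slicing follows the GPTDatasetV1 pattern from the chapter), showing what the sampled chunks look like for both stride values:

```python
# Minimal sketch of the sliding window; token IDs are hypothetical stand-ins.
token_ids = list(range(20))  # stand-in for tokenizer.encode(raw_text)
max_length = 4

def windows(token_ids, max_length, stride):
    # Each input chunk is token_ids[i : i+max_length]; the target chunk
    # is the same window shifted right by one token.
    return [
        (token_ids[i:i + max_length], token_ids[i + 1:i + max_length + 1])
        for i in range(0, len(token_ids) - max_length, stride)
    ]

for stride in (max_length, max_length + 1):  # 4 vs. 5
    print(f"stride={stride}")
    for inputs, targets in windows(token_ids, max_length, stride)[:3]:
        print(" ", inputs, "->", targets)

# With stride=4 the input chunks tile the text without overlapping each
# other; with stride=5 one token between chunks never appears as an input.
```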
Thank you.
Good eye for detail. Actually, the +1 wasn't necessary, so I updated that a while back in the notebook and the manuscript. I think you are seeing the old +1 in the chapter because the Manning staff hasn't synced it with my recent manuscript yet. (It usually takes them a couple of weeks.)
Thank you @rasbt, but then the code in several cells of the notebook, like [50], should also use stride=max_length? Right now it has the +1.
Ah yes, big thanks for the follow-up! I think I may have missed one. I probably did a find+replace looking for stride=max_length+1 and then missed the one you quoted at the top (max_length=4, stride=5). Was there another place where you saw it?
Hi @rasbt,
I think that was the last place with such a case. Also, in my initial message I pointed out the difference between the function names (create_dataloader in the book vs. create_dataloader_v1 in some places in the notebooks) - is that ok?
Thank you.
> (create_dataloader in the book vs. create_dataloader_v1 in some places in the notebooks) - is that ok?
That's another place where the MEAP manuscript is a bit behind my private manuscript (+ code). The reason is that we are specifically using the class GPTDatasetV1(Dataset) inside that dataloader. I have different data loaders for cases where you can't load the whole dataset into memory; I will probably implement them as V2 etc. in the bonus materials.
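To give a rough idea, here is a minimal sketch of what such a streaming loader could look like, built on PyTorch's IterableDataset. The class name GPTDatasetStream, the file tokens.txt, and the one-token-ID-per-line format are assumptions made up for this sketch, not the actual V2 implementation:

```python
import torch
from torch.utils.data import DataLoader, IterableDataset


class GPTDatasetStream(IterableDataset):  # hypothetical name, not the book's V2
    """Yields (input, target) windows from a token file lazily,
    so the whole dataset never has to sit in memory at once."""

    def __init__(self, path, max_length, stride):
        self.path = path
        self.max_length = max_length
        self.stride = stride  # sketch assumes stride <= max_length + 1

    def __iter__(self):
        buffer = []
        with open(self.path) as f:  # assumes one token ID per line
            for line in f:
                buffer.append(int(line))
                # Emit a window as soon as enough tokens are buffered
                if len(buffer) >= self.max_length + 1:
                    inputs = torch.tensor(buffer[:self.max_length])
                    targets = torch.tensor(buffer[1:self.max_length + 1])
                    yield inputs, targets
                    buffer = buffer[self.stride:]  # slide the window forward


# Same call-signature idea as create_dataloader_v1, but streaming:
loader = DataLoader(GPTDatasetStream("tokens.txt", max_length=4, stride=4),
                    batch_size=8)
```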
If you are curious, also check out alternative dataloader iteration for pretraining here: https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/tinyllama.py. The TinyLlama team successfully used our code to train a model on 3 trillion tokens: https://arxiv.org/abs/2401.02385 | https://huggingface.co/TinyLlama
Thanks a lot for your explanation!
This notebook from Chapter 3 probably also has stride = max_length + 1, in cell [1]:

```python
max_length = 4
dataloader = create_dataloader(raw_text, batch_size=8, max_length=max_length, stride=5)
```