
Inconsistencies between the code in the book and the notebooks (2.6 Data sampling with a sliding window)

Open labdmitriy opened this issue 11 months ago • 7 comments

Hi @rasbt,

I noticed that in the book you provide the following code with the function name create_dataloader and the argument stride = max_length + 1 to avoid overlap in the data, even for the targets:

dataloader = create_dataloader(raw_text, batch_size=8, max_length=4,
stride=5)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)
print("Inputs:\n", inputs)
print("\nTargets:\n", targets)

But in the Jupyter notebook with the main code (cell [43]) and in the Jupyter notebook with only the dataloader (cell [2]), you use a function named create_dataloader_v1 and the argument stride = max_length.

Could you please tell me whether I understand correctly that we need to use stride = max_length + 1 to avoid overfitting? Does the overlap in the targets (when stride = max_length) seriously increase the risk of overfitting?
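To make the difference concrete, here is a simplified sketch of the sliding-window sampling (my own toy version, not the exact GPTDatasetV1 code from the book):

import torch
from torch.utils.data import Dataset

class SlidingWindowDataset(Dataset):
    # Toy sketch of sliding-window sampling; not the book's exact GPTDatasetV1
    def __init__(self, token_ids, max_length, stride):
        self.input_ids = []
        self.target_ids = []
        # Slide a window of size max_length over the token ids;
        # the targets are the inputs shifted right by one position
        for i in range(0, len(token_ids) - max_length, stride):
            self.input_ids.append(torch.tensor(token_ids[i:i + max_length]))
            self.target_ids.append(torch.tensor(token_ids[i + 1:i + max_length + 1]))

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        return self.input_ids[idx], self.target_ids[idx]

# Toy token ids 0..19, max_length=4
ds = SlidingWindowDataset(list(range(20)), max_length=4, stride=4)
print(ds[0])  # (tensor([0, 1, 2, 3]), tensor([1, 2, 3, 4]))
print(ds[1])  # (tensor([4, 5, 6, 7]), tensor([5, 6, 7, 8]))

With stride = max_length (4), the input chunks don't overlap, but the last target token of one chunk (here 4) reappears as the first input token of the next chunk. With stride = max_length + 1 (5), nothing is reused at all, but the token at each chunk boundary is then never seen as an input.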

Thank you.

labdmitriy avatar Mar 03 '24 19:03 labdmitriy

Good eye for detail. Actually, the +1 wasn't necessary, so I updated that a while back in the notebook and the manuscript. I think you are seeing the old +1 in the chapter because the Manning staff hasn't synced it with my recent manuscript yet. (It usually takes them a couple of weeks.)
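For reference, the updated call in the notebook looks roughly like this, with stride equal to max_length so that chunks neither overlap nor skip tokens:

dataloader = create_dataloader_v1(raw_text, batch_size=8, max_length=4, stride=4)
data_iter = iter(dataloader)
inputs, targets = next(data_iter)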

rasbt avatar Mar 03 '24 23:03 rasbt

Thank you @rasbt, but then shouldn't the code in the notebook in several cells, like [50], also contain stride=max_length? Right now it still has the +1.

labdmitriy avatar Mar 04 '24 01:03 labdmitriy

Ah yes, big thanks for the follow-up! I think I may have missed one. I probably did a find-and-replace looking for stride=max_length+1 and then missed the one you quoted at the top (max_length=4, stride=5). Was there another place where you saw it?

rasbt avatar Mar 04 '24 01:03 rasbt

Hi @rasbt,

I think that was the last place with this issue. I also pointed out the difference in function names in my initial message (create_dataloader in the book vs create_dataloader_v1 in some places in the notebooks) - is that expected?

Thank you.

labdmitriy avatar Mar 04 '24 04:03 labdmitriy

(create_dataloader in the book vs create_dataloader_v1 in some places in the notebooks) - is that expected?

That's another place where the MEAP manuscript is a bit behind my private manuscript (+ code). The reason for the V1 suffix is that we are specifically using the class GPTDatasetV1(Dataset) inside that dataloader. I also have different data loaders for cases where you can't load the whole dataset into memory; I will probably implement them as V2 etc. in the bonus materials.
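Just as a rough idea (not finalized code), such a V2-style loader could stream chunks lazily from a token-id file on disk via an IterableDataset instead of materializing everything in memory; the file name and dtype below are just placeholders:

import numpy as np
import torch
from torch.utils.data import IterableDataset, DataLoader

class GPTDatasetV2(IterableDataset):
    # Hypothetical streaming variant: yields (input, target) chunks lazily
    def __init__(self, token_id_file, max_length, stride):
        self.token_id_file = token_id_file  # placeholder path to a binary file of token ids
        self.max_length = max_length
        self.stride = stride

    def __iter__(self):
        # np.memmap keeps the data on disk and only pages in what is accessed
        token_ids = np.memmap(self.token_id_file, dtype=np.uint16, mode="r")
        for i in range(0, len(token_ids) - self.max_length, self.stride):
            x = torch.tensor(token_ids[i:i + self.max_length].astype(np.int64))
            y = torch.tensor(token_ids[i + 1:i + self.max_length + 1].astype(np.int64))
            yield x, y

# dataloader = DataLoader(GPTDatasetV2("train.bin", max_length=256, stride=256), batch_size=8)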

If you are curious, also check out alternative dataloader iteration for pretraining here: https://github.com/Lightning-AI/lit-gpt/blob/main/pretrain/tinyllama.py. The TinyLlama team successfully used our code to train a model on 3 trillion tokens: https://arxiv.org/abs/2401.02385 | https://huggingface.co/TinyLlama

rasbt avatar Mar 04 '24 14:03 rasbt

Thanks a lot for your explanation!

labdmitriy avatar Mar 04 '24 14:03 labdmitriy

This notebook from Chapter 3 probably also still has stride = max_length + 1 in cell [1]:

max_length = 4
dataloader = create_dataloader(raw_text, batch_size=8, max_length=max_length, stride=5)

labdmitriy avatar Mar 10 '24 11:03 labdmitriy