Exclude finetuning datasets from the `pretrain.py` arguments
The pretrain.py script lists the Alpaca dataset and all other finetuning datasets, but I don't think they are supported for pretraining.
E.g.,
python litgpt/pretrain.py \
--data litgpt.data.Alpaca2k \
--model_name tiny-llama-1.1b \
--tokenizer_dir checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T/
File "litgpt/pretrain.py", line 381, in <module>
CLI(setup)
File "/teamspace/studios/this_studio/lit-gpt/litgpt/utils.py", line 399, in CLI
return CLI(*args, **kwargs)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/jsonargparse/_cli.py", line 96, in CLI
return _run_component(components, cfg_init)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/jsonargparse/_cli.py", line 193, in _run_component
return component(**cfg)
File "litgpt/pretrain.py", line 90, in setup
main(fabric, devices, seed, resume, config, data, out_dir, tokenizer_dir, tokenizer, train, eval)
File "litgpt/pretrain.py", line 155, in main
fit(fabric, devices, state, train_dataloader, val_dataloader, out_dir, tokenizer_dir, train, eval)
File "litgpt/pretrain.py", line 175, in fit
validate(fabric, model, val_dataloader, max_iters=2) # sanity check
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "litgpt/pretrain.py", line 303, in validate
input_ids = batch[:, 0 : model.max_seq_length].contiguous().long()
TypeError: unhashable type: 'slice'
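The error is consistent with a batch-format mismatch: pretrain.py's validate() slices the batch as a plain token tensor, while (as far as I can tell) the finetuning data modules collate batches into dicts with "input_ids"/"labels" keys, and indexing a dict with a slice raises exactly this TypeError. A minimal sketch of that mismatch, under that assumption:

```python
import torch

# What pretraining expects: a plain LongTensor of token ids.
tensor_batch = torch.ones(4, 8, dtype=torch.long)

# What SFT-style collation (assumption) yields: a dict of tensors.
dict_batch = {"input_ids": tensor_batch, "labels": tensor_batch}

print(tensor_batch[:, 0:8].shape)  # works: slicing a tensor

try:
    dict_batch[:, 0:8]  # dict lookup with a (slice, slice) key; slices are unhashable
except TypeError as err:
    print(err)  # unhashable type: 'slice'
```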
I think we should exclude those finetuning datasets from the pretraining args?
@awaelchli Should we add a PretrainingDataset just like we have for SFT? Then the pretrain file could set this as the expected type.
@carmocca I'm not sure I understand. For finetuning we have data modules that instantiate an SFTDataset, but that is an implementation detail of the data module. In similar fashion, the pretraining data module chooses the streaming dataset as its internal way to index the data, but this is also an implementation detail that the data module can choose.
If we wanted to restrict which scripts can use which data modules, we could have base classes for pretraining and finetuning.
This is what I meant
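For illustration, a rough sketch of what such base classes could look like (hypothetical names, not the actual litgpt API), so that jsonargparse rejects a finetuning data module at parse time instead of failing deep inside validate():

```python
# Hypothetical sketch only; the class and function names below are made up for illustration.

class DataModule:
    """Common interface that every data module implements."""
    def setup(self) -> None: ...

class PretrainDataModule(DataModule):
    """Marker base for data modules that yield plain token tensors."""

class FinetuneDataModule(DataModule):
    """Marker base for data modules that yield dict batches (input_ids/labels)."""

class TinyLlama(PretrainDataModule):  # hypothetical pretraining data module
    pass

class Alpaca2k(FinetuneDataModule):   # hypothetical stand-in for litgpt.data.Alpaca2k
    pass

def setup(data: PretrainDataModule) -> None:
    """Pretraining entry point; annotating `data` as PretrainDataModule would let
    the CLI refuse a finetuning data module up front rather than at runtime."""
    print(f"pretraining with {type(data).__name__}")

if __name__ == "__main__":
    setup(TinyLlama())   # accepted
    # setup(Alpaca2k())  # would be rejected by the type annotation when parsed via the CLI
```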