Exclude finetuning datasets from the `pretrain.py` arguments
The pretrain.py script lists the Alpaca dataset and all other finetuning datasets, but I don't think they are supported for pretraining.
E.g.,
python litgpt/pretrain.py \
--data litgpt.data.Alpaca2k \
--model_name tiny-llama-1.1b \
--tokenizer_dir checkpoints/TinyLlama/TinyLlama-1.1B-intermediate-step-1431k-3T/
File "litgpt/pretrain.py", line 381, in <module>
CLI(setup)
File "/teamspace/studios/this_studio/lit-gpt/litgpt/utils.py", line 399, in CLI
return CLI(*args, **kwargs)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/jsonargparse/_cli.py", line 96, in CLI
return _run_component(components, cfg_init)
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/jsonargparse/_cli.py", line 193, in _run_component
return component(**cfg)
File "litgpt/pretrain.py", line 90, in setup
main(fabric, devices, seed, resume, config, data, out_dir, tokenizer_dir, tokenizer, train, eval)
File "litgpt/pretrain.py", line 155, in main
fit(fabric, devices, state, train_dataloader, val_dataloader, out_dir, tokenizer_dir, train, eval)
File "litgpt/pretrain.py", line 175, in fit
validate(fabric, model, val_dataloader, max_iters=2) # sanity check
File "/home/zeus/miniconda3/envs/cloudspace/lib/python3.8/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
File "litgpt/pretrain.py", line 303, in validate
input_ids = batch[:, 0 : model.max_seq_length].contiguous().long()
TypeError: unhashable type: 'slice'
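The error is consistent with a batch-format mismatch: pretrain.py's validate() slices the batch as a plain token tensor, while (as far as I can tell) the finetuning data modules collate batches into dicts with "input_ids"/"labels" keys, and indexing a dict with a slice raises exactly this TypeError. A minimal sketch of that mismatch, under that assumption:

```python
import torch

# What pretraining expects: a plain LongTensor of token ids.
tensor_batch = torch.ones(4, 8, dtype=torch.long)

# What SFT-style collation (assumption) yields: a dict of tensors.
dict_batch = {"input_ids": tensor_batch, "labels": tensor_batch}

print(tensor_batch[:, 0:8].shape)  # works: slicing a tensor

try:
    dict_batch[:, 0:8]  # dict lookup with a (slice, slice) key; slices are unhashable
except TypeError as err:
    print(err)  # unhashable type: 'slice'
```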
I think we should exclude those finetuning datasets from the pretraining args?
@awaelchli Should we add a PretrainingDataset just like we have for SFT? Then the pretrain file could set this as the expected type.
@carmocca I'm not sure I understand. For finetuning we have data modules that instantiate an SFTDataset, but that is an implementation detail of the data module. In similar fashion, the pretraining data module chooses the streaming dataset as its internal way to index the data, but this is also an implementation detail that the data module can choose.
If we wanted to restrict which scripts can use which data modules, we could have base classes for pretraining and finetuning.
This is what I meant
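For illustration, a rough sketch of what such base classes could look like (hypothetical names, not the actual litgpt API), so that jsonargparse rejects a finetuning data module at parse time instead of failing deep inside validate():

```python
# Hypothetical sketch only; the class and function names below are made up for illustration.

class DataModule:
    """Common interface that every data module implements."""
    def setup(self) -> None: ...

class PretrainDataModule(DataModule):
    """Marker base for data modules that yield plain token tensors."""

class FinetuneDataModule(DataModule):
    """Marker base for data modules that yield dict batches (input_ids/labels)."""

class TinyLlama(PretrainDataModule):  # hypothetical pretraining data module
    pass

class Alpaca2k(FinetuneDataModule):   # hypothetical stand-in for litgpt.data.Alpaca2k
    pass

def setup(data: PretrainDataModule) -> None:
    """Pretraining entry point; annotating `data` as PretrainDataModule would let
    the CLI refuse a finetuning data module up front rather than at runtime."""
    print(f"pretraining with {type(data).__name__}")

if __name__ == "__main__":
    setup(TinyLlama())   # accepted
    # setup(Alpaca2k())  # would be rejected by the type annotation when parsed via the CLI
```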