Carlos Mocholí
https://github.com/Lightning-AI/lit-parrot/blob/main/scripts/convert_hf_checkpoint.py is a script that converts a list of `*.bin` files into a single checkpoint file, `lit_model.pth`. This has the disadvantage that it:
- adds 1 extra step to get started...
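For context, a minimal sketch of what such a shard-merging conversion does, assuming the `*.bin` shards are plain `torch.save`d state dicts and ignoring the weight-name remapping the real script performs (function name is illustrative, not the script's API):

```python
from pathlib import Path

import torch


def merge_bin_shards(checkpoint_dir: str, output_name: str = "lit_model.pth") -> None:
    checkpoint_dir = Path(checkpoint_dir)
    merged = {}
    # Each `*.bin` shard holds a disjoint slice of the full state dict.
    for shard in sorted(checkpoint_dir.glob("*.bin")):
        merged.update(torch.load(shard, map_location="cpu"))
    # Write everything back out as a single checkpoint file.
    torch.save(merged, checkpoint_dir / output_name)
```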
See the posted comments for in-depth explanations. Memory usage was gathered with:

```python
with torch.profiler.profile(record_shapes=True, profile_memory=True, with_stack=True) as p:
    # the training loop
    ...

from torch.cuda._memory_viz import profile_plot
with open('memory.html', 'w') as f:
    f.write(profile_plot(p))
```
TODO: rerun the memory requirements
Preview: https://github.com/Lightning-AI/lit-gpt/blob/carmocca/oom-howto/howto/oom.md
Proposed by @robieta. I removed the LoRA context manager in favor of a separate model to implement this, just as we do for adapter.
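For illustration, a minimal sketch of the idea: rather than patching `nn.Linear` through a context manager, the LoRA variant lives in its own module. The `LoRALinear` class below is hypothetical, not lit-gpt's actual implementation:

```python
import math

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """nn.Linear with a trainable low-rank update (illustrative only)."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.linear.weight.requires_grad = False  # base weight stays frozen
        # Low-rank update: W + (alpha / r) * B @ A, with only A and B trainable
        self.lora_a = nn.Parameter(torch.randn(r, in_features) / math.sqrt(r))
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)
```

A dedicated LoRA model class can then use this layer directly in its definition, mirroring how the adapter variant is kept as a separate model.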
The same script settings on a single device do not produce NaNs.
Port of `pretrain/openwebtext.py` using the `Trainer`.
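A rough sketch of what such a port looks like, with hypothetical names (`PretrainModule`, `GPT`, `train_dataloader`), not the actual `pretrain/openwebtext.py` code:

```python
import lightning as L
import torch


class PretrainModule(L.LightningModule):
    def __init__(self, model: torch.nn.Module, learning_rate: float = 6e-4):
        super().__init__()
        self.model = model
        self.learning_rate = learning_rate

    def training_step(self, batch, batch_idx):
        input_ids, targets = batch
        logits = self.model(input_ids)
        # Next-token prediction loss over the flattened sequence
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1)
        )
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.model.parameters(), lr=self.learning_rate)


trainer = L.Trainer(max_steps=600_000, precision="bf16-mixed", accelerator="cuda", devices=1)
# trainer.fit(PretrainModule(GPT(config)), train_dataloaders=train_dataloader)
```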
### Bug description

Since data in the spawned region is not shared with the main process, the spawn launcher saves a checkpoint of the weights before finishing, which is then...
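A minimal sketch of the mechanism being described (not Lightning's actual launcher code): because a spawned process does not share memory with the main process, the trained weights have to be written to disk inside the spawned region and reloaded afterwards.

```python
import os
import tempfile

import torch
import torch.multiprocessing as mp


def _worker(rank: int, ckpt_path: str) -> None:
    model = torch.nn.Linear(4, 4)  # stand-in for the real model / training loop
    # ... train ...
    if rank == 0:
        # Hand the results back to the main process via disk
        torch.save(model.state_dict(), ckpt_path)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "weights.ckpt")
    mp.spawn(_worker, args=(path,), nprocs=2, join=True)
    # Back in the main process: restore what the spawned processes produced
    state_dict = torch.load(path, map_location="cpu")
```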