Carlos Mocholí

Results: 90 issues

https://github.com/Lightning-AI/lit-parrot/blob/main/scripts/convert_hf_checkpoint.py is a script that converts a list of `*.bin` files into a single checkpoint file: `lit_model.pth`. This has the following disadvantages:

- adds 1 extra step to get started...

enhancement
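For context, a minimal sketch of what such a shard-merging step does (hypothetical code, not the actual `convert_hf_checkpoint.py`, which also remaps Hugging Face parameter names to the model's layout):

```python
from pathlib import Path

import torch

def merge_bin_shards(checkpoint_dir: Path) -> None:
    merged = {}
    for shard in sorted(checkpoint_dir.glob("*.bin")):
        # each shard holds a disjoint slice of the full state dict
        merged.update(torch.load(shard, map_location="cpu"))
    torch.save(merged, checkpoint_dir / "lit_model.pth")
```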

See the posted comments for in-depth explanations. Memory usage was gathered with

```python
with torch.profiler.profile(record_shapes=True, profile_memory=True, with_stack=True) as p:
    # the training loop
    ...

from torch.cuda._memory_viz import profile_plot
with open('memory.html', 'w')...
```

TODO: rerun the memory requirements

Preview: https://github.com/Lightning-AI/lit-gpt/blob/carmocca/oom-howto/howto/oom.md
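For reference, a self-contained version of the profiling recipe above might look like the following. This is a sketch with a toy model standing in for the real training loop; note that `torch.cuda._memory_viz.profile_plot` is a private PyTorch utility and may change between releases.

```python
import torch

# toy stand-ins for the real model and optimizer
model = torch.nn.Linear(1024, 1024, device="cuda")
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)

with torch.profiler.profile(record_shapes=True, profile_memory=True, with_stack=True) as p:
    for _ in range(3):  # the training loop
        loss = model(torch.randn(64, 1024, device="cuda")).sum()
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

from torch.cuda._memory_viz import profile_plot

# profile_plot renders an HTML view of the profiled memory timeline
with open("memory.html", "w") as f:
    f.write(profile_plot(p))
```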

As proposed by @robieta, I removed the LoRA context manager in favor of a separate model that implements this, just as we do for the adapter.
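For illustration, the "separate model" approach looks roughly like this hypothetical sketch (not the actual lit-gpt implementation): instead of a context manager that swaps `nn.Linear` at model-construction time, the LoRA variant defines its own layer explicitly.

```python
import math

import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """A linear layer with a trainable low-rank update: y = x W^T + scaling * x A^T B^T."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.lora_A = nn.Parameter(torch.empty(r, in_features))
        self.lora_B = nn.Parameter(torch.zeros(out_features, r))  # zero init: no-op at start
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

Keeping this in a dedicated model mirrors how the adapter variant is structured and avoids the action-at-a-distance of patching layers inside a context manager.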

The same script settings on a single device do not produce NaNs.

bug
fine-tuning

Port of `pretrain/openwebtext.py` using the `Trainer`.
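A skeleton of such a port might look like the following (a hypothetical sketch of the `Trainer`-based structure, with assumed names like `Pretrain`; not the actual script):

```python
import lightning as L
import torch
import torch.nn.functional as F

class Pretrain(L.LightningModule):
    def __init__(self, model: torch.nn.Module):
        super().__init__()
        self.model = model

    def training_step(self, batch, batch_idx):
        # each batch is a (input_ids, targets) pair of token tensors
        input_ids, targets = batch
        logits = self.model(input_ids)
        return F.cross_entropy(logits.view(-1, logits.size(-1)), targets.view(-1))

    def configure_optimizers(self):
        return torch.optim.AdamW(self.parameters(), lr=6e-4, weight_decay=0.1)

# usage sketch: trainer = L.Trainer(precision="bf16-mixed", max_steps=600_000)
#               trainer.fit(Pretrain(gpt), train_dataloader)
```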

### Bug description

Since data in the spawned region is not shared with the main process, the spawn launcher saves a checkpoint of the weights before finishing that is then... (a minimal illustration follows the labels below)

bug
priority: 1
strategy: ddp
strategy: xla
ver: 2.0.x
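To illustrate the underlying mechanics (a standalone example, not Lightning's launcher code): state mutated inside a spawned worker lives in a separate process, so the only way to hand it back to the main process is to serialize it, e.g. through a checkpoint file.

```python
import torch
import torch.multiprocessing as mp

def worker(rank: int) -> None:
    # this dict exists only in the spawned process's memory
    weights = {"w": torch.full((2, 2), float(rank + 1))}
    torch.save(weights, "spawn_ckpt.pt")  # hypothetical filename

if __name__ == "__main__":
    mp.spawn(worker, nprocs=1)
    # the parent never saw the worker's tensors; it must load them from disk
    weights = torch.load("spawn_ckpt.pt")
    print(weights["w"])
```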