Carlos Mocholí
https://github.com/Lightning-AI/lit-parrot/blob/main/scripts/convert_hf_checkpoint.py is a script that converts a list of `*.bin` files into a single checkpoint file, `lit_model.pth`. This has the disadvantage that it:
- adds 1 extra step to get started...
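For context, a minimal sketch of what such a shard-merging conversion does, assuming the `*.bin` shards are plain `torch.save`d state dicts and ignoring the weight-name remapping the real script performs (function name is illustrative, not the script's API):

```python
from pathlib import Path

import torch


def merge_bin_shards(checkpoint_dir: str, output_name: str = "lit_model.pth") -> None:
    checkpoint_dir = Path(checkpoint_dir)
    merged = {}
    # Each `*.bin` shard holds a disjoint slice of the full state dict.
    for shard in sorted(checkpoint_dir.glob("*.bin")):
        merged.update(torch.load(shard, map_location="cpu"))
    # Write everything back out as a single checkpoint file.
    torch.save(merged, checkpoint_dir / output_name)
```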
See the posted comments for in-depth explanations. Memory usage was gathered with:

```python
with torch.profiler.profile(record_shapes=True, profile_memory=True, with_stack=True) as p:
    # the training loop
    ...

from torch.cuda._memory_viz import profile_plot
with open('memory.html', 'w') as f:
    f.write(profile_plot(p))
```
TODO: rerun the memory requirements
Preview: https://github.com/Lightning-AI/lit-gpt/blob/carmocca/oom-howto/howto/oom.md
Proposed by @robieta. I removed the LoRA context manager in favor of a separate model to implement this, just as we do for adapter.
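For illustration, a minimal sketch of the idea: rather than patching `nn.Linear` through a context manager, the LoRA variant lives in its own module. The `LoRALinear` class below is hypothetical, not lit-gpt's actual implementation:

```python
import math

import torch
import torch.nn as nn


class LoRALinear(nn.Module):
    """nn.Linear with a trainable low-rank update (illustrative only)."""

    def __init__(self, in_features: int, out_features: int, r: int = 8, alpha: int = 16):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
        self.linear.weight.requires_grad = False  # base weight stays frozen
        # Low-rank update: W + (alpha / r) * B @ A, with only A and B trainable
        self.lora_a = nn.Parameter(torch.randn(r, in_features) / math.sqrt(r))
        self.lora_b = nn.Parameter(torch.zeros(out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.linear(x) + self.scaling * (x @ self.lora_a.T @ self.lora_b.T)
```

A dedicated LoRA model class can then use this layer directly in its definition, mirroring how the adapter variant is kept as a separate model.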
The same script settings on a single device do not produce NaNs.
Port of `pretrain/openwebtext.py` using the `Trainer`.
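A rough sketch of what such a port looks like, with hypothetical names (`PretrainModule`, `GPT`, `train_dataloader`), not the actual `pretrain/openwebtext.py` code:

```python
import lightning as L
import torch


class PretrainModule(L.LightningModule):
    def __init__(self, model: torch.nn.Module, learning_rate: float = 6e-4):
        super().__init__()
        self.model = model
        self.learning_rate = learning_rate

    def training_step(self, batch, batch_idx):
        input_ids, targets = batch
        logits = self.model(input_ids)
        # Next-token prediction loss over the flattened sequence
        loss = torch.nn.functional.cross_entropy(
            logits.view(-1, logits.size(-1)), targets.view(-1)
        )
        self.log("train_loss", loss)
        return loss

    def configure_optimizers(self):
        return torch.optim.AdamW(self.model.parameters(), lr=self.learning_rate)


trainer = L.Trainer(max_steps=600_000, precision="bf16-mixed", accelerator="cuda", devices=1)
# trainer.fit(PretrainModule(GPT(config)), train_dataloaders=train_dataloader)
```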
### Bug description

Since data in the spawned region is not shared with the main process, the spawn launcher saves a checkpoint of the weights before finishing, which is then...
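A minimal sketch of the mechanism being described (not Lightning's actual launcher code): because a spawned process does not share memory with the main process, the trained weights have to be written to disk inside the spawned region and reloaded afterwards.

```python
import os
import tempfile

import torch
import torch.multiprocessing as mp


def _worker(rank: int, ckpt_path: str) -> None:
    model = torch.nn.Linear(4, 4)  # stand-in for the real model / training loop
    # ... train ...
    if rank == 0:
        # Hand the results back to the main process via disk
        torch.save(model.state_dict(), ckpt_path)


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "weights.ckpt")
    mp.spawn(_worker, args=(path,), nprocs=2, join=True)
    # Back in the main process: restore what the spawned processes produced
    state_dict = torch.load(path, map_location="cpu")
```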