distributed-training-guide

Best practices & guides on how to write distributed PyTorch training code

10 distributed-training-guide issues

To the best of my knowledge, the Hugging Face [accelerate](https://huggingface.co/docs/accelerate/index) library is one of the best-known options for distributed training. Is there any plan to cover accelerate?
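
For reference, a minimal sketch of what an accelerate-based training loop looks like; the toy model, optimizer, and dataloader here are placeholders, not code from this guide:

```python
import torch
from accelerate import Accelerator

accelerator = Accelerator()

# Placeholder model and data, just to make the sketch runnable.
model = torch.nn.Linear(128, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataloader = torch.utils.data.DataLoader(
    torch.utils.data.TensorDataset(
        torch.randn(64, 128), torch.randint(0, 10, (64,))
    ),
    batch_size=8,
)

# accelerate wraps these objects for whichever backend `accelerate launch`
# was configured with (DDP, FSDP, DeepSpeed, ...), so the loop body itself
# stays backend-agnostic.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

for inputs, targets in dataloader:
    optimizer.zero_grad()
    loss = torch.nn.functional.cross_entropy(model(inputs), targets)
    accelerator.backward(loss)  # replaces loss.backward()
    optimizer.step()
```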

I finally made it to the end of the examples (well, the deepspeed one just hangs for me), and now I'm hungry for more! I think it would be nice...

The default is 25 MB, which is too small even for Llama 8B. It might be best to suggest using the max param size. It's unclear how this impacts things - also guidance...
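
For context, a sketch assuming the 25 MB default refers to DDP's `bucket_cap_mb` gradient-bucket size; sizing the cap from the largest parameter is one interpretation of "max param size", not something the guide itself recommends. It assumes the process group is already initialized:

```python
import torch
from torch.nn.parallel import DistributedDataParallel as DDP

def max_param_size_mb(model: torch.nn.Module) -> float:
    """Return the size of the largest parameter tensor in MB."""
    return max(p.numel() * p.element_size() for p in model.parameters()) / 2**20

def wrap_ddp(model: torch.nn.Module, local_rank: int) -> DDP:
    # bucket_cap_mb defaults to 25; raising it means fewer, larger all-reduces
    # during the backward pass, which can matter for models with huge layers.
    cap = max(25, int(max_param_size_mb(model)) + 1)
    return DDP(model.to(local_rank), device_ids=[local_rank], bucket_cap_mb=cap)
```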

I tried the code changes for MPI as described in 03-job-launchers/README.md, but soon realised that the local rank was missing. I see that you added it as a command arg,...
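
For context, a sketch of reading the local rank from launcher environment variables instead of passing it as a command-line argument; the `OMPI_COMM_WORLD_LOCAL_RANK` and `MPI_LOCALRANKID` names are set by Open MPI and Intel MPI respectively, and other launchers may use different variables:

```python
import os

def mpi_local_rank() -> int:
    # Check the common launcher-provided variables in order; LOCAL_RANK is
    # what torchrun sets, included here as a fallback.
    for var in ("OMPI_COMM_WORLD_LOCAL_RANK", "MPI_LOCALRANKID", "LOCAL_RANK"):
        if var in os.environ:
            return int(os.environ[var])
    raise RuntimeError("Could not determine local rank from the environment")
```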

I've stared at the code and I can't figure out why the ddp & tensor scripts output the training stats every 10 seconds but the fsdp one doesn't. The code...

First of all, thank you so much for putting together these tutorials! I am slowly working through them and trying to better understand how it all fits together. I have...

First of all, I have to say that these are phenomenal tutorials! But I came across the following issue. In the written tutorial for 02, you note that checkpointing sharded...