Eric Mitchell
Got it - that's what I thought, but I wasn't sure how to let FSDP instantiate the model across several devices when doing the initial load on a meta device +...
@sgugger @pacman100 After some more experimentation, I think this *almost* gets the job done:

```
with init_empty_weights():
    policy = transformers.AutoModelForCausalLM.from_pretrained("EleutherAI/pythia-6.9b", cache_dir=get_cache_dir())

def reset_parameters(self) -> None:
    pass  # dummy function for...
```
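For anyone following along: I believe the dummy `reset_parameters` is needed because FSDP's default meta-device materialization calls `reset_parameters()` on each module it materializes. A rough, untested sketch of the wrap step I have in mind, assuming the dummy method has been attached to the relevant module classes, one GPU per rank, and an already-initialized process group; note the parameters end up uninitialized, which is why this only *almost* works:

```
# Untested sketch: FSDP materializes the meta-device modules on the local GPU
# and calls the (dummy) reset_parameters() on each one, so nothing crashes,
# but the resulting weights are uninitialized.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

fsdp_policy = FSDP(
    policy,                                 # the meta-device model from the snippet above
    device_id=torch.cuda.current_device(),  # assumes one GPU per rank
)
```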
To update on this, I think memory-efficient initialization with regular torch FSDP is possible (rough sketch below) by:

- Only loading model parameters on the rank 0 device (load on the meta...
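Roughly what I have in mind, as an untested sketch (assumes a single node with one GPU per rank, the NCCL backend, and the usual torchrun environment variables; the model name is just for illustration):

```
import torch
import torch.distributed as dist
import transformers
from accelerate import init_empty_weights
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")
rank = dist.get_rank()
torch.cuda.set_device(rank)

model_name = "EleutherAI/pythia-6.9b"
if rank == 0:
    # Only rank 0 actually loads the pretrained weights into CPU memory.
    model = transformers.AutoModelForCausalLM.from_pretrained(model_name)
else:
    # All other ranks build the architecture on the meta device (no real memory used).
    config = transformers.AutoConfig.from_pretrained(model_name)
    with init_empty_weights():
        model = transformers.AutoModelForCausalLM.from_config(config)

model = FSDP(
    model,
    device_id=torch.cuda.current_device(),
    sync_module_states=True,  # broadcast rank 0's real weights to the other ranks
    # Non-zero ranks hold meta tensors, so give FSDP a param_init_fn that just
    # allocates empty GPU storage; sync_module_states then fills it in from rank 0.
    param_init_fn=(None if rank == 0 else (lambda m: m.to_empty(device=torch.device("cuda")))),
)
```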
@pacman100 Sorry for the slow reply. I don't actually have a working example - this was just a hypothesis inspired by [this comment](https://github.com/pytorch/pytorch/blob/bffcfa9628d4c8e858ef5f2aeab34e021885e682/torch/distributed/fsdp/api.py#L302) in the PyTorch source. I will look into...
@pacman100 Sorry again for the delay. For an example of this approach and a discussion of some of the issues with it, check out https://github.com/pytorch/pytorch/issues/104026 on the PyTorch GitHub issues. I...
@ramit-wandb just wanted to check if there is any update on this issue! Thanks a lot.
Got it - maybe the blog post/docs [here](https://wandb.ai/stacey/nlg/reports/Tables-Tutorial-Visualize-Text-Data-Predictions---Vmlldzo1NzcwNzY) should be updated if this feature doesn't work for now, then?
Is there any update on this PR? I'm currently struggling to respond gracefully to CUDA OOM errors in a Jupyter notebook, and this fix looks promising.
Would love to know if there is any update on this issue @BlackSamorez. `tensor_parallel` works great for us for training (nice job!), but the inability to actually sample from the...
Hi! Sorry for the delayed response. If you run the eval scripts, you should get the same samples that we did, since the random seed is set. Or am I...
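For reference, by "the random seed is set" I mean seeding along these lines at the start of the eval script (illustrative only; the actual scripts may differ in the details):

```
import random

import numpy as np
import torch

def set_seed(seed: int = 0) -> None:
    """Seed the Python, NumPy, and PyTorch RNGs so sampling is reproducible."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)

set_seed(0)
```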