Thomas Capelle
I used @lucidrains' implementation
I can confirm the same error when fine-tuning Mistral with the ChatML format and DeepSpeed ZeRO-3.
```
loading model
Traceback (most recent call last):
  File "/home/ubuntu/llm_recipes/scripts/push2hub.py", line 33, in <module>
    model = AutoModelForCausalLM.from_pretrained(config.model_path,...
```
I am doing a full fine-tune, no QLoRA.
Hello! Can you share the W&B workspace with some context on the runs? What I do most of the time when I create one process per node is use the group...
Cool, thanks for the heads up. You will need to create the runs with the rank in the name manually. I would do something like:
```python
wandb.init(
    ...
    name=f"node_{global_rank}_local_rank_{local_rank}",
    group=f"node_{global_rank}",
)
```
...
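For reference, a self-contained version of that call could look like this (a minimal sketch assuming torchrun-style environment variables; the project name is a placeholder):
```python
import os

import wandb

# torchrun sets RANK (global rank) and LOCAL_RANK (rank within the node).
global_rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])

wandb.init(
    project="my-project",  # placeholder project name
    name=f"node_{global_rank}_local_rank_{local_rank}",
    group=f"node_{global_rank}",
)
```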
Please make your wandb project public so I can inspect...
Can I ask you to delete the non-relevant runs? How many nodes are you running?
Cool, so one run per GPU; you shouldn't need that. One process per node should suffice. What you can do is wrap your init call with:
```python
local_rank = int(os.environ['LOCAL_RANK'])...
```
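The full pattern is presumably something like this (a minimal sketch; the project name is a placeholder):
```python
import os

import wandb

# Only the first process on each node (LOCAL_RANK == 0) creates a run,
# so you get one run per node instead of one run per GPU.
local_rank = int(os.environ["LOCAL_RANK"])
if local_rank == 0:
    wandb.init(project="my-project")  # placeholder project name
```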
There is currently a bug that hides the system metrics on processes that don't log any metrics (this is the case for your non-main processes). The workaround is logging...
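Presumably something like this from each non-main process (a sketch; the metric name is illustrative):
```python
import wandb

run = wandb.init(group="my-group")  # placeholder group name

# Logging any metric at all is enough for the system metrics
# (GPU/CPU utilization, etc.) to show up for this process.
run.log({"heartbeat": 0})
```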
Hey, can I help you here? Looks similar to what I was working on: https://github.com/pytorch/torchtune/pull/730