Thomas Capelle

Results 169 comments of Thomas Capelle

I used @lucidrains' implementation

I can confirm the same error when finetuning Mistral with the ChatML format and DeepSpeed stage 3. ``` loading model Traceback (most recent call last): File "/home/ubuntu/llm_recipes/scripts/push2hub.py", line 33, in model = AutoModelForCausalLM.from_pretrained(config.model_path,...

Hello! Can you share the W&B workspace with some context on the runs? What I do most of the time when I create a process per node is to use the group...

Cool, thanks for the heads up. You will need to create the runs manually, with the rank in the name. I would do something like: ```python wandb.init( ... name=f"node_{global_rank}_local_rank_{local_rank}", group=f"node_{global_rank}", )...
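A minimal sketch of what that per-rank initialization might look like, assuming the ranks are read from the `RANK`/`LOCAL_RANK` environment variables that torchrun-style launchers set; the project name is hypothetical.

```python
import os

import wandb

# Ranks as typically exported by torchrun / torch.distributed launchers (assumption).
global_rank = int(os.environ.get("RANK", 0))
local_rank = int(os.environ.get("LOCAL_RANK", 0))

run = wandb.init(
    project="my-multinode-job",  # hypothetical project name
    name=f"node_{global_rank}_local_rank_{local_rank}",
    group=f"node_{global_rank}",  # group name derived from the rank, mirroring the snippet above
)
```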

Can I ask you to delete the irrelevant runs? How many nodes are you running?

Cool, so one run per GPU; you shouldn't need that. One process per node should suffice. What you can do is wrap your init call with: ```python local_rank = int(os.environ['LOCAL_RANK'])...
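A minimal sketch of that guard, assuming the launcher exposes `LOCAL_RANK`; only the process with local rank 0 on each node creates a W&B run, and logging calls are guarded the same way. The project name and metric are placeholders.

```python
import os

import wandb

# LOCAL_RANK is set by torchrun and similar launchers (assumption).
local_rank = int(os.environ["LOCAL_RANK"])

# Only the first process on each node initializes W&B.
if local_rank == 0:
    wandb.init(project="my-multinode-job")  # hypothetical project name

# Guard logging calls the same way so the other ranks never touch wandb.
if local_rank == 0:
    wandb.log({"loss": 0.123})  # placeholder metric for illustration
```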

There is currently a bug that hides the system metrics on processes that don't log any metric (this is the case for your non-main processes). The workaround is logging...
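A minimal sketch of that workaround, assuming one W&B run per process with the rank taken from the environment; the non-main ranks log a trivial placeholder metric so their system metrics show up. Names are illustrative.

```python
import os

import wandb

# Assuming one W&B run per process, with the rank read from the environment.
rank = int(os.environ.get("RANK", 0))
wandb.init(project="my-multinode-job", name=f"rank_{rank}")  # hypothetical names

# Workaround: a process that never calls wandb.log does not surface system metrics,
# so have the non-main ranks log a trivial placeholder value once.
if rank != 0:
    wandb.log({"heartbeat": 1})
```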

Hey, can I help you here? Looks similar to what I was working on: https://github.com/pytorch/torchtune/pull/730