Thomas Capelle
I used @lucidrains' implementation
I can confirm the same error when fine-tuning Mistral with the ChatML format and DeepSpeed ZeRO-3.
```
loading model
Traceback (most recent call last):
  File "/home/ubuntu/llm_recipes/scripts/push2hub.py", line 33, in <module>
    model = AutoModelForCausalLM.from_pretrained(config.model_path,...
```
I am doing a full fine-tune, no QLoRA.
Hello! Can you share the W&B workspace with some context on the runs? What I do most of the time when I create one process per node is use the group...
Cool, thanks for the heads up. You will need to create the runs with the rank in the name manually. I would do something like:
```python
wandb.init(
    ...
    name=f"node_{global_rank}_local_rank_{local_rank}",
    group=f"node_{global_rank}",
)
```
...
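For reference, a self-contained version of that call could look like this (a minimal sketch assuming torchrun-style environment variables; the project name is a placeholder):
```python
import os

import wandb

# torchrun sets RANK (global rank) and LOCAL_RANK (rank within the node).
global_rank = int(os.environ["RANK"])
local_rank = int(os.environ["LOCAL_RANK"])

wandb.init(
    project="my-project",  # placeholder project name
    name=f"node_{global_rank}_local_rank_{local_rank}",
    group=f"node_{global_rank}",
)
```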
Please make your wandb project public so I can inspect...
Can I ask you to delete the non-relevant runs? How many nodes are you running?
Cool, so one run per GPU; you shouldn't need that. One process per node should suffice. What you can do is wrap your init call with:
```python
local_rank = int(os.environ['LOCAL_RANK'])...
```
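The full pattern is presumably something like this (a minimal sketch; the project name is a placeholder):
```python
import os

import wandb

# Only the first process on each node (LOCAL_RANK == 0) creates a run,
# so you get one run per node instead of one run per GPU.
local_rank = int(os.environ["LOCAL_RANK"])
if local_rank == 0:
    wandb.init(project="my-project")  # placeholder project name
```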
There is currently a bug that hides the system metrics on processes that don't log any metrics (this is the case for your non-main processes). The workaround is logging...
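Presumably something like this from each non-main process (a sketch; the metric name is illustrative):
```python
import wandb

run = wandb.init(group="my-group")  # placeholder group name

# Logging any metric at all is enough for the system metrics
# (GPU/CPU utilization, etc.) to show up for this process.
run.log({"heartbeat": 0})
```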
Hey, can I help you here? Looks similar to what I was working on: https://github.com/pytorch/torchtune/pull/730