
Results 342 comments of NanoCode012

@zinccat, correct me if I'm wrong, but is the shape for the router mixed up?

```
self.weight = nn.Parameter(torch.empty(config.num_experts, config.hidden_size, dtype=torch.bfloat16))
```

Should it be:

```
self.weight = nn.Parameter(torch.empty(...
```
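For reference, a minimal sketch of the two shape conventions (names and sizes here are illustrative, not from the linked code): `F.linear`/`nn.Linear` store the weight as `(out_features, in_features)`, i.e. `(num_experts, hidden_size)` for a router producing expert logits, while a plain `x @ weight` matmul needs the transposed layout, `(hidden_size, num_experts)`:

```
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden_size, num_experts = 1024, 8
x = torch.randn(4, hidden_size, dtype=torch.bfloat16)  # (tokens, hidden)

# F.linear expects weight of shape (out_features, in_features),
# i.e. (num_experts, hidden_size) for a router producing expert logits.
w_linear = torch.empty(num_experts, hidden_size, dtype=torch.bfloat16)
nn.init.normal_(w_linear, std=0.02)
logits_a = F.linear(x, w_linear)   # (tokens, num_experts)

# A plain matmul without a transpose needs the opposite layout.
w_matmul = torch.empty(hidden_size, num_experts, dtype=torch.bfloat16)
nn.init.normal_(w_matmul, std=0.02)
logits_b = x @ w_matmul            # (tokens, num_experts)

assert logits_a.shape == logits_b.shape == (4, num_experts)
```

So whether the original shape is "mixed up" depends on how the forward pass consumes the weight.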

Hey, FSDP2 with `cpu_ram_efficient_loading` should work in Axolotl. Could you let me know if you've given it a try?
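In case it helps, here is a rough sketch of the config fragment I mean; I'm assuming the key names (`fsdp_version`, `fsdp_config`, and the fields under it) from recent Axolotl docs, so please double-check them against your installed version:

```
# Assumed key names -- verify against the Axolotl docs for your version.
fsdp_version: 2
fsdp_config:
  cpu_ram_efficient_loading: true  # the option mentioned above
  offload_params: false
  state_dict_type: FULL_STATE_DICT
```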

CI passes and the change is minimal, so nothing major should be affected.

Re: https://github.com/axolotl-ai-cloud/axolotl/issues/2878#issuecomment-3051834944

> Offering help with QA-LoRA adapter merge process! Since PEFT doesn't support adapter merging with quantized models yet, I've implemented a custom solution. Successfully replicated the QA-LoRA paper...
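For anyone following along, the usual shape of such a merge (a generic sketch, not the author's actual implementation; the helper name is hypothetical) is to dequantize the base weight, fold in the adapter delta, and requantize with the original scheme:

```
import torch

def merge_lora_into_dequantized(w_dequant: torch.Tensor,
                                lora_A: torch.Tensor,
                                lora_B: torch.Tensor,
                                scaling: float) -> torch.Tensor:
    """Standard LoRA merge applied to a dequantized base weight.

    w_dequant: (out_features, in_features), dequantized to fp16/fp32
    lora_A:    (r, in_features)
    lora_B:    (out_features, r)
    scaling:   lora_alpha / r
    """
    # delta_W = scaling * B @ A has the same shape as w_dequant.
    return w_dequant + scaling * (lora_B @ lora_A)

# The merged weight would then be re-quantized group-wise (qweight/qzeros/
# scales) -- exactly the step where the qzero handling discussed below matters.
```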

@gapsong

> I noticed the qzero values are currently being quantized during the save process.

Could you share where this is happening in PEFT?

Hey, thanks for the issue. One thing I noticed is the `type: chat_template`. In the linked example, we pointed to a new transform https://github.com/axolotl-ai-cloud/grpo_code/blob/148ea79321f34bbed79b3b55f04c0a7de002665d/grpo_code/transforms.py#L34, which properly loads the...

Which model is this? Does vLLM's `EngineArgs` support that param?
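One quick way to check, assuming a vLLM version where `EngineArgs` is a dataclass in `vllm.engine.arg_utils`:

```
# List the parameters vLLM's EngineArgs accepts, to see whether the
# flag in question is supported by the installed version.
import dataclasses
from vllm.engine.arg_utils import EngineArgs

supported = {f.name for f in dataclasses.fields(EngineArgs)}
print("tensor_parallel_size" in supported)  # example lookup
```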

Thanks, can you try setting the values below to `None`?

https://github.com/axolotl-ai-cloud/axolotl/blob/7026cd5e9e053d51aa271c1f57f62950bcdc599f/src/axolotl/cli/vllm_serve.py#L65-L67

Alternatively, just delete this line:

https://github.com/axolotl-ai-cloud/axolotl/blob/7026cd5e9e053d51aa271c1f57f62950bcdc599f/src/axolotl/cli/vllm_serve.py#L81

I haven't seen that CUDA graph log before; I'll ask the team. In the meantime, where are you running this? RunPod? Locally?

Just to verify, are you able to run `vllm serve ...` directly, to see whether it's a vLLM issue or an Axolotl issue?