NanoCode012
Thanks for the PR. I've started the tests running while we take some time to confirm the original issue. Good catch on `num_items_in_batch`; it's still haunting us till...
> @NanoCode012 All fixed! Used pop(None), removed the v2 override, and fixed the lint issue. Thanks, letting the test re-run
PR #3141 on hold atm
We currently use a custom implementation for Muon (https://github.com/axolotl-ai-cloud/axolotl-contribs-mit/blob/main/src/axolotl/contribs/mit/muon.py). We're open to a PR for this if you would like to give it a try.
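(For context, Muon's core update is a momentum step followed by an approximate orthogonalization of the 2D gradient via a Newton-Schulz iteration. The sketch below is only a minimal illustration of that idea in plain PyTorch; it is not the contrib implementation linked above, and the helper names `newton_schulz` / `muon_step` are made up for the example.)

```python
import torch


def newton_schulz(G: torch.Tensor, steps: int = 5, eps: float = 1e-7) -> torch.Tensor:
    # Approximately orthogonalize a 2D matrix with a quintic Newton-Schulz iteration.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + eps)
    transposed = G.size(0) > G.size(1)
    if transposed:
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if transposed else X


def muon_step(param, grad, momentum_buf, lr=0.02, momentum=0.95):
    # One in-place Muon update for a single 2D weight matrix:
    # momentum accumulation, Nesterov-style lookahead, orthogonalized step.
    # (Scaling variants and non-2D params are omitted for brevity.)
    momentum_buf.mul_(momentum).add_(grad)
    update = newton_schulz(grad.add(momentum_buf, alpha=momentum))
    param.data.add_(update, alpha=-lr)


# Toy usage on a random 2D weight
w = torch.randn(64, 32)
g = torch.randn_like(w)
buf = torch.zeros_like(w)
muon_step(w, g, buf)
```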
I wonder if this is because you're using `sample_packing_eff_est = 1.0`? I'm not sure a packing efficiency of 1.0 is actually reachable in practice. If you try `0.9`, do the calculation and num steps match?
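(To make the arithmetic concrete: the efficiency estimate effectively discounts how many tokens are assumed to fit into each packed batch, so a 1.0 estimate yields fewer estimated steps than 0.9. The numbers and formula below are purely illustrative, not axolotl's actual calculation.)

```python
# Illustrative only -- hypothetical numbers, not axolotl's exact formula.
import math

total_tokens = 10_000_000  # hypothetical dataset size in tokens
sequence_len = 4096
micro_batch_size = 2
gradient_accumulation_steps = 4
world_size = 1


def est_steps(sample_packing_eff_est: float) -> int:
    # Tokens consumed per optimizer step, discounted by how "full"
    # each packed sequence is expected to be.
    tokens_per_step = (
        sequence_len
        * sample_packing_eff_est
        * micro_batch_size
        * gradient_accumulation_steps
        * world_size
    )
    return math.ceil(total_tokens / tokens_per_step)


print(est_steps(1.0))  # 306 -- assumes perfect packing, so underestimates steps
print(est_steps(0.9))  # 340 -- roughly 11% more steps than the 1.0 estimate
```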
Hey, thanks @aidando73. I whipped up a PR for this if you're interested in testing it. I did not do any testing on it, so it's very rough, especially...
@aidando73, just pushed a fix for this. I'll find some time later to iron out the bugs.
Hello, thanks for the report. It has been some time since we worked on causal LM eval, so there may have been conflicts in logging. If you have any bandwidth, would...
Yep, this is something I was just checking. I saw that the upstream transformers EP PR was merged: https://github.com/huggingface/transformers/pull/39501. It uses `kernels-community/megablocks` (not sure if it's the same as Databricks' one). I...
@zinccat, could you share how you got it working for qwen3, for reference purposes? This is currently a WIP for us.