NanoCode012
Sure, thanks for the follow-up.
Thanks for the report. I did not see where the Liger plugin calls `get_trainer_cls`; in fact, it shouldn't be doing so. Did you modify the plugin code? Could you...
Correction: this is likely an issue on our end. Could you please give the linked PR a try? This error sounds familiar, as I fixed it once before in another...
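(For context on why the plugin itself shouldn't be calling `get_trainer_cls`: in the usual plugin hook pattern, the framework is the sole caller of the hook, and a plugin only defines it. A minimal sketch, with all names assumed rather than taken from the actual plugin API:)

```python
# Minimal sketch (all names assumed) of the hook pattern described above:
# the plugin manager calls each plugin's get_trainer_cls; a plugin like
# Liger defines (or inherits) the hook but never calls it itself.

class BasePlugin:
    def get_trainer_cls(self, cfg):
        # Default: no trainer override.
        return None

class LigerPlugin(BasePlugin):
    # Only patches kernels; inherits the no-op hook and never calls it.
    pass

def resolve_trainer_cls(plugins, cfg, default_cls):
    # The framework is the only caller of the hook.
    for plugin in plugins:
        cls = plugin.get_trainer_cls(cfg)
        if cls is not None:
            return cls
    return default_cls
```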
Thanks for the PR and for figuring it out! I think a lot of people would prefer bf16 master weights to save VRAM. Could an alternative solution be creating...
Ok, after some internal discussion, I'm good with this PR now. My next thought is whether to convert the existing example YAMLs to use `bfloat16` for backward compatibility?
Still a todo: updating the tests and warning about this. I think this warrants some sort of warning cycle before we switch, as folks may be running from main. However,...
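(A minimal sketch of the trade-off discussed in the three comments above, in plain PyTorch rather than the PR's actual code: mixed precision keeps fp32 master weights, while pure bf16 stores the parameters themselves in bf16.)

```python
import torch

# fp32 master weights (mixed precision): compute may run in bf16, but the
# stored parameters stay fp32 -- 4 bytes/param.
fp32_layer = torch.nn.Linear(4096, 4096)

# bf16 master weights: the parameters themselves are bf16 -- 2 bytes/param,
# halving parameter (and typically optimizer-state) memory at some
# precision cost.
bf16_layer = torch.nn.Linear(4096, 4096).to(torch.bfloat16)

bytes_fp32 = fp32_layer.weight.numel() * fp32_layer.weight.element_size()
bytes_bf16 = bf16_layer.weight.numel() * bf16_layer.weight.element_size()
print(bytes_fp32 // 2**20, "MiB vs", bytes_bf16 // 2**20, "MiB")  # 64 MiB vs 32 MiB
```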
@winglian, weirdly, I'm not getting the VRAM savings seen in the benchmarks. Current **early** wandb results show it's about 20% faster with the same VRAM usage. However, kernel benchmarking showed it using less...
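(One possible reason kernel-level numbers and end-to-end wandb numbers disagree is that reported VRAM depends on what is measured: peak allocated vs reserved, allocator caching, or other buffers dominating the peak. A generic PyTorch sketch for checking peak allocation around a single step — not the benchmark code used here:)

```python
import torch

def peak_vram_mib(step_fn, *args, **kwargs):
    """Run one step and return (output, peak allocated CUDA memory in MiB)."""
    torch.cuda.synchronize()
    torch.cuda.reset_peak_memory_stats()
    out = step_fn(*args, **kwargs)
    torch.cuda.synchronize()
    return out, torch.cuda.max_memory_allocated() / 2**20
```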
Updated the PR from main and added more validation/docs on attention. It is a bit faster than FA in adapter mode. I added a warning that this is not recommended for FFT...
I wonder if this is something we can extend to cover that, or is it unrelated? @djsaunde In the meantime, have you tried any of our cross-entropy optimizations? CCE or...
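(For readers unfamiliar with CCE: Cut Cross Entropy fuses the final projection with the loss so the full logits tensor is never materialized. A hedged sketch, assuming the `cut_cross_entropy` package's `linear_cross_entropy` entry point; the import path and signature are assumptions, not taken from this thread:)

```python
import torch
from cut_cross_entropy import linear_cross_entropy  # import path is an assumption

B, T, D, V = 4, 2048, 4096, 128_256
hidden = torch.randn(B, T, D, device="cuda", dtype=torch.bfloat16)
lm_head = torch.randn(V, D, device="cuda", dtype=torch.bfloat16)
labels = torch.randint(0, V, (B, T), device="cuda")

# A naive loss would first materialize logits of shape (B, T, V) -- roughly
# 2 GiB in bf16 here -- before the cross entropy; the fused kernel skips
# that buffer entirely.
loss = linear_cross_entropy(hidden, lm_head, labels)
```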
@winglian, agreed that `embed_tokens` is not the expensive operation. For that linked feature, we can indeed add it; however, I'm not sure what's the most intuitive way for a...
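(For reference, one existing route to train `embed_tokens` alongside LoRA adapters goes through PEFT's `modules_to_save`, which axolotl surfaces as `lora_modules_to_save` in the YAML; this is a sketch of that route, not the linked feature itself:)

```python
from peft import LoraConfig

# `modules_to_save` fully trains (and saves) the listed modules alongside
# the LoRA adapters, instead of applying low-rank updates to them.
config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    modules_to_save=["embed_tokens", "lm_head"],
)
```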