Casper comments

Results 291 comments of Casper

Issue with Multi-node training

Multi-node GRPO only works with `ray job submit -- python3 -u -m verl.trainer.main_ppo ...`

Multi-Node Quantization using Ray?

Hi @paolovic, at the moment this is not something explicitly supported or even something that I have attempted. I suspect it could be possible, but it's not something that I...

[BUG] OOM when train 70B models using deepspeed 0.16.4

Any update on this? Lots of people are waiting on this to be resolved, so they can upgrade to use the new AutoTP for additional optimization in their training

raise Exception (the loss increases to NAN ) when quantilizing DeepSeek-V2-chat using the new version of AutoAWQ in the sub-iteration (18/60)

Hi @BinFuPKU, thanks for raising the issue. I will need to further investigate what causes this, but I can see it will not be easy to debug since the model...

raise Exception (the loss increases to NAN ) when quantilizing DeepSeek-V2-chat using the new version of AutoAWQ in the sub-iteration (18/60)

@Kk1984up try upgrading to the newest version

cant import awq

I think you may have an issue with your torch installation. Try to reinstall torch

Feature Request: Support MS-AMP

+++ would love to see MS-AMP supported. Currently, H100s are on par with A100s cost-wise even with the current FP8 implementation, but if MS-AMP FP8 can be implemented, it is...

Feature Request: Support MS-AMP

Shouldn’t the FLOPs increase, thereby reducing training time? It should not be present on small models, but if you take a 30B, I would be surprised if you don’t...

about the shape of qzeros in awq quantization model

@MuYu-zhi please check out the gemm linear module. All weights are packed in a special way that is related to execution of CUDA kernels.
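
To illustrate why a packed tensor such as qzeros can look smaller than expected, here is a minimal sketch of packing 4-bit values into 32-bit words, eight per word. It is not AutoAWQ's actual code: the function names `pack_row` and `unpack_row` are hypothetical, and the interleave `ORDER` is an assumed layout chosen for illustration, so check the gemm linear module for the real one.

```python
# Illustrative sketch only; names and layout are assumptions, not AutoAWQ's API.
PACK_FACTOR = 8  # 32 bits per word / 4 bits per value

# Assumed interleave order: slot i of each word holds element ORDER[i] of the chunk.
ORDER = [0, 2, 4, 6, 1, 3, 5, 7]


def pack_row(values, order=ORDER):
    """Pack a row of 4-bit integers into 32-bit words, eight values per word."""
    if len(values) % PACK_FACTOR != 0:
        raise ValueError("row length must be a multiple of 8")
    packed = []
    for start in range(0, len(values), PACK_FACTOR):
        word = 0
        for slot, src in enumerate(order):
            v = values[start + src]
            if not 0 <= v < 16:
                raise ValueError("each value must fit in 4 bits")
            # Place the value in its 4-bit slot inside the word.
            word |= v << (4 * slot)
        packed.append(word)
    return packed


def unpack_row(packed, order=ORDER):
    """Invert pack_row: recover the original 4-bit integers."""
    values = []
    for word in packed:
        chunk = [0] * PACK_FACTOR
        for slot, src in enumerate(order):
            chunk[src] = (word >> (4 * slot)) & 0xF
        values.extend(chunk)
    return values


row = list(range(16))      # 16 four-bit values
packed = pack_row(row)     # 2 packed words
restored = unpack_row(packed)
```

Under these assumptions, a row of N four-bit values is stored as N / 8 integers, which is why one dimension of a packed tensor shrinks by a factor of 8.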

I made an initial attempt that did not work. https://github.com/casper-hansen/AutoAWQ/compare/main...gemma2. Unfortunately, I do not have enough time at the moment to do further research on how to support the new...