16 comments by Jiawei Zhao

Thanks for providing your results @WangRongsheng. We are working on efficiency optimization and you can expect a big throughput boost in the next version. For train_loss, did you tune...

Thanks for your interest. We are getting in touch with FSDP team and will update it soon.

Yes, this feature is still in development. Please stay tuned!

Thanks for pointing it out; this is correct. Vanilla Adafactor could face this issue, but GaLore still reduces memory for the momentum version of Adafactor.
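For illustration, a minimal sketch of that setup. It assumes GaLoreAdafactor accepts the same GaLore parameter-group keys documented for GaLoreAdamW (`rank`, `update_proj_gap`, `scale`, `proj_type`) and the usual Adafactor keyword arguments; the model, rank, and hyperparameters below are placeholders, not the repo's exact training config:

```python
import torch.nn as nn
from galore_torch import GaLoreAdafactor

# Placeholder model; in practice these would be the 2D weight matrices of an LLM.
model = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))

# Put 2D weights in the GaLore group: their gradients are projected into a
# rank-r subspace, so the optimizer's first-moment (momentum) state is stored
# at rank r rather than at full parameter size.
galore_params = [p for p in model.parameters() if p.dim() == 2]
galore_ids = {id(p) for p in galore_params}
regular_params = [p for p in model.parameters() if id(p) not in galore_ids]

param_groups = [
    {"params": regular_params},
    {"params": galore_params, "rank": 128, "update_proj_gap": 200,
     "scale": 0.25, "proj_type": "std"},
]

optimizer = GaLoreAdafactor(
    param_groups,
    lr=1e-2,
    relative_step=False,
    scale_parameter=False,
    beta1=0.9,  # enables the momentum (first-moment) state discussed above
)
```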

Hi @winglian, the optimizer module itself supports the latest transformers dependency, but torchrun_main.py is a bit outdated. I will merge your request once we upgrade torchrun_main.py. For now feel safe to...

This PR can be closed, as galore-torch no longer requires a specific transformers version.

Thanks for the suggestion and I will update the training time. @Explorergt92 for training 7B on a single 4090, I think "110 days around" is correct.

Thanks for the integration. I just tried again using the latest datasets version and it worked smoothly on my end. Could it be due to another issue?

Thanks for your interest. Here are the answers to your questions: 1. For the memory reported in Table 4, it follows the same standard as the memory estimate in Table...

The perplexity is measured by taking exp(total_loss), where total_loss is computed by the [evaluation function](https://github.com/jiaweizzhao/GaLore/blob/864eeb361dc96c1932c3fa429ad0119aaed8e617/torchrun_main.py#L476).
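For reference, a minimal sketch of that kind of evaluation loop. The function name, dataloader, and batch format here are illustrative assumptions (it assumes an HF-style causal LM that returns `.loss` when `labels` are passed), not the exact code in torchrun_main.py:

```python
import math
import torch

@torch.no_grad()
def eval_perplexity(model, eval_loader, device):
    """Average the per-batch cross-entropy loss over the eval set,
    then exponentiate: perplexity = exp(total_loss / num_batches)."""
    model.eval()
    total_loss, num_batches = 0.0, 0
    for batch in eval_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        # Causal-LM loss; the model shifts the labels internally.
        loss = model(**batch, labels=batch["input_ids"]).loss
        total_loss += loss.item()
        num_batches += 1
    return math.exp(total_loss / max(num_batches, 1))
```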