16 comments by Jiawei Zhao

Thanks for providing your results @WangRongsheng. We are working on efficiency optimization and you can expect a big throughput boost in the next version. For train_loss, did you tune...

Thanks for your interest. We are getting in touch with FSDP team and will update it soon.

Yes, this feature is still in development. Please stay tuned!

Thanks for pointing it out; this is correct. Vanilla Adafactor could face this issue, but GaLore still reduces memory for the momentum version of Adafactor.
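For illustration, a minimal sketch of that setup. It assumes GaLoreAdafactor accepts the same GaLore parameter-group keys documented for GaLoreAdamW (`rank`, `update_proj_gap`, `scale`, `proj_type`) and the usual Adafactor keyword arguments; the model, rank, and hyperparameters below are placeholders, not the repo's exact training config:

```python
import torch.nn as nn
from galore_torch import GaLoreAdafactor

# Placeholder model; in practice these would be the 2D weight matrices of an LLM.
model = nn.Sequential(nn.Linear(512, 512), nn.Linear(512, 512))

# Put 2D weights in the GaLore group: their gradients are projected into a
# rank-r subspace, so the optimizer's first-moment (momentum) state is stored
# at rank r rather than at full parameter size.
galore_params = [p for p in model.parameters() if p.dim() == 2]
galore_ids = {id(p) for p in galore_params}
regular_params = [p for p in model.parameters() if id(p) not in galore_ids]

param_groups = [
    {"params": regular_params},
    {"params": galore_params, "rank": 128, "update_proj_gap": 200,
     "scale": 0.25, "proj_type": "std"},
]

optimizer = GaLoreAdafactor(
    param_groups,
    lr=1e-2,
    relative_step=False,
    scale_parameter=False,
    beta1=0.9,  # enables the momentum (first-moment) state discussed above
)
```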

Hi @winglian, the optimizer module itself supports the latest transformers dependency, but torchrun_main.py is a bit outdated. I will merge your request once we upgrade torchrun_main.py. For now feel safe to...

This PR can be closed, as galore-torch no longer requires a specific transformers version.

Thanks for the suggestion and I will update the training time. @Explorergt92 for training 7B on a single 4090, I think "110 days around" is correct.

Thanks for the integration. I just tried again using the latest datasets version and it worked smoothly on my end. Could it be due to another issue?

Thanks for your interest. Here are the answers to your questions: 1. For the memory reported in Table 4, it follows the same standard as the memory estimate in Table...

The perplexity is measured by taking exp(total_loss), where total_loss is computed by the [evaluation function](https://github.com/jiaweizzhao/GaLore/blob/864eeb361dc96c1932c3fa429ad0119aaed8e617/torchrun_main.py#L476).
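For reference, a minimal sketch of that kind of evaluation loop. The function name, dataloader, and batch format here are illustrative assumptions (it assumes an HF-style causal LM that returns `.loss` when `labels` are passed), not the exact code in torchrun_main.py:

```python
import math
import torch

@torch.no_grad()
def eval_perplexity(model, eval_loader, device):
    """Average the per-batch cross-entropy loss over the eval set,
    then exponentiate: perplexity = exp(total_loss / num_batches)."""
    model.eval()
    total_loss, num_batches = 0.0, 0
    for batch in eval_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        # Causal-LM loss; the model shifts the labels internally.
        loss = model(**batch, labels=batch["input_ids"]).loss
        total_loss += loss.item()
        num_batches += 1
    return math.exp(total_loss / max(num_batches, 1))
```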