Mayank Mishra
I haven't tried adjusting the input tokens, @thies1006. But I can confirm that I ran with input text = "Hello" and generated tokens of 10, 50, 100, 300, 500, 1000, 2000, 5000....
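For reference, a minimal sketch of how such a sweep could look with `generate`; the checkpoint name and setup below are placeholders, not the exact benchmark script.

```python
# Hypothetical sketch of the generation-length sweep described above.
# The model name and hardware setup are assumptions, not the actual benchmark code.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"  # assumption: the BLOOM checkpoint under discussion
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.float16, device_map="auto"
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
for n in [10, 50, 100, 300, 500, 1000, 2000, 5000]:
    out = model.generate(**inputs, max_new_tokens=n)
    print(n, tokenizer.decode(out[0], skip_special_tokens=True)[:80])
```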
I see, @pai4451. I'll give it a shot.
@RezaYazdaniAminabadi any follow-up on this? I am facing similar CUDA issues with longer input sequence lengths.
@RezaYazdaniAminabadi I am also not sure, but BLOOM is trained using ALiBi, so ideally there should be no limit. I understand that this might not be possible. But GPT-3 allowed input...
@pai4451 You can't use it that way. Please refer to this config: https://www.deepspeed.ai/docs/config-json/#weight-quantization Let me know if it works ;)
As an alternative, you can use it in HuggingFace too. I haven't tried it either though.
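In case it helps, here is a minimal sketch of what the HuggingFace route might look like, assuming the 8-bit path via bitsandbytes (`load_in_8bit`) is what's meant; as said, I haven't verified this on BLOOM myself.

```python
# Sketch only: assumes transformers + accelerate + bitsandbytes are installed,
# and that the bitsandbytes int8 path is the HuggingFace alternative meant above.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "bigscience/bloom"  # placeholder checkpoint
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",   # shard across available GPUs
    load_in_8bit=True,   # int8 weights via bitsandbytes
)

inputs = tokenizer("Hello", return_tensors="pt").to(0)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0]))
```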
@pai4451 You can use these instructions for quantization: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/328#discussion_r954402510 However, this is a barebones script. I would encourage you to wait for this PR: https://github.com/bigscience-workshop/Megatron-DeepSpeed/pull/328 Planning to add server + CLI...
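Roughly, the int8 path boils down to a `deepspeed.init_inference` call like the one below; this is a sketch of the general approach under the assumptions noted in the comments, not the exact script from the PR.

```python
# Rough sketch of int8 DeepSpeed-inference for BLOOM; argument values are
# illustrative, and the linked barebones script / PR is the authoritative version.
import os
import torch
import deepspeed
from transformers import AutoModelForCausalLM

world_size = int(os.getenv("WORLD_SIZE", "1"))  # launched via the deepspeed launcher
model = AutoModelForCausalLM.from_pretrained(
    "bigscience/bloom", torch_dtype=torch.float16
)

model = deepspeed.init_inference(
    model,
    mp_size=world_size,              # tensor-parallel degree
    dtype=torch.int8,                # quantized inference
    replace_with_kernel_inject=True, # use DeepSpeed's fused inference kernels
)
```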
Quantization with int8 requires knowledge distillation and might need significant compute. Read the ZeroQuant paper. I would suggest getting internet access on the node if you can. I don't...
Also, can you provide me the DeepSpeed config you use to run on 16 GPUs? I don't know how to reshard for pipeline parallelism. Do you save the resharded weights?...
This is still a WIP, @stas00.