Jiaxin Shan

Results: 742 comments of Jiaxin Shan

Is there a way to support distributed training? We do not have that many 80G cards.

@davidearlyoung Thanks for all the details. Do you know whether distributed inference works or not? We have some A100-40G cards and would rather not have to fall back to quantization. We are...

@davidearlyoung I really appreciate your informative analysis! Thanks a lot!

> I personally do not know if distributed inference works for grok-1 in pytorch.

Yes! That's my question to the...
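
For anyone following along, this is the kind of multi-GPU, non-quantized inference I have in mind: a minimal sketch using Hugging Face Accelerate's `device_map="auto"` to shard the weights across several 40G cards. The model path is a placeholder, and I have not verified that grok-1 in particular loads this way.

```python
# Minimal sketch: shard a causal LM across all visible GPUs with
# device_map="auto" (requires `accelerate` to be installed).
# The model path is a placeholder, not a real checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_path = "/path/to/model"

tokenizer = AutoTokenizer.from_pretrained(model_path)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,  # keep bf16 weights, no quantization
    device_map="auto",           # split layers across the available GPUs
)

inputs = tokenizer("Hello", return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```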

@martindurant Yes, that's exactly what I want. Do you have any suggestions for those custom protocols? Should we do it downstream or upstream?

> I had this too and fixed the issue by deleting the npx directory (`~/.npm/_npx/$SOME_ID_HERE`).
>
> You should see the path in the error for the relevant directory.

Deleting...

```
ubuntu@192-9-155-93:~/alpaca-lora$ cp /home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cuda117.so /home/ubuntu/.local/lib/python3.8/site-packages/bitsandbytes/libbitsandbytes_cpu.so
```

I met the exact same problem and this works for me.
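
If I understand the workaround correctly, bitsandbytes falls back to loading `libbitsandbytes_cpu.so` when it cannot detect the CUDA runtime, so overwriting the CPU library with the CUDA 11.7 build makes it load the GPU kernels instead.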

@danwei1992 Not yet. Did you encounter the same problem?

@VVNMA Not yet. I will give it another try tomorrow.

@VVNMA and other users who have the exact same issue as me: here's the update.

> Note: OOM issue could be a separate issue, let's talk about it in new threads...

@SeungyounShin How long does it take? Can you also share the training logs? I am blocked at this step:

```
root@5d83a2b86756:~/stanford_alpaca# torchrun --nproc_per_node=4 --master_port=3192 train.py --model_name_or_path /root/models/llama_7B --data_path ./alpaca_data.json --bf16...
```