Jack BAI
Good grief, "just throw money at it"? Tweaking the parameters works just fine.
https://pytorch.org/docs/stable/generated/torch.nn.DataParallel.html
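For reference, a minimal `nn.DataParallel` sketch following the linked docs (the `Linear` model here is just a placeholder, not the project's actual model):

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model)   # splits each batch across visible GPUs
model.to('cuda')

x = torch.randn(64, 128, device='cuda')
out = model(x)                       # outputs gathered back on the default device
```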
Further hint: just change `device='cuda'` to `device='cuda:0'` and the problem goes away.
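A minimal sketch of the suggested change, assuming the ambiguous device string is the culprit (the model below is a stand-in):

```python
import torch

model = torch.nn.Linear(10, 10)
# Before: model.to(device='cuda')   # bare 'cuda' string, device index implicit
model.to(device='cuda:0')           # pin to an explicit GPU index, per the hint
```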
I run four V100s on a single machine with DistributedDataParallel + Apex: about 10 seconds per epoch, and a 5M corpus (0.1B) basically converges within an hour.
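For context, a minimal single-node DDP training sketch, launched with `torchrun --nproc_per_node=4 train.py`; native `torch.cuda.amp` stands in here for Apex mixed precision, and the model and loop are placeholders:

```python
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group(backend='nccl')
    local_rank = int(os.environ['LOCAL_RANK'])
    torch.cuda.set_device(local_rank)

    # Placeholder model; the thread's actual model is not shown.
    model = torch.nn.Linear(512, 512).cuda(local_rank)
    model = DDP(model, device_ids=[local_rank])
    opt = torch.optim.Adam(model.parameters())
    scaler = torch.cuda.amp.GradScaler()

    for _ in range(10):  # placeholder training loop
        x = torch.randn(32, 512, device=local_rank)
        with torch.cuda.amp.autocast():
            loss = model(x).pow(2).mean()
        opt.zero_grad()
        scaler.scale(loss).backward()
        scaler.step(opt)
        scaler.update()

    dist.destroy_process_group()

if __name__ == '__main__':
    main()
```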
It's been added.
This project is mainly built with PyTorch, right? Could you point out where tf is used?
Thanks a lot for your contribution. Could you provide sample snippets for using the hidden states - specifically, what does the returned `hidden_states` vector contain?
Just figured it out - so the hidden states output vector is a **concatenation** of all the hidden states at the last layer. From a functional standpoint I would strongly...
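To make the concatenation concrete, here is a sketch using Hugging Face `transformers` as a stand-in (the patched `return_hidden_states` API discussed in this thread is project-specific, and `gpt2` is just an example checkpoint):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained('gpt2')
model = AutoModel.from_pretrained('gpt2')

inputs = tok('hello world', return_tensors='pt')
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# out.hidden_states is a tuple: embeddings plus one tensor per layer,
# each of shape (batch, seq_len, hidden_dim). The last entry is the
# final layer; flattening it concatenates the per-token states.
last = out.hidden_states[-1]     # (1, seq_len, 768) for gpt2
flat = last.reshape(-1)          # concatenated per-token hidden states
print(len(out.hidden_states), last.shape, flat.shape)
```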
Thanks for the fix. I also found that with `return_hidden_states=True`, GPU memory usage keeps going up when applying your patch and calling `llm.generate`. I guess it can be solved by...
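For what it's worth, one general pattern for this kind of memory creep (a hedged sketch, not necessarily the fix the comment above alludes to) is to detach the returned states and move them off-GPU after each call:

```python
import torch

collected = []

def stash(hidden_states: torch.Tensor) -> None:
    # Detaching drops the autograd graph; moving to CPU lets the CUDA
    # allocation be freed once no other references remain.
    collected.append(hidden_states.detach().to('cpu'))
```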
Confirmed that this fix solved the problem.