xavierdawn comments

xavierdawn

The issue with the Data Selection Pipeline

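> Hello, I'm not the original author; I'm also trying to reproduce this work.
>
> According to the parameters reported in the paper, the batch size is 128, and the total amount of data across the four datasets combined, divided by the batch size, should be 105. That is where CKPT=105 comes from. Since the default gradient accumulation steps in the sh script is 32, their experiments were presumably run on 4 GPUs. I think you need to check whether your experimental settings are aligned with theirs.

If step 1 is trained on four GPUs, is there anything else that needs to be configured besides setting `torchrun --nproc_per_node 4` and `export CUDA_VISIBLE_DEVICES=4,5,6,7`? At the moment, distributed training does not seem to be actually using multiple GPUs.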
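The batch-size arithmetic in the quoted comment, and a quick way to check what each launched process actually sees, can be sketched as follows. This is only a rough diagnostic, not code from the repository: the helper names `effective_batch_size` and `describe_launch` are hypothetical, and the per-device batch size of 1 is an assumption inferred from 128 = 32 × 4 rather than something stated in the thread. `WORLD_SIZE`, `RANK`, and `LOCAL_RANK` are the environment variables that `torchrun` sets for each worker process.

```python
import os


def effective_batch_size(per_device_batch, grad_accum_steps, world_size):
    """Samples consumed per optimizer step across all processes."""
    return per_device_batch * grad_accum_steps * world_size


def describe_launch(env=os.environ):
    """Summarize what this process sees from the launcher.

    `env` is any mapping of environment variables; it defaults to the
    real environment of the current process.
    """
    visible = env.get("CUDA_VISIBLE_DEVICES", "")
    visible_ids = [d for d in visible.split(",") if d.strip()]
    return {
        # torchrun sets these per worker; the defaults describe a
        # plain single-process run.
        "world_size": int(env.get("WORLD_SIZE", "1")),
        "rank": int(env.get("RANK", "0")),
        "local_rank": int(env.get("LOCAL_RANK", "0")),
        # 0 here means CUDA_VISIBLE_DEVICES is unset or empty.
        "visible_gpus": len(visible_ids),
    }


if __name__ == "__main__":
    info = describe_launch()
    print(info)
    # Assumption: per-device batch size 1, gradient accumulation 32.
    print("effective batch size:",
          effective_batch_size(1, 32, info["world_size"]))
```

Calling `describe_launch()` near the top of the training script under `torchrun --nproc_per_node 4` should report a `world_size` of 4 in every worker. If it reports 1, the script is probably being started without the launcher, or something is overriding the launch settings. Under the assumptions above, a world size of 1 would also shrink the effective batch size from 128 to 32, so the checkpoint step counts would no longer line up with the paper's CKPT=105.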