HUAFOR
What is the minimum GPU requirement for training this model?
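Hello, I ran into this problem yesterday while training on a 恒源云 (Hengyuan Cloud) machine and eventually solved it. It is most likely a bug caused by Python 3.8. Here is the fix that worked for me; I hope it helps. Open /lib/python3.8/pkgutil.py and find: try: importer = sys.path_importer_cache[path_item] Add one line before it: path_item = os.fsdecode(path_item) That resolves the problem.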
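If editing the installed standard library file is not an option, the same one-line fix can be sketched as a runtime wrapper. This is a minimal sketch, assuming the failure comes from non-str path entries reaching pkgutil.get_importer; the wrapper name get_importer_fsdecoded is hypothetical, and the snippet would have to run before the code that triggers the error.

```python
import os
import pkgutil

# Keep a reference to the original standard library function.
_original_get_importer = pkgutil.get_importer


def get_importer_fsdecoded(path_item):
    """Decode bytes or path-like entries to str before the cache lookup.

    Hypothetical wrapper name; it applies the same os.fsdecode() call that
    the comment above adds by hand inside pkgutil.py.
    """
    return _original_get_importer(os.fsdecode(path_item))


# pkgutil's own helpers look get_importer up by name in the module,
# so rebinding the attribute also affects internal callers.
pkgutil.get_importer = get_importer_fsdecoded
```

On Python versions where get_importer already decodes its argument, the wrapper should be harmless, since os.fsdecode returns a str unchanged.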
I ran into the same problem. I suggest first enabling more detailed logging to locate the cause before running your training command: export TORCH_DISTRIBUTED_DEBUG=DETAIL export DEEPSPEED_LOG_LEVEL=debug export OMPI_MCA_btl_base_verbose=1...
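For launches driven from Python rather than a shell, the same variables can be passed to the child process. This is only a sketch: it uses the three variable names exactly as given in the comment above (the original list is truncated and is not extended here), and the helper names build_debug_env and run_with_debug_logs are hypothetical.

```python
import os
import subprocess

# Variable names and values copied from the comment above; nothing added.
DEBUG_ENV = {
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
    "DEEPSPEED_LOG_LEVEL": "debug",
    "OMPI_MCA_btl_base_verbose": "1",
}


def build_debug_env(base_env=None):
    """Return a copy of the environment with the debug variables set."""
    env = dict(os.environ if base_env is None else base_env)
    env.update(DEBUG_ENV)
    return env


def run_with_debug_logs(command):
    """Run the training command (a list of arguments) with verbose logging."""
    return subprocess.run(command, env=build_debug_env(), check=False)
```

A call would look like run_with_debug_logs([...]) with your own training command as the argument list; the current process environment is left unmodified.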
I'm training a diffusion pipeline using DeepSpeed stage 2 on 8 A100 GPUs. The first epoch runs without problems; however, during the second epoch, the process is...
Thank you for sharing. Unfortunately, it doesn't work in my case /(ㄒoㄒ)/~~. Thanks anyway!
Any updates? I have the same issue: some NCCL operations have failed or timed out.
I ran into this problem too! Could the author please reply?
What is the topic of this discussion?
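Thank you very much for your reply! Would you consider providing the source code (for example, inference_multi.py) used to compute the metrics in each column of the table above (Table 1 in the paper)? Reproducing this part of the logic seems rather difficult for me.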