HUAFOR

Results 7 comments of HUAFOR

Hi, I ran into this problem yesterday while training on a 恒源云 machine and eventually solved it. It is most likely a bug caused by Python 3.8. Here is the fix that worked for me, I hope it helps! Open /lib/python3.8/pkgutil.py and find: `try: importer = sys.path_importer_cache[path_item]` Add one line before that statement: `path_item = os.fsdecode(path_item)` ![image](https://user-images.githubusercontent.com/58834906/219601709-4172cafb-57e6-49b0-b3b0-3b4a0fde4b44.png) That resolves it.
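If editing the standard library file in place is not an option (for example on a shared machine), the same normalization can probably be applied from your own code. The following is a minimal sketch, not the fix described above: it assumes the failure comes from a non-`str` entry reaching `pkgutil.get_importer`, and the wrapper name `get_importer_fsdecoded` is hypothetical. It must run before the code that triggers the error.

```python
import os
import pkgutil

# Keep a reference to the stdlib implementation so the wrapper can delegate to it.
_original_get_importer = pkgutil.get_importer


def get_importer_fsdecoded(path_item):
    """Convert bytes / os.PathLike entries to str, then call the original.

    Hypothetical wrapper: it mirrors the one-line change to pkgutil.py
    (path_item = os.fsdecode(path_item)) without touching the stdlib file.
    """
    return _original_get_importer(os.fsdecode(path_item))


# Other pkgutil helpers look up get_importer by name in the module namespace,
# so rebinding the attribute also affects their internal calls.
pkgutil.get_importer = get_importer_fsdecoded
```

With the wrapper installed, a `bytes` path and the equivalent `str` path resolve to the same cached importer.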

I ran into the same problem. I think you can first enable more detailed logging to locate the cause before running your training command: export TORCH_DISTRIBUTED_DEBUG=DETAIL export DEEPSPEED_LOG_LEVEL=debug export OMPI_MCA_btl_base_verbose=1...
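When the job is started from a Python launcher rather than a shell, the same variables can be set programmatically. This is a rough sketch under assumptions: the variable names and values are copied from the comment above (whether each one is honored depends on the installed PyTorch, DeepSpeed, and Open MPI versions), and the helper name `enable_distributed_debug_logging` is hypothetical. It has to run before the distributed backend is initialized.

```python
import os

# Names and values taken from the comment above; not independently verified
# against any particular library version.
DEBUG_ENV = {
    "TORCH_DISTRIBUTED_DEBUG": "DETAIL",
    "DEEPSPEED_LOG_LEVEL": "debug",
    "OMPI_MCA_btl_base_verbose": "1",
}


def enable_distributed_debug_logging(env=os.environ):
    """Set the debug variables unless the caller already defined them.

    Returns the effective values so they can be logged at startup.
    """
    for name, value in DEBUG_ENV.items():
        env.setdefault(name, value)  # do not override an explicit user setting
    return {name: env[name] for name in DEBUG_ENV}
```

Using `setdefault` keeps any value already exported in the shell, so the helper is safe to call unconditionally at the top of a training script.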

I'm training a diffusion pipeline with DeepSpeed stage 2 on 8 A100 GPUs. The first epoch trains without any problems; however, during the second epoch, the process is...

Thank you for sharing; however, it doesn't work in my case /(ㄒoㄒ)/~~. Anyway, thanks!

Any updates? I have the same issue: some NCCL operations have failed or timed out.

I ran into this problem too! Could the author please reply?