HUAFOR comments

Results: 7 comments of HUAFOR

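It runs extremely slowly: for an audio clip of less than 2 minutes, the Face Renderer stage ran for more than 12 hours and then crashed with an exception because it could not allocate 1.26 GiB of memory.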

What is the minimum GPU required to train this model?

Train a model by "python -m torch.distributed.launch --nproc_per_node=2 opengait/main.py --cfgs ./config/baseline/baseline.yaml --phase train"

Hello, I ran into this problem yesterday while training on a 恒源云 cloud machine and eventually solved it. It is most likely a bug caused by Python 3.8. Here is how I fixed it; I hope it helps! In /lib/python3.8/pkgutil.py, find: try: importer = sys.path_importer_cache[path_item] and add one line before it: path_item = os.fsdecode(path_item) ![image](https://user-images.githubusercontent.com/58834906/219601709-4172cafb-57e6-49b0-b3b0-3b4a0fde4b44.png) That solves the problem.
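To illustrate what the added line does, here is a minimal sketch. It assumes the failure comes from a path entry that arrives as bytes (or another path-like object) rather than str; the comment above does not state the cause. The helper name `lookup_importer` and the example path are hypothetical and are not part of pkgutil.

```python
import os
import sys


def lookup_importer(path_item):
    """Hypothetical stand-in for the cache lookup quoted in the comment above."""
    # os.fsdecode turns bytes or os.PathLike values into str and leaves str unchanged,
    # so the cache is always queried with a str key.
    path_item = os.fsdecode(path_item)
    try:
        importer = sys.path_importer_cache[path_item]
    except KeyError:
        # No cached importer for this path; a real implementation would go on
        # to consult sys.path_hooks here.
        importer = None
    return path_item, importer


# Illustrative input only: a bytes path is normalised to str before the lookup.
key, importer = lookup_importer(b"/tmp/example")
print(type(key).__name__, key)
```

Editing a file inside the interpreter's own lib directory affects every program that uses that interpreter, so it may be worth checking first whether a newer Python release makes the change unnecessary.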

[BUG] Multi-GPU Training frozen after finishing first epoch

[BUG] deepspeed inference error on 2 node.

The size of tensor a (64) must match the size of tensor b (128) at non-singleton dimension 1
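For context, a message of this form is what PyTorch reports when two tensors cannot be broadcast together. The pure-Python sketch below mimics the broadcasting check without requiring PyTorch. The function name `broadcast_shape` and the shapes `(2, 64)` and `(2, 128)` are illustrative assumptions, not taken from the issue, and the sketch raises `ValueError` where PyTorch raises `RuntimeError`.

```python
def broadcast_shape(shape_a, shape_b):
    """Return the broadcast result shape, or raise if the shapes are incompatible."""
    ndim = max(len(shape_a), len(shape_b))
    # Left-pad the shorter shape with 1s so both are compared from the trailing dimension.
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for dim, (x, y) in enumerate(zip(a, b)):
        if x == y or y == 1:
            result.append(x)
        elif x == 1:
            result.append(y)
        else:
            # Sizes differ and neither is 1: the dimension cannot be broadcast.
            raise ValueError(
                f"The size of tensor a ({x}) must match the size of "
                f"tensor b ({y}) at non-singleton dimension {dim}"
            )
    return tuple(result)


print(broadcast_shape((2, 1), (2, 128)))  # size-1 dimension is stretched

try:
    broadcast_shape((2, 64), (2, 128))
except ValueError as err:
    print(err)
```

In practice the mismatch usually means one operand was built with a different hidden size or head count than the other, so comparing the two shapes just before the failing operation is a reasonable first step.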