oneflow: Program does not exit when simulating multi-machine parallelism on a single machine

Open lmyybh opened this issue 3 years ago • 1 comments

Summary

While testing the following hybrid-parallel script, test.py, intended for 2 machines with 2 cards each:

import oneflow as flow

P01 = flow.placement(type="cpu", ranks=[0, 1])
P23 = flow.placement(type="cpu", ranks=[2, 3])

# First model stage: data-parallel computation on cards 0 and 1
w0 = flow.randn(5, 8, placement=P01, sbp=flow.sbp.broadcast)
# Second model stage: model-parallel computation on cards 2 and 3
w1 = flow.randn(8, 3, placement=P23, sbp=flow.sbp.split(dim=1))

# Randomly generate data to simulate the input;
# the first stage needs the input split for data parallelism
in_stage0 = flow.randn(4, 5, placement=P01, sbp=flow.sbp.split(dim=0))
out_stage0 = flow.matmul(in_stage0, w0)
print(out_stage0.shape) # (4, 8)

# The second stage needs the input restored to a full tensor and moved to cards 2 and 3 for model parallelism
in_stage1 = out_stage0.to_global(placement=P23, sbp=flow.sbp.broadcast)
out_stage1 = flow.matmul(in_stage1, w1)
print(out_stage1.shape) # (4, 3)
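
For readers unfamiliar with the SBP signatures used above, here is a rough plain-Python sketch of the sharding they imply. It does not use OneFlow at all: each "rank" is just a list entry, the helper names (`balanced_ranges`, `split_rows`, `split_cols`, `concat_rows`, `concat_cols`) are made up for illustration, and handing the extra column to the first rank on an uneven split is an assumption, not something stated in this thread.

```python
# Plain-Python illustration of the sharding in test.py (no OneFlow involved).

def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def balanced_ranges(n, parts):
    """Cut range(n) into `parts` contiguous pieces; earlier pieces get the remainder (assumed)."""
    base, extra = divmod(n, parts)
    ranges, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        ranges.append((start, start + size))
        start += size
    return ranges

def split_rows(m, parts):
    """Analogue of split(dim=0): each rank holds a block of rows."""
    return [m[a:b] for a, b in balanced_ranges(len(m), parts)]

def split_cols(m, parts):
    """Analogue of split(dim=1): each rank holds a block of columns."""
    return [[row[a:b] for row in m] for a, b in balanced_ranges(len(m[0]), parts)]

def concat_rows(shards):
    return [row for shard in shards for row in shard]

def concat_cols(shards):
    return [sum(rows, []) for rows in zip(*shards)]

def shape(m):
    return (len(m), len(m[0]))

# Deterministic integer stand-ins for the random tensors in test.py.
x = [[r * 5 + c for c in range(5)] for r in range(4)]   # (4, 5) input
w0 = [[r + c for c in range(8)] for r in range(5)]      # (5, 8) stage-0 weight
w1 = [[r - c for c in range(3)] for r in range(8)]      # (8, 3) stage-1 weight

# Stage 0 (ranks 0, 1): input split by rows, weight replicated on both ranks.
out0_shards = [matmul(shard, w0) for shard in split_rows(x, 2)]

# Moving to broadcast on ranks 2, 3: every rank gets the full (4, 8) tensor.
in_stage1 = concat_rows(out0_shards)
print(shape(in_stage1))   # (4, 8)

# Stage 1 (ranks 2, 3): input replicated, weight split by columns.
out1_shards = [matmul(in_stage1, shard) for shard in split_cols(w1, 2)]
out_stage1 = concat_cols(out1_shards)
print(shape(out_stage1))  # (4, 3)
```

The logical shapes match the `(4, 8)` and `(4, 3)` comments in the script, and the reassembled result equals the unsharded product.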

Since I do not have two machines, I tried opening two terminals on one machine and running the following commands:

In the first terminal:

python3 -m oneflow.distributed.launch --nnodes=2 --node_rank=0 --nproc_per_node=2 --master_addr="127.0.0.1" --master_port=7788 test.py

In the second terminal:

python3 -m oneflow.distributed.launch --nnodes=2 --node_rank=1 --nproc_per_node=2 --master_addr="127.0.0.1" --master_port=7788 test.py
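
To make the intent of the two commands explicit, the sketch below computes which global ranks each launcher invocation is expected to start. It assumes the common convention `rank = node_rank * nproc_per_node + local_rank`; the function name `launcher_ranks` is hypothetical, and the thread does not confirm that oneflow.distributed.launch uses exactly this mapping.

```python
def launcher_ranks(nnodes, node_rank, nproc_per_node):
    """Global ranks one launcher invocation would start (assumed mapping)."""
    if not 0 <= node_rank < nnodes:
        raise ValueError("node_rank must be in [0, nnodes)")
    first = node_rank * nproc_per_node
    return list(range(first, first + nproc_per_node))

# The two terminals from the commands above: --nnodes=2 --nproc_per_node=2.
terminal_1 = launcher_ranks(nnodes=2, node_rank=0, nproc_per_node=2)
terminal_2 = launcher_ranks(nnodes=2, node_rank=1, nproc_per_node=2)
world_size = len(terminal_1) + len(terminal_2)

print(terminal_1)  # [0, 1]
print(terminal_2)  # [2, 3]
print(world_size)  # 4
```

Under that assumption, the first terminal covers the ranks in `P01` and the second covers the ranks in `P23`.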

The program runs through to the final print but does not exit. The result is shown in the figure:

System Information

Cloud platform configuration: torch-1.9.0-cu11.1-cudnn8 v1.9.0 + 4core-14Gi-P40(1Card)

What is your OneFlow installation (pip, source, dockerhub): pip
OS:
OneFlow version (run python3 -m oneflow --doctor): 0.8.1.dev20220807+cu112
Python version: 3.7.7
CUDA driver version: 470.82.01 (cuda 11.4)
GPU models:
Other info:

Aug 08 '22 09:08 lmyybh

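This is a 2-machine × 2-device job, but here it is being run by starting two launcher processes on a single machine. That behavior is undefined.

As for whether this unusual way of running should be supported: arguably it is fine to simply leave it unsupported.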

Aug 08 '22 09:08 strint
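
One possible way to exercise the same four ranks on one machine without running two launchers is a single launcher invocation with four processes. The sketch below only builds that command line; it is an assumption, not something proposed or verified in this thread, and whether it avoids the hang at exit is unknown. The helper name `single_node_launch_cmd` is hypothetical, and only flags that already appear in the original commands are used.

```python
import shlex

def single_node_launch_cmd(script, nproc_per_node, master_port=7788):
    """Build (but do not run) a one-launcher command for a single machine."""
    return [
        "python3", "-m", "oneflow.distributed.launch",
        "--nnodes=1",
        "--node_rank=0",
        "--nproc_per_node=%d" % nproc_per_node,
        "--master_addr=127.0.0.1",
        "--master_port=%d" % master_port,
        script,
    ]

cmd = single_node_launch_cmd("test.py", nproc_per_node=4)
cmd_str = " ".join(shlex.quote(arg) for arg in cmd)
print(cmd_str)
```

Because test.py places its tensors on `cpu`, four processes on one machine should in principle be able to cover ranks 0 through 3, though this has not been tested here.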
