oneflow
Refactor dataloader rdma
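Background
Previously, to work around the conflict between forking worker processes during DataLoader initialization and RDMA (which caused a segmentation fault), the DataLoader called the destory_rdma interface. That usage, however, causes problems in lazy (static graph) mode. For details, see: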
- https://github.com/Oneflow-Inc/OneTeam/issues/1794#issuecomment-1328744312
- https://github.com/Oneflow-Inc/OneTeam/issues/1463#issuecomment-1330246528
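Solution
This PR removes the destory_rdma-related changes from the DataLoader and returns to the earlier approach:
- If RDMA really is needed in the environment, RDMA init must happen after the DataLoader is created. The DataLoader creates its worker processes when it is initialized, so initializing the DataLoader first and using RDMA afterwards avoids the problem;
- An iterator variable was added to control when the iterator is reset, which fixes the data loss problem.
- [x] test case: passed on machine 27, using the libai case provided by @xiezipeng-ML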
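The description does not show how the iterator variable is used. As a rough illustration only, the sketch below shows one way a stored iterator can control the reset timing: calling `iter()` again in the middle of an epoch reuses the live iterator instead of restarting it, so batches are not silently dropped. The class `ResettableLoader` and the attribute `_iterator` are hypothetical names, not OneFlow's actual implementation.

```python
class ResettableLoader:
    """Toy loader that resets its iterator only after an epoch is exhausted.

    Hypothetical illustration; not the DataLoader code changed in this PR.
    """

    def __init__(self, batches):
        self._batches = list(batches)
        self._iterator = None  # control variable: None means "reset allowed"

    def __iter__(self):
        # Create a fresh iterator only when no epoch is in progress.
        if self._iterator is None:
            self._iterator = iter(self._batches)
        return self

    def __next__(self):
        if self._iterator is None:
            self._iterator = iter(self._batches)
        try:
            return next(self._iterator)
        except StopIteration:
            # Epoch finished: clear the variable so the next iter() starts over.
            self._iterator = None
            raise


loader = ResettableLoader(["b0", "b1", "b2", "b3"])
first = next(iter(loader))   # starts the epoch
second = next(iter(loader))  # a second iter() call does not restart the epoch
rest = list(loader)          # drains the remaining batches
next_epoch = list(loader)    # the iterator was reset only after exhaustion
```

With a plain `iter(self._batches)` on every `__iter__` call, the second `iter()` would restart from the first batch; keeping the iterator in a variable makes the reset point explicit.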
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.
Speed stats:
GPU Name: GeForce GTX 1080
❌ OneFlow resnet50 time: 142.5ms (= 14248.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 165.4ms (= 16539.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 165.4ms / 142.5ms)
OneFlow resnet50 time: 87.3ms (= 8733.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 103.2ms (= 10321.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 103.2ms / 87.3ms)
OneFlow resnet50 time: 58.7ms (= 11739.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 85.2ms (= 17036.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.45 (= 85.2ms / 58.7ms)
OneFlow resnet50 time: 45.6ms (= 9127.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.2ms (= 14439.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.58 (= 72.2ms / 45.6ms)
OneFlow resnet50 time: 41.2ms (= 8233.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.0ms (= 15394.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.87 (= 77.0ms / 41.2ms)
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9492/
Speed stats:
GPU Name: GeForce GTX 1080
❌ OneFlow resnet50 time: 140.1ms (= 14014.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 162.6ms (= 16262.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 162.6ms / 140.1ms)
OneFlow resnet50 time: 84.9ms (= 8493.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.6ms (= 10160.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 101.6ms / 84.9ms)
OneFlow resnet50 time: 57.3ms (= 11468.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.5ms (= 15507.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.35 (= 77.5ms / 57.3ms)
OneFlow resnet50 time: 44.0ms (= 8794.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.8ms (= 15751.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.79 (= 78.8ms / 44.0ms)
OneFlow resnet50 time: 41.3ms (= 8256.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.7ms (= 13545.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.64 (= 67.7ms / 41.3ms)
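> If RDMA really is needed in the environment, RDMA init must happen after the DataLoader is created. The DataLoader creates its worker processes when it is initialized, so initializing the DataLoader first and using RDMA afterwards avoids the problem;

It looks like we could add a check when the DataLoader is initialized: if it is about to fork multiple worker processes and RDMA has already been initialized, report it. There are two options:
- raise an error
- emit a warning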
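A minimal sketch of such a check, covering both options, might look like the following. The function name `check_rdma_before_fork` and the `strict` switch are hypothetical, and the RDMA state is passed in as a plain boolean because how the DataLoader would query it is left open here.

```python
import warnings


def check_rdma_before_fork(num_workers, rdma_initialized, strict=False):
    """Flag the risky case: forking workers after RDMA is already initialized.

    Hypothetical helper. `rdma_initialized` is supplied by the caller; how a
    real DataLoader would obtain it is an assumption outside this sketch.
    Returns True when the risky combination was detected.
    """
    if num_workers <= 0 or not rdma_initialized:
        return False  # nothing will be forked, or RDMA is not initialized yet

    message = (
        "RDMA is already initialized, but the DataLoader is about to fork "
        f"{num_workers} worker process(es). Create the DataLoader before "
        "initializing RDMA."
    )
    if strict:
        raise RuntimeError(message)          # option 1: report an error
    warnings.warn(message, RuntimeWarning)   # option 2: report a warning
    return True
```

A warning keeps existing scripts running while pointing at the ordering problem; an error stops the run before the fork happens, which avoids reaching the segmentation fault at all.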
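BERT loss comparison between 0.8.1+cu117.git.d20994afe4 and 0.8.1.dev20221204+cu112, with rdma_enabled turned on for 0.8.1+cu117.git.d20994afe4.
version: 0.8.1+cu117.git.83ca41036d (4-GPU data-parallel training, RDMA enabled) git_commit: 83ca41036d cmake_build_type: RelWithDebInfo rdma: True mlir: False
versus
version: 0.8.1.dev20221204+cu112 (4-GPU data-parallel training, RDMA disabled) git_commit: ad20365 cmake_build_type: Release rdma: True mlir: True
- 200 iterations: (loss curve figure)
- 500 iterations: (loss curve figure)
- 1000 iterations: (loss curve figure)
- 3000 iterations: (loss curve figure)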
@Flowingsun007
Speed stats:
GPU Name: GeForce GTX 1080
❌ OneFlow resnet50 time: 139.6ms (= 13963.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 161.4ms (= 16143.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 161.4ms / 139.6ms)
OneFlow resnet50 time: 85.4ms (= 8544.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.2ms (= 10117.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 101.2ms / 85.4ms)
OneFlow resnet50 time: 58.2ms (= 11638.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 76.4ms (= 15271.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.31 (= 76.4ms / 58.2ms)
OneFlow resnet50 time: 44.2ms (= 8835.4ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.6ms (= 14524.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.64 (= 72.6ms / 44.2ms)
OneFlow resnet50 time: 41.8ms (= 8365.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.6ms (= 14313.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.71 (= 71.6ms / 41.8ms)