Global mode
支持: https://github.com/Oneflow-Inc/OneTeam/issues/1792
支持创建一个 global 的 context,可以设置开关、placement、sbp,在 global context 下:
- [x] 支持 GlobalTensor.device
- [x] 支持 GlobalTensor.to(device)
- [ ] 支持 src op 如 randn 创建时,可以直接创建出 global tensor,其 placement 和 sbp 可以从 global context 中获取
- [x] randn
- [x] empty
- [x] tensor
- [x] arange
- [ ] 其它后面的 pr 再一起补充
如此可以把 module 的 forward 中的 local 逻辑非入侵的转成 global 的,主要支持 ddp 那种数据并行的执行方式。
支持创建一个 global 的 context,可以设置开关、placement、sbp,在 global context 下:
- [x] 支持 GlobalTensor.device
- [x] 支持 GlobalTensor.to(device)
- [ ] 支持 src op 如 randn 创建时,可以直接创建出 global tensor,其 placement 和 sbp 可以从 global context 中获取
- [x] randn
Speed stats:
GPU Name: GeForce GTX 1080
❌ OneFlow resnet50 time: 141.6ms (= 14158.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 161.5ms (= 16149.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.14 (= 161.5ms / 141.6ms)
OneFlow resnet50 time: 85.9ms (= 8591.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.3ms (= 10226.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 102.3ms / 85.9ms)
OneFlow resnet50 time: 58.4ms (= 11671.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.6ms (= 15511.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.33 (= 77.6ms / 58.4ms)
OneFlow resnet50 time: 44.4ms (= 8881.5ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.4ms (= 13872.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.56 (= 69.4ms / 44.4ms)
OneFlow resnet50 time: 39.3ms (= 7866.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.9ms (= 13584.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.73 (= 67.9ms / 39.3ms)
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9470/
目前stable diffusion遇到的需要转换的算子:
flow.randn()
flow.empty()
flow.tensor()
flow.arange()
Speed stats:
GPU Name: GeForce GTX 1080
❌ OneFlow resnet50 time: 141.5ms (= 14147.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 164.6ms (= 16458.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 164.6ms / 141.5ms)
OneFlow resnet50 time: 86.0ms (= 8603.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.8ms (= 10275.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 102.8ms / 86.0ms)
OneFlow resnet50 time: 58.2ms (= 11647.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.2ms (= 15645.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.34 (= 78.2ms / 58.2ms)
OneFlow resnet50 time: 45.0ms (= 8998.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.0ms (= 14796.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.64 (= 74.0ms / 45.0ms)
OneFlow resnet50 time: 39.6ms (= 7916.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.0ms (= 13592.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.72 (= 68.0ms / 39.6ms)
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9470/
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.
支持测试的方式:
在 oneflow/python/oneflow/test/graph 目录
python3 -m oneflow.distributed.launch --nproc_per_node 2 ./test_graph_with_global.py --failfast --verbose
Speed stats:
GPU Name: GeForce GTX 1080
❌ OneFlow resnet50 time: 141.3ms (= 14128.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.7ms (= 16072.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.14 (= 160.7ms / 141.3ms)
OneFlow resnet50 time: 87.2ms (= 8724.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 101.6ms (= 10161.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 101.6ms / 87.2ms)
OneFlow resnet50 time: 58.8ms (= 11763.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 76.9ms (= 15386.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.31 (= 76.9ms / 58.8ms)
OneFlow resnet50 time: 45.8ms (= 9164.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.7ms (= 13943.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.52 (= 69.7ms / 45.8ms)
OneFlow resnet50 time: 40.6ms (= 8113.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 53.2ms (= 10643.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.31 (= 53.2ms / 40.6ms)
Speed stats:
GPU Name: GeForce GTX 1080
❌ OneFlow resnet50 time: 140.6ms (= 14056.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 163.9ms (= 16386.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.17 (= 163.9ms / 140.6ms)
OneFlow resnet50 time: 85.5ms (= 8545.2ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 111.5ms (= 11146.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.30 (= 111.5ms / 85.5ms)
OneFlow resnet50 time: 58.0ms (= 11592.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 88.0ms (= 17593.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.52 (= 88.0ms / 58.0ms)
OneFlow resnet50 time: 45.2ms (= 9034.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.4ms (= 14484.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.60 (= 72.4ms / 45.2ms)
OneFlow resnet50 time: 40.6ms (= 8124.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 86.8ms (= 17350.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 2.14 (= 86.8ms / 40.6ms)
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.
Speed stats:
GPU Name: GeForce GTX 1080
❌ OneFlow resnet50 time: 140.8ms (= 14081.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 163.7ms (= 16369.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 163.7ms / 140.8ms)
OneFlow resnet50 time: 86.3ms (= 8633.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.0ms (= 10203.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 102.0ms / 86.3ms)
OneFlow resnet50 time: 57.9ms (= 11587.5ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.2ms (= 15433.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.33 (= 77.2ms / 57.9ms)
OneFlow resnet50 time: 43.9ms (= 8775.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.6ms (= 14310.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.63 (= 71.6ms / 43.9ms)
OneFlow resnet50 time: 40.3ms (= 8050.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 73.5ms (= 14706.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.83 (= 73.5ms / 40.3ms)
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9470/
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.
Speed stats:
GPU Name: GeForce GTX 1080
❌ OneFlow resnet50 time: 141.2ms (= 14119.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 164.8ms (= 16484.3ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.17 (= 164.8ms / 141.2ms)
OneFlow resnet50 time: 86.7ms (= 8668.8ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 103.1ms (= 10306.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 103.1ms / 86.7ms)
OneFlow resnet50 time: 58.1ms (= 11622.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.8ms (= 15767.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.36 (= 78.8ms / 58.1ms)
OneFlow resnet50 time: 44.1ms (= 8829.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.2ms (= 15838.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.79 (= 79.2ms / 44.1ms)
OneFlow resnet50 time: 40.4ms (= 8081.9ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.0ms (= 13606.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.68 (= 68.0ms / 40.4ms)
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.
Speed stats:
GPU Name: GeForce GTX 1080
❌ OneFlow resnet50 time: 139.7ms (= 13970.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.9ms (= 16089.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 160.9ms / 139.7ms)
OneFlow resnet50 time: 85.1ms (= 8506.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 106.6ms (= 10660.5ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.25 (= 106.6ms / 85.1ms)
OneFlow resnet50 time: 57.5ms (= 11491.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.5ms (= 15508.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.35 (= 77.5ms / 57.5ms)
OneFlow resnet50 time: 46.0ms (= 9196.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 65.6ms (= 13114.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.43 (= 65.6ms / 46.0ms)
OneFlow resnet50 time: 39.9ms (= 7970.0ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.4ms (= 13678.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.72 (= 68.4ms / 39.9ms)
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9470/
CI failed when running job: cpu-module. PR label automerge has been removed
Speed stats:
GPU Name: GeForce GTX 1080
❌ OneFlow resnet50 time: 140.3ms (= 14026.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 162.6ms (= 16258.4ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.16 (= 162.6ms / 140.3ms)
OneFlow resnet50 time: 84.8ms (= 8480.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 111.5ms (= 11150.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.31 (= 111.5ms / 84.8ms)
OneFlow resnet50 time: 57.8ms (= 11559.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.9ms (= 15588.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.35 (= 77.9ms / 57.8ms)
OneFlow resnet50 time: 43.8ms (= 8766.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.8ms (= 14553.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.66 (= 72.8ms / 43.8ms)
OneFlow resnet50 time: 41.8ms (= 8352.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.9ms (= 14586.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.75 (= 72.9ms / 41.8ms)
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9470/
CI failed when running job: cuda-misc. PR label automerge has been removed