oneflow
Add double grad for broadcast_matmul_grad_b op
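- Add the second-order derivative (double grad) operator for broadcast_matmul_grad_b.
- For the matrix multiplication a * b:
- The gradient with respect to b is computed by broadcast_matmul_grad_b, so a separate second-order derivative operator has to be implemented for it.
- The gradient with respect to a is computed by matmul, which is already closed under differentiation (its backward is expressed with matmul itself), so no separate second-order derivative operator is needed. In the broadcast case, the first-order derivative of reduce_sum_like is also required, see https://github.com/Oneflow-Inc/oneflow/issues/8831.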
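
The relationship described above can be sketched in plain Python. This is only an illustration under stated assumptions, not OneFlow's implementation: it assumes `a` has shape [batch, m, k], `b` has shape [k, n] and is broadcast over the batch, and no transpose flags are set. The function names `grad_b` and `grad_b_double_grad` are hypothetical. Because the gradient with respect to b is bilinear in `a` and the upstream gradient `dy`, its own derivatives reduce to ordinary matrix products.

```python
# Illustrative sketch only; tensors are nested lists, names are hypothetical.
# Assumed shapes: a [batch, m, k], dy [batch, m, n], ddb [k, n].

def matmul(x, y):
    """2-D matrix product of nested lists."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*y)] for row in x]

def transpose(x):
    """Transpose of a 2-D nested list."""
    return [list(col) for col in zip(*x)]

def add(x, y):
    """Element-wise sum of two 2-D nested lists."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]

def grad_b(a, dy):
    """First-order gradient w.r.t. b for out[i] = a[i] @ b:
    db = sum over the batch of a[i]^T @ dy[i], shape [k, n]."""
    k, n = len(a[0][0]), len(dy[0][0])
    db = [[0.0] * n for _ in range(k)]
    for a_i, dy_i in zip(a, dy):
        db = add(db, matmul(transpose(a_i), dy_i))
    return db

def grad_b_double_grad(a, dy, ddb):
    """Gradients of grad_b w.r.t. its inputs, given ddb (the gradient
    flowing into db):
      w.r.t. a[i]:  dy[i] @ ddb^T  -> [m, k]
      w.r.t. dy[i]: a[i] @ ddb     -> [m, n]
    """
    ddb_t = transpose(ddb)
    da = [matmul(dy_i, ddb_t) for dy_i in dy]
    ddy = [matmul(a_i, ddb) for a_i in a]
    return da, ddy

# Small example: batch=2, m=2, k=2, n=1.
a = [[[1.0, 2.0], [3.0, 4.0]], [[0.5, -1.0], [2.0, 0.0]]]
dy = [[[1.0], [2.0]], [[-1.0], [0.5]]]
ddb = [[1.0], [3.0]]

db = grad_b(a, dy)
da, ddy = grad_b_double_grad(a, dy, ddb)
```

Both second-order results are plain (batched) matrix products, which is consistent with the note that the matmul path needs no extra operator while the reduction over the batch in `grad_b` does.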
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.
CI failed when running job: cuda-benchmark. PR label automerge has been removed
Speed stats:
GPU Name: GeForce GTX 1080
✔️ OneFlow resnet50 time: 128.4ms (= 12838.0ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 142.5ms (= 14254.9ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.11 (= 142.5ms / 128.4ms)
OneFlow resnet50 time: 75.5ms (= 7550.7ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 86.7ms (= 8668.0ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.15 (= 86.7ms / 75.5ms)
OneFlow resnet50 time: 48.4ms (= 9679.1ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 58.2ms (= 11632.3ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.20 (= 58.2ms / 48.4ms)
OneFlow resnet50 time: 36.0ms (= 7208.3ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.8ms (= 8955.5ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.24 (= 44.8ms / 36.0ms)
OneFlow resnet50 time: 28.2ms (= 5635.5ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 36.5ms (= 7291.3ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.29 (= 36.5ms / 28.2ms)
OneFlow swin dataloader time: 0.261s (= 52.299s / 200, num_workers=1)
PyTorch swin dataloader time: 0.150s (= 30.083s / 200, num_workers=1)
Relative speed: 0.575 (= 0.150s / 0.261s)
OneFlow swin dataloader time: 0.073s (= 14.658s / 200, num_workers=4)
PyTorch swin dataloader time: 0.041s (= 8.102s / 200, num_workers=4)
Relative speed: 0.553 (= 0.041s / 0.073s)
OneFlow swin dataloader time: 0.059s (= 11.889s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.328s / 200, num_workers=8)
Relative speed: 0.364 (= 0.022s / 0.059s)
❌ OneFlow resnet50 time: 136.6ms (= 13660.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.6ms (= 16064.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 160.6ms / 136.6ms)
OneFlow resnet50 time: 84.8ms (= 8481.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.1ms (= 10205.6ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 102.1ms / 84.8ms)
OneFlow resnet50 time: 58.0ms (= 11607.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 80.5ms (= 16095.8ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.39 (= 80.5ms / 58.0ms)
OneFlow resnet50 time: 45.5ms (= 9105.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 69.8ms (= 13952.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.53 (= 69.8ms / 45.5ms)
OneFlow resnet50 time: 38.9ms (= 7773.7ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 66.1ms (= 13224.2ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.70 (= 66.1ms / 38.9ms)
CI failed when running job: cuda-misc. PR label automerge has been removed
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8844/
Speed stats:
GPU Name: GeForce GTX 1080
✔️ OneFlow resnet50 time: 128.7ms (= 12865.8ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 141.7ms (= 14171.8ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.10 (= 141.7ms / 128.7ms)
OneFlow resnet50 time: 75.7ms (= 7573.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.6ms (= 8456.5ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.12 (= 84.6ms / 75.7ms)
OneFlow resnet50 time: 49.1ms (= 9821.1ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 56.4ms (= 11275.9ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.15 (= 56.4ms / 49.1ms)
OneFlow resnet50 time: 36.4ms (= 7277.5ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 41.1ms (= 8223.6ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.13 (= 41.1ms / 36.4ms)
OneFlow resnet50 time: 28.3ms (= 5664.3ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 43.3ms (= 8663.1ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.53 (= 43.3ms / 28.3ms)
OneFlow swin dataloader time: 0.254s (= 50.834s / 200, num_workers=1)
PyTorch swin dataloader time: 0.148s (= 29.676s / 200, num_workers=1)
Relative speed: 0.584 (= 0.148s / 0.254s)
OneFlow swin dataloader time: 0.071s (= 14.236s / 200, num_workers=4)
PyTorch swin dataloader time: 0.041s (= 8.213s / 200, num_workers=4)
Relative speed: 0.577 (= 0.041s / 0.071s)
OneFlow swin dataloader time: 0.038s (= 7.673s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.358s / 200, num_workers=8)
Relative speed: 0.568 (= 0.022s / 0.038s)
❌ OneFlow resnet50 time: 136.9ms (= 13686.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.2ms (= 16021.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.17 (= 160.2ms / 136.9ms)
OneFlow resnet50 time: 84.8ms (= 8479.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.5ms (= 10245.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.21 (= 102.5ms / 84.8ms)
OneFlow resnet50 time: 58.6ms (= 11727.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.5ms (= 15701.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.34 (= 78.5ms / 58.6ms)
OneFlow resnet50 time: 45.2ms (= 9044.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.3ms (= 14264.1ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.58 (= 71.3ms / 45.2ms)
OneFlow resnet50 time: 39.1ms (= 7813.9ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.2ms (= 15432.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.98 (= 77.2ms / 39.1ms)
CI failed when running job: cpu-misc. PR label automerge has been removed
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8844/
Speed stats:
GPU Name: GeForce GTX 1080
✔️ OneFlow resnet50 time: 128.3ms (= 12833.0ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 142.7ms (= 14269.3ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.11 (= 142.7ms / 128.3ms)
OneFlow resnet50 time: 75.4ms (= 7542.7ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 84.3ms (= 8429.3ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.12 (= 84.3ms / 75.4ms)
OneFlow resnet50 time: 48.3ms (= 9667.7ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 60.0ms (= 12007.5ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.24 (= 60.0ms / 48.3ms)
OneFlow resnet50 time: 36.1ms (= 7222.7ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 44.0ms (= 8800.7ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.22 (= 44.0ms / 36.1ms)
OneFlow resnet50 time: 28.6ms (= 5714.5ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 37.4ms (= 7485.5ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.31 (= 37.4ms / 28.6ms)
OneFlow swin dataloader time: 0.257s (= 51.381s / 200, num_workers=1)
PyTorch swin dataloader time: 0.148s (= 29.600s / 200, num_workers=1)
Relative speed: 0.576 (= 0.148s / 0.257s)
OneFlow swin dataloader time: 0.071s (= 14.258s / 200, num_workers=4)
PyTorch swin dataloader time: 0.040s (= 8.082s / 200, num_workers=4)
Relative speed: 0.567 (= 0.040s / 0.071s)
OneFlow swin dataloader time: 0.038s (= 7.699s / 200, num_workers=8)
PyTorch swin dataloader time: 0.022s (= 4.420s / 200, num_workers=8)
Relative speed: 0.574 (= 0.022s / 0.038s)
❌ OneFlow resnet50 time: 136.5ms (= 13645.5ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 167.2ms (= 16723.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.23 (= 167.2ms / 136.5ms)
OneFlow resnet50 time: 84.3ms (= 8430.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 111.9ms (= 11186.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.33 (= 111.9ms / 84.3ms)
OneFlow resnet50 time: 57.8ms (= 11553.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.2ms (= 15835.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.37 (= 79.2ms / 57.8ms)
OneFlow resnet50 time: 44.8ms (= 8967.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 80.5ms (= 16094.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.79 (= 80.5ms / 44.8ms)
OneFlow resnet50 time: 38.7ms (= 7744.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 79.6ms (= 15925.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 2.06 (= 79.6ms / 38.7ms)