Fix loss scale precision

Open leaves-zwx opened this issue 3 years ago • 4 comments

  • Fix the improper precision conversion of the loss scale
  • Add amp_black_identity so that the precision of gray nodes can be controlled at the appropriate point

leaves-zwx • Sep 21 '22 17:09

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 140.5ms (= 14045.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 160.4ms (= 16041.2ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.14 (= 160.4ms / 140.5ms)

OneFlow resnet50 time: 85.9ms (= 8594.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.3ms (= 10229.1ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 102.3ms / 85.9ms)

OneFlow resnet50 time: 58.4ms (= 11680.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.9ms (= 15583.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.33 (= 77.9ms / 58.4ms)

OneFlow resnet50 time: 45.1ms (= 9018.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 80.2ms (= 16040.7ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.78 (= 80.2ms / 45.1ms)

OneFlow resnet50 time: 41.0ms (= 8206.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 67.5ms (= 13503.9ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.65 (= 67.5ms / 41.0ms)

github-actions[bot] • Sep 22 '22 13:09
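
The speed lines in the bot report above follow a simple pattern: each mean is the total time divided by the iteration count, and the relative speed is the PyTorch mean divided by the OneFlow mean. A minimal sketch that recomputes the first pair of lines, assuming the report rounds each mean to one decimal place before dividing (the function names are illustrative and not taken from the CI script):

```python
def mean_ms(total_ms: float, iterations: int) -> float:
    """Mean time per iteration, rounded to 0.1 ms as shown in the report."""
    return round(total_ms / iterations, 1)


def relative_speed(pytorch_ms: float, oneflow_ms: float) -> float:
    """Values above 1.0 mean OneFlow took less time per iteration than PyTorch."""
    return pytorch_ms / oneflow_ms


# Totals taken from the first report (input_shape=[16, 3, 224, 224], 100 iterations).
oneflow = mean_ms(14045.2, 100)
pytorch = mean_ms(16041.2, 100)
ratio = relative_speed(pytorch, oneflow)
print(f"OneFlow {oneflow:.1f}ms, PyTorch {pytorch:.1f}ms, relative speed {ratio:.2f}")
```

Read this way, every batch size in the report shows a ratio above 1.0, with the gap widening as the batch size shrinks.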

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9126/

github-actions[bot] • Sep 22 '22 15:09

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 140.6ms (= 14056.7ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 167.6ms (= 16760.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 167.6ms / 140.6ms)

OneFlow resnet50 time: 85.8ms (= 8578.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.2ms (= 10218.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.19 (= 102.2ms / 85.8ms)

OneFlow resnet50 time: 58.1ms (= 11619.1ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 87.1ms (= 17430.0ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.50 (= 87.1ms / 58.1ms)

OneFlow resnet50 time: 44.4ms (= 8880.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.7ms (= 14935.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.68 (= 74.7ms / 44.4ms)

OneFlow resnet50 time: 41.4ms (= 8285.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 77.9ms (= 15587.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.88 (= 77.9ms / 41.4ms)

github-actions[bot] • Sep 22 '22 16:09

CI failed when running job: cuda-misc. PR label automerge has been removed

github-actions[bot] • Sep 22 '22 16:09

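The unit test test_graph_zero.py aborted. The failure is at https://github.com/Oneflow-Inc/oneflow/blob/7c3e9a3bd6684fc2bcef960135c9e60f45d21204/oneflow/core/job_rewriter/autograd.cpp#L776-L777, where loss_diff is multiplied by loss_scale and their dtypes do not match.

The root cause is at https://github.com/Oneflow-Inc/oneflow/blob/7c3e9a3bd6684fc2bcef960135c9e60f45d21204/oneflow/core/job_rewriter/autograd.cpp#L738: when a static loss scale is configured, the generated loss scale and the loss diff are both float16, so the loss diff no longer needs to be cast.

I do have a question here, though: is float16 an appropriate output dtype for the constant_like_op that produces the static loss scale? What happens if the configured value exceeds the range float16 can represent? And if we want it to be float32, which tensor should constant_like use as its `like` input? The loss and the loss diff are themselves float16.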

leaves-zwx • Sep 23 '22 08:09
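
The range question raised in the comment above can be checked without a GPU. IEEE 754 half precision tops out at 65504, so a static loss scale such as 65536 cannot be held in a float16 tensor. The sketch below uses Python's `struct` module (its `e` format is half precision) purely as an illustration. `fits_in_float16`, `to_float16`, and `pick_loss_scale_dtype` are hypothetical helpers, not OneFlow APIs, and the float32 fallback is one possible policy rather than what this PR implements.

```python
import struct

FP16_MAX = 65504.0  # largest finite IEEE 754 half-precision value


def fits_in_float16(value: float) -> bool:
    """Return True if value can be packed as a finite float16."""
    try:
        struct.pack("<e", value)  # raises OverflowError when out of range
        return True
    except OverflowError:
        return False


def to_float16(value: float) -> float:
    """Round-trip value through float16 to expose the precision loss."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def pick_loss_scale_dtype(loss_scale: float) -> str:
    """Hypothetical policy: use float32 when float16 cannot hold the scale."""
    return "float16" if fits_in_float16(loss_scale) else "float32"


print(fits_in_float16(FP16_MAX))         # largest finite value still fits
print(fits_in_float16(2.0 ** 16))        # 65536 is out of range
print(to_float16(1000.1))                # rounded to the nearest float16 value
print(pick_loss_scale_dtype(2.0 ** 16))
```

Under this assumed policy, a scale such as 1024 stays in float16, while 65536 is kept in float32, in which case the float16 loss diff would have to be cast up before the multiply.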

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 140.0ms (= 14004.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 161.6ms (= 16164.1ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 161.6ms / 140.0ms)

OneFlow resnet50 time: 85.6ms (= 8556.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.8ms (= 10278.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.20 (= 102.8ms / 85.6ms)

OneFlow resnet50 time: 58.7ms (= 11733.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 82.9ms (= 16588.7ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.41 (= 82.9ms / 58.7ms)

OneFlow resnet50 time: 45.0ms (= 9002.0ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.8ms (= 14152.2ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.57 (= 70.8ms / 45.0ms)

OneFlow resnet50 time: 41.4ms (= 8277.1ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 76.1ms (= 15216.5ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.84 (= 76.1ms / 41.4ms)

github-actions[bot] • Sep 23 '22 12:09

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/9126/

github-actions[bot] • Sep 23 '22 12:09

CI failed when running job: cpu-misc. PR label automerge has been removed

github-actions[bot] • Sep 23 '22 13:09

The CI test appears to have deadlocked in the CPU build of oneflow. What could be causing this?

leaves-zwx • Sep 23 '22 15:09

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 140.4ms (= 14040.6ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 161.6ms (= 16159.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 161.6ms / 140.4ms)

OneFlow resnet50 time: 85.6ms (= 8565.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 111.2ms (= 11124.0ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.30 (= 111.2ms / 85.6ms)

OneFlow resnet50 time: 58.1ms (= 11617.6ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.5ms (= 15693.4ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.35 (= 78.5ms / 58.1ms)

OneFlow resnet50 time: 44.9ms (= 8976.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 72.0ms (= 14391.6ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.60 (= 72.0ms / 44.9ms)

OneFlow resnet50 time: 39.6ms (= 7914.6ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 68.4ms (= 13685.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.73 (= 68.4ms / 39.6ms)

github-actions[bot] • Sep 23 '22 16:09