fix bug in cosine_similarity when inputs have different dims
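Background: https://github.com/Oneflow-Inc/OneCloud/issues/136#issuecomment-1194951405 Problem summary: cosine_similarity raises an error when its two inputs have different dims.
Implementation in this PR: when the two inputs have different shapes, cosine_similarity traverses both shapes to build a max_shape, then expands both inputs to that max_shape before the rest of the computation. The index used in this traversal was wrong, which caused an out-of-bounds access. This PR corrects the index.
PyTorch's implementation: cosine_similarity handling: https://github.com/pytorch/pytorch/blob/62c8d30f9f6715d0b60d78fb5f5913a2f3bd185b/aten/src/ATen/native/Distance.cpp#L275-L280 max_shape generation: https://github.com/pytorch/pytorch/blob/62c8d30f9f6715d0b60d78fb5f5913a2f3bd185b/aten/src/ATen/ExpandUtils.cpp#L17-L43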
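For illustration, here is a minimal Python sketch of how a max_shape can be derived from two shapes with different numbers of dims, following the usual right-aligned broadcasting rule. It is not the OneFlow or PyTorch source; the function name `infer_max_shape` and its error message are hypothetical. The point to notice is that each input is indexed with its own offset from the trailing dimension, so the shorter shape is never read out of bounds.

```python
from typing import Sequence, Tuple


def infer_max_shape(a: Sequence[int], b: Sequence[int]) -> Tuple[int, ...]:
    """Hypothetical helper: broadcast two shapes, aligned at the last dim."""
    ndim = max(len(a), len(b))
    out = [1] * ndim
    # Walk from the trailing dimension towards the leading one.
    for i in range(ndim - 1, -1, -1):
        offset = ndim - 1 - i
        # Each shape gets its own index; a negative index means the
        # shape has no such dimension, so it is treated as size 1.
        ia = len(a) - 1 - offset
        ib = len(b) - 1 - offset
        size_a = a[ia] if ia >= 0 else 1
        size_b = b[ib] if ib >= 0 else 1
        if size_a != size_b and size_a != 1 and size_b != 1:
            raise ValueError(
                f"shapes {tuple(a)} and {tuple(b)} differ at dim {i}: "
                f"{size_a} vs {size_b}"
            )
        out[i] = size_b if size_a == 1 else size_a
    return tuple(out)


if __name__ == "__main__":
    print(infer_max_shape((2, 3, 4), (4,)))
    print(infer_max_shape((5, 1, 4), (3, 1)))
```

Both inputs would then be expanded to the returned shape before the similarity is computed along the chosen dim.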
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8902/
Speed stats:
GPU Name: GeForce GTX 1080
✔️ OneFlow resnet50 time: 128.5ms (= 12854.1ms / 100, input_shape=[16, 3, 224, 224])
PyTorch resnet50 time: 150.3ms (= 15032.9ms / 100, input_shape=[16, 3, 224, 224])
✔️ Relative speed: 1.17 (= 150.3ms / 128.5ms)
OneFlow resnet50 time: 75.4ms (= 7537.5ms / 100, input_shape=[8, 3, 224, 224])
PyTorch resnet50 time: 85.3ms (= 8525.5ms / 100, input_shape=[8, 3, 224, 224])
✔️ Relative speed: 1.13 (= 85.3ms / 75.4ms)
OneFlow resnet50 time: 48.5ms (= 9690.4ms / 200, input_shape=[4, 3, 224, 224])
PyTorch resnet50 time: 58.9ms (= 11788.4ms / 200, input_shape=[4, 3, 224, 224])
✔️ Relative speed: 1.22 (= 58.9ms / 48.5ms)
OneFlow resnet50 time: 36.0ms (= 7199.5ms / 200, input_shape=[2, 3, 224, 224])
PyTorch resnet50 time: 39.8ms (= 7956.0ms / 200, input_shape=[2, 3, 224, 224])
✔️ Relative speed: 1.11 (= 39.8ms / 36.0ms)
OneFlow resnet50 time: 28.2ms (= 5639.6ms / 200, input_shape=[1, 3, 224, 224])
PyTorch resnet50 time: 39.7ms (= 7942.7ms / 200, input_shape=[1, 3, 224, 224])
✔️ Relative speed: 1.41 (= 39.7ms / 28.2ms)
OneFlow swin dataloader time: 0.272s (= 54.335s / 200, num_workers=1)
PyTorch swin dataloader time: 0.150s (= 30.005s / 200, num_workers=1)
Relative speed: 0.552 (= 0.150s / 0.272s)
OneFlow swin dataloader time: 0.075s (= 15.011s / 200, num_workers=4)
PyTorch swin dataloader time: 0.041s (= 8.220s / 200, num_workers=4)
Relative speed: 0.548 (= 0.041s / 0.075s)
OneFlow swin dataloader time: 0.040s (= 7.973s / 200, num_workers=8)
PyTorch swin dataloader time: 0.021s (= 4.225s / 200, num_workers=8)
Relative speed: 0.530 (= 0.021s / 0.040s)
❌ OneFlow resnet50 time: 136.6ms (= 13660.8ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 161.7ms (= 16173.0ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.18 (= 161.7ms / 136.6ms)
OneFlow resnet50 time: 84.3ms (= 8434.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 102.4ms (= 10236.9ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.21 (= 102.4ms / 84.3ms)
OneFlow resnet50 time: 57.8ms (= 11551.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 78.5ms (= 15705.3ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.36 (= 78.5ms / 57.8ms)
OneFlow resnet50 time: 45.3ms (= 9057.8ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 71.2ms (= 14247.5ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.57 (= 71.2ms / 45.3ms)
OneFlow resnet50 time: 39.3ms (= 7856.8ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 74.8ms (= 14961.4ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.90 (= 74.8ms / 39.3ms)
CI failed when running job: cuda-misc. PR label automerge has been removed