oneflow
Record and print Python stack even in background thread
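To address the issue one yolo ran into ("you can't see the out of memory error without a magnifying glass"), this PR does the following:
- Adds a ForeignStackGetter class and a Python StackGetter subclass (implemented in C++) to obtain the Python stack
- Adds a frame member to every instruction, holding the Python stack at the time the instruction was constructed
- When CHECK_JUST or CHECK_JUST_MSG fails, throws an exception instead of calling LOG(FATAL). The exception message contains the original error message plus the Python stack of the current instruction, with color added for readability
Comparison of the output:
This PR:
master:
The starting point of this PR is to address the issue one yolo ran into ("you can't see the out of memory error without a magnifying glass").
This looks good.
Refactored this PR: the call stack is now maintained by customizing the Python interpreter's tstate->interp->eval_frame function.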
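The idea in the list above can be illustrated with a small pure-Python sketch. It is not the PR's C++ implementation: `Instruction`, `InstructionError`, `run_instruction` and the worker loop are hypothetical names made up for this example. It only shows the principle of recording the Python stack when an instruction is created, then attaching that stack (with an ANSI color) to the error raised later in a background thread.

```python
import queue
import threading
import traceback

RED = "\033[31m"   # ANSI color codes, used only for readability
RESET = "\033[0m"


class InstructionError(RuntimeError):
    """Hypothetical error type: original message plus the recorded Python stack."""


class Instruction:
    def __init__(self, name, fn):
        self.name = name
        self.fn = fn
        # Record the Python stack of the code that created this instruction,
        # dropping the last entry (this __init__ itself).
        self.frame = traceback.extract_stack()[:-1]


def run_instruction(instr):
    try:
        instr.fn()
    except Exception as e:
        stack_text = "".join(traceback.format_list(instr.frame))
        raise InstructionError(
            f"{RED}{e}{RESET}\n"
            f"Python stack at instruction construction:\n{stack_text}"
        ) from e


def worker(q, errors):
    # Background thread: its own stack says nothing about the user code,
    # so the stack recorded on the instruction is what gets reported.
    while True:
        instr = q.get()
        if instr is None:
            break
        try:
            run_instruction(instr)
        except InstructionError as e:
            errors.append(e)


def fail_out_of_memory():
    raise MemoryError("out of memory")


def user_code(q):
    # The frame a user cares about: where the failing op was issued.
    q.put(Instruction("alloc", fail_out_of_memory))


def demo():
    q, errors = queue.Queue(), []
    t = threading.Thread(target=worker, args=(q, errors))
    t.start()
    user_code(q)
    q.put(None)  # sentinel: stop the worker
    t.join()
    return errors
```

Calling `demo()` returns a list with one error whose message mentions both "out of memory" and the `user_code` frame, even though the failure happened in the worker thread.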
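As a rough analogy only, the same "maintain a shadow call stack as frames are entered and left" idea can be sketched in pure Python with `sys.setprofile`. This is a swapped-in technique, not the eval_frame hook the PR uses in C; the helper names below are hypothetical.

```python
import sys
import threading

_local = threading.local()  # one shadow stack per thread


def _profile(frame, event, arg):
    # Only Python-level 'call' and 'return' events are tracked here;
    # C function events (c_call, c_return, ...) are ignored.
    stack = getattr(_local, "stack", None)
    if stack is None:
        stack = _local.stack = []
    if event == "call":
        stack.append(frame.f_code.co_name)
    elif event == "return" and stack:
        stack.pop()


def current_shadow_stack():
    # list() is a C call, so taking the snapshot does not push a new entry.
    return list(_local.stack)


def inner():
    return current_shadow_stack()


def outer():
    return inner()


def demo():
    _local.stack = []
    previous = sys.getprofile()
    sys.setprofile(_profile)      # takes effect for the current thread
    try:
        return outer()
    finally:
        sys.setprofile(previous)  # restore whatever hook was installed before
```

Here `demo()` returns the function names from `outer` down to the helper that took the snapshot. A hook at the eval_frame level does the same bookkeeping inside the interpreter, which is why its cost matters for eager-mode throughput.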
Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.
Commit https://github.com/Oneflow-Inc/oneflow/commit/684b0a43b5cb9ca5e698a32c04a0cb90e0340f12 fails to compile: https://github.com/Oneflow-Inc/oneflow/actions/runs/3286674212
ninja: build stopped: subcommand failed.
Error: {"ID":"0f3c12d8ef0a3ba46d79b1edd45ecf156129ab222dfae70cdc2bfaa937b55462","Running":false,"ExitCode":1,"ProcessConfig":{"tty":false,"entrypoint":"bash","arguments":["-l","/home/ci-user/runners/release/_work/oneflow/oneflow/ci/manylinux/build-gcc7.sh"],"privileged":false},"OpenStdin":false,"OpenStderr":true,"OpenStdout":true,"CanRemove":false,"ContainerID":"415a31872fb6d3610161033266276f17e23a7258abc57a2e18ed6353f33c1541","DetachKeys":"","Pid":2822682}
@daquexian @lixinqi
/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/api/python/custom_eval_frame.c:25:10: fatal error: internal/pycore_pystate.h: No such file or directory #include "internal/pycore_pystate.h"
It looks like the relevant dependency is not handled properly in CMake. @daquexian
OK, I'll take a look.
I feel the core logic of the custom stack has nothing to do with oneflow. Should the related code live in its own directory, or even under tools?
How about putting it in oneflow/extension/<language>/stack/? I think that fits well.
The bug is fixed. It was caused by calling Py_DECREF without holding the GIL.
Eager test
- Machine: oneflow28, NVIDIA GeForce RTX 3080 Ti
- Disk: SSD
Case | check_xx@2a23745 | master@a3841f5 |
---|---|---|
ResNet50_DCcpu_FP32_mb96_gb96_acc1_1n1g | 390.69 / 11669 MiB | 398.05 / 11669 MiB |
bert_large_pretrain_eager_nl24_nah16_hs1024_FP32_acfalse_DP1_MP1_PP1_zerofalse_stage0_mbs1_gbs1_acc1_1n1g | 6.73 samples/s / 8703 MiB | 10.91 samples/s / 8575 MiB |
- ResNet50 shows a drop of about 2%.
- libai (the bert_large case) shows a drop of about 40%, with somewhat higher GPU memory usage.
- The tests above were run multiple times.
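The percentages quoted above can be checked directly from the table. The snippet below is a small worked calculation; `relative_change` is a helper name introduced here for illustration.

```python
def relative_change(new, baseline):
    """Relative change of `new` versus `baseline`, in percent."""
    return (new - baseline) / baseline * 100.0


# First number of each cell: check_xx@2a23745 versus master@a3841f5.
resnet50_change = relative_change(390.69, 398.05)
bert_change = relative_change(6.73, 10.91)

# Second number of the bert_large cells: memory in MiB.
bert_memory_increase_mib = 8703 - 8575

print(f"ResNet50:   {resnet50_change:.2f}%")
print(f"bert_large: {bert_change:.2f}%")
print(f"bert_large memory: +{bert_memory_increase_mib} MiB")
```

This gives roughly -1.85% for ResNet50 and -38.3% for bert_large, consistent with the "about 2%" and "about 40%" figures, plus 128 MiB of extra memory in the bert_large case.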
One-step reproduction:
wget https://raw.githubusercontent.com/Oneflow-Inc/OneAutoTest/main/onebench/libai/bert/run.sh
Then run bash run.sh and bash run.sh check_xx directly.
Got it! It seems the implementation still has some issues. I'll adjust it further.
Eager test
- Machine: oneflow28, NVIDIA GeForce RTX 3080 Ti
- Disk: SSD
Case | check_xx@aa489bc | master@a3841f5 |
---|---|---|
ResNet50_DCcpu_FP32_mb96_gb96_acc1_1n1g | 399.62 / 11669 MiB | 398.05 / 11669 MiB |
bert_large_pretrain_eager_nl24_nah16_hs1024_FP32_acfalse_DP1_MP1_PP1_zerofalse_stage0_mbs1_gbs1_acc1_1n1g | 10.96 samples/s / 8703 MiB | 10.91 samples/s / 8575 MiB |
- The tests look fine now.
To be safe, I added an ONEFLOW_PYTHON_STACK_GETTER environment variable. Users have to turn the feature on explicitly.
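A minimal sketch of opting in from a Python script follows. The thread only names the variable; the value "1" and the need to set it before oneflow is imported are assumptions, not something confirmed here.

```python
import os

# Assumption: "1" enables the feature. The accepted values are not stated
# in this thread.
os.environ["ONEFLOW_PYTHON_STACK_GETTER"] = "1"

# Assumption: the variable is read when oneflow starts up, so set it before
# the import. Uncomment in an environment where oneflow is installed.
# import oneflow as flow
```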
CI failed when running job: cpu-misc. PR label automerge has been removed
Speed stats:
GPU Name: GeForce GTX 1080
❌ OneFlow resnet50 time: 139.8ms (= 13976.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 161.0ms (= 16100.3ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 161.0ms / 139.8ms)
OneFlow resnet50 time: 85.4ms (= 8538.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 111.3ms (= 11126.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.30 (= 111.3ms / 85.4ms)
OneFlow resnet50 time: 57.5ms (= 11492.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 88.8ms (= 17760.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.55 (= 88.8ms / 57.5ms)
OneFlow resnet50 time: 43.8ms (= 8763.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.2ms (= 14045.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.60 (= 70.2ms / 43.8ms)
OneFlow resnet50 time: 39.3ms (= 7866.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 45.3ms (= 9069.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 45.3ms / 39.3ms)
View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8864/