Record and print Python stack even in background thread

Open daquexian opened this issue 2 years ago • 14 comments

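To address the problem encountered in one yolo ("you can't spot the out of memory error without a magnifying glass"), this PR does the following:

  1. Adds a ForeignStackGetter class and a Python StackGetter subclass (implemented in C++) to capture the Python stack.
  2. Adds a frame member to every instruction, holding the Python stack at the time the instruction was constructed.
  3. When CHECK_JUST or CHECK_JUST_MSG fails, throws an exception instead of calling LOG(FATAL). The exception carries the original error message together with the Python stack of the current instruction, with color added for readability.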

Comparison:

This PR: image

master: image

daquexian avatar Aug 07 '22 09:08 daquexian

The motivation for this PR is to address the problem encountered in one yolo ("you can't spot the out of memory error without a magnifying glass").

daquexian avatar Aug 07 '22 10:08 daquexian

This looks good.

yuanms2 avatar Aug 07 '22 10:08 yuanms2

Refactored this PR: the call stack is now maintained by customizing the Python interpreter's tstate->interp->eval_frame function.
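The eval_frame hook is a C-level CPython mechanism (PEP 523), so it cannot be shown directly in Python. As a rough analogy only, and not the PR's implementation, the sketch below maintains a per-thread shadow call stack with `sys.setprofile`, which is likewise notified whenever a Python frame is entered or left. The names `_profile`, `current_stack`, `inner`, and `outer` are hypothetical.

```python
import sys
import threading

_local = threading.local()  # one shadow stack per thread


def _profile(frame, event, arg):
    """Push on Python function entry, pop on exit (C calls are ignored)."""
    stack = getattr(_local, "stack", None)
    if stack is None:
        stack = _local.stack = []
    if event == "call":
        code = frame.f_code
        stack.append((code.co_name, code.co_filename, frame.f_lineno))
    elif event == "return" and stack:
        stack.pop()


def current_stack():
    """Return a copy of the shadow stack maintained for this thread."""
    return list(getattr(_local, "stack", []))


def inner():
    return current_stack()


def outer():
    return inner()


sys.setprofile(_profile)
try:
    snapshot = outer()
finally:
    sys.setprofile(None)

print([name for name, _file, _line in snapshot])
```

Because the stack is kept up to date as frames are entered and left, reading it at any moment is cheap. Note that `sys.setprofile` only affects the calling thread; `threading.setprofile` would be needed for threads started later.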

daquexian avatar Oct 19 '22 05:10 daquexian

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

github-actions[bot] avatar Oct 19 '22 05:10 github-actions[bot]

https://github.com/Oneflow-Inc/oneflow/commit/684b0a43b5cb9ca5e698a32c04a0cb90e0340f12 fails to compile: https://github.com/Oneflow-Inc/oneflow/actions/runs/3286674212

ninja: build stopped: subcommand failed.
Error: {"ID":"0f3c12d8ef0a3ba46d79b1edd45ecf156129ab222dfae70cdc2bfaa937b55462","Running":false,"ExitCode":1,"ProcessConfig":{"tty":false,"entrypoint":"bash","arguments":["-l","/home/ci-user/runners/release/_work/oneflow/oneflow/ci/manylinux/build-gcc7.sh"],"privileged":false},"OpenStdin":false,"OpenStderr":true,"OpenStdout":true,"CanRemove":false,"ContainerID":"415a31872fb6d3610161033266276f17e23a7258abc57a2e18ed6353f33c1541","DetachKeys":"","Pid":2822682}

@daquexian @lixinqi

xyn1201 avatar Oct 20 '22 05:10 xyn1201

684b0a4 fails to compile: https://github.com/Oneflow-Inc/oneflow/actions/runs/3286674212

ninja: build stopped: subcommand failed.
Error: {"ID":"0f3c12d8ef0a3ba46d79b1edd45ecf156129ab222dfae70cdc2bfaa937b55462","Running":false,"ExitCode":1,"ProcessConfig":{"tty":false,"entrypoint":"bash","arguments":["-l","/home/ci-user/runners/release/_work/oneflow/oneflow/ci/manylinux/build-gcc7.sh"],"privileged":false},"OpenStdin":false,"OpenStderr":true,"OpenStdout":true,"CanRemove":false,"ContainerID":"415a31872fb6d3610161033266276f17e23a7258abc57a2e18ed6353f33c1541","DetachKeys":"","Pid":2822682}

@daquexian @lixinqi

/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/api/python/custom_eval_frame.c:25:10: fatal error: internal/pycore_pystate.h: No such file or directory #include "internal/pycore_pystate.h"

It looks like the relevant dependency isn't handled properly in CMake. @daquexian

lixinqi avatar Oct 20 '22 05:10 lixinqi

/home/ci-user/runners/release/_work/oneflow/oneflow/oneflow/api/python/custom_eval_frame.c:25:10: fatal error: internal/pycore_pystate.h: No such file or directory #include "internal/pycore_pystate.h"

It looks like the relevant dependency isn't handled properly in CMake. @daquexian

OK, I'll take a look.

daquexian avatar Oct 20 '22 06:10 daquexian

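I feel that the core logic of the custom stack has nothing to do with oneflow. Should all the related code go into its own directory, or even into tools?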

lixinqi avatar Oct 24 '22 02:10 lixinqi

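I feel that the core logic of the custom stack has nothing to do with oneflow. Should all the related code go into its own directory, or even into tools?

How about putting it in oneflow/extension/<language>/stack/? That seems like a good fit to me.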

daquexian avatar Oct 24 '22 02:10 daquexian

The bug is fixed. It was caused by calling Py_DECREF without holding the GIL.

daquexian avatar Nov 01 '22 12:11 daquexian

Eager test

  • Machine: oneflow28, NVIDIA GeForce RTX 3080 Ti
  • Disk: SSD

| Case | check_xx@2a23745 | master@a3841f5 |
| --- | --- | --- |
| ResNet50_DCcpu_FP32_mb96_gb96_acc1_1n1g | 390.69 / 11669 MiB | 398.05 / 11669 MiB |
| bert_large_pretrain_eager_nl24_nah16_hs1024_FP32_acfalse_DP1_MP1_PP1_zerofalse_stage0_mbs1_gbs1_acc1_1n1g | 6.73 samples/s / 8703 MiB | 10.91 samples/s / 8575 MiB |

  • ResNet50 shows a drop of about 2%.
  • libai shows a drop of about 40%, and GPU memory usage increases somewhat.
  • The tests above were run multiple times.
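As a quick check of the reported percentages, the relative changes can be recomputed from the figures in the comparison above. This is a minimal sketch; the helper name `relative_change` is introduced here for illustration only.

```python
def relative_change(pr, master):
    """Return the change of `pr` relative to `master`, in percent."""
    return (pr - master) / master * 100.0


# Throughput and memory figures copied from the eager test results.
resnet50 = relative_change(390.69, 398.05)  # ResNet50 throughput
bert = relative_change(6.73, 10.91)         # BERT-large throughput, samples/s
bert_mem = relative_change(8703, 8575)      # BERT-large GPU memory, MiB

print(f"ResNet50 throughput: {resnet50:+.1f}%")
print(f"BERT throughput:     {bert:+.1f}%")
print(f"BERT GPU memory:     {bert_mem:+.1f}%")
```

The recomputed values are consistent with the rounded "about 2%" and "about 40%" drops stated in the comment.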

One-step reproduction: wget https://raw.githubusercontent.com/Oneflow-Inc/OneAutoTest/main/onebench/libai/bert/run.sh, then simply run bash run.sh, for example: bash run.sh check_xx

ouyangyu avatar Nov 02 '22 09:11 ouyangyu

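ResNet50 shows a drop of about 2%. libai shows a drop of about 40%, and GPU memory usage increases somewhat. The tests above were run multiple times.

Got it! It looks like the implementation still has some issues; I'll adjust it further.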

daquexian avatar Nov 02 '22 09:11 daquexian

Eager test

  • Machine: oneflow28, NVIDIA GeForce RTX 3080 Ti
  • Disk: SSD

| Case | check_xx@aa489bc | master@a3841f5 |
| --- | --- | --- |
| ResNet50_DCcpu_FP32_mb96_gb96_acc1_1n1g | 399.62 / 11669 MiB | 398.05 / 11669 MiB |
| bert_large_pretrain_eager_nl24_nah16_hs1024_FP32_acfalse_DP1_MP1_PP1_zerofalse_stage0_mbs1_gbs1_acc1_1n1g | 10.96 samples/s / 8703 MiB | 10.91 samples/s / 8575 MiB |

  • The tests look fine now.

ouyangyu avatar Nov 07 '22 03:11 ouyangyu

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

github-actions[bot] avatar Dec 18 '22 12:12 github-actions[bot]

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

github-actions[bot] avatar Dec 19 '22 07:12 github-actions[bot]

To be safe, I added an ONEFLOW_PYTHON_STACK_GETTER environment variable; users have to enable the feature explicitly.
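An opt-in switch of this kind can be sketched as follows. Only the variable name ONEFLOW_PYTHON_STACK_GETTER comes from the thread; the helper `stack_getter_enabled` and the set of values treated as "on" are assumptions made for illustration, and OneFlow's actual parsing may differ.

```python
import os


def stack_getter_enabled(environ=os.environ):
    """Return True only when the user has explicitly opted in.

    The accepted truthy spellings below are an assumption of this sketch.
    """
    value = environ.get("ONEFLOW_PYTHON_STACK_GETTER", "")
    return value.strip().lower() in {"1", "true", "on", "yes"}


if stack_getter_enabled():
    print("Python stack recording is enabled")
else:
    print("Python stack recording is disabled (default)")
```

Defaulting to off means that users who never set the variable pay no overhead from the stack bookkeeping.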

daquexian avatar Dec 19 '22 07:12 daquexian

Code got formatted by CI. Please request CI again if you still want to have this PR merged. If the PR is from a forked repo, please download the patch files from the GitHub Actions web page and apply them locally.

github-actions[bot] avatar Dec 19 '22 07:12 github-actions[bot]

CI failed when running job: cpu-misc. PR label automerge has been removed

github-actions[bot] avatar Dec 19 '22 13:12 github-actions[bot]

Speed stats:

github-actions[bot] avatar Dec 19 '22 13:12 github-actions[bot]

Speed stats:
GPU Name: GeForce GTX 1080

❌ OneFlow resnet50 time: 139.8ms (= 13976.9ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 161.0ms (= 16100.3ms / 100, input_shape=[16, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 161.0ms / 139.8ms)

OneFlow resnet50 time: 85.4ms (= 8538.3ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 111.3ms (= 11126.4ms / 100, input_shape=[8, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.30 (= 111.3ms / 85.4ms)

OneFlow resnet50 time: 57.5ms (= 11492.9ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 88.8ms (= 17760.2ms / 200, input_shape=[4, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.55 (= 88.8ms / 57.5ms)

OneFlow resnet50 time: 43.8ms (= 8763.9ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 70.2ms (= 14045.3ms / 200, input_shape=[2, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.60 (= 70.2ms / 43.8ms)

OneFlow resnet50 time: 39.3ms (= 7866.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
PyTorch resnet50 time: 45.3ms (= 9069.3ms / 200, input_shape=[1, 3, 224, 224], ddp, world size=2)
✔️ Relative speed: 1.15 (= 45.3ms / 39.3ms)

github-actions[bot] avatar Dec 20 '22 11:12 github-actions[bot]

View latest API docs preview at: https://staging.oneflow.info/docs/Oneflow-Inc/oneflow/pr/8864/

github-actions[bot] avatar Dec 20 '22 11:12 github-actions[bot]