
While_loop OP works during training but errors out in Paddle Inference

Open HoratioJSY opened this issue 3 years ago • 6 comments

Please ask your question

When we build a loop body with paddle.static.nn.while_loop(), the op that fetches the loop result works fine in the training program. But after exporting the program to an inference model with paddle.static.save_inference_model(), the op that fetches the loop result raises errors, and all of them are GPU-related.

We then followed the iterative-decoding code of GPT-3 in paddlenlp and replaced paddle.static.nn.while_loop() with paddle.fluid.layers.While(), and observed the same behavior: fetching the loop result works during training, but after exporting to an inference model it always raises GPU-related errors.
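For readers unfamiliar with the API involved, the contract behind paddle.static.nn.while_loop(cond, body, loop_vars) can be sketched in plain eager Python: body is called repeatedly while cond holds, threading loop_vars through each iteration. The token values and the "decoder" below are made-up stand-ins for illustration, not the author's model:

```python
def while_loop(cond, body, loop_vars):
    """Minimal eager-mode equivalent of the paddle.static.nn.while_loop contract."""
    while cond(*loop_vars):
        loop_vars = body(*loop_vars)
    return loop_vars

def cond(step, max_steps, tokens):
    # Keep looping until max_steps decode steps have been taken.
    return step < max_steps

def body(step, max_steps, tokens):
    # Stand-in for one decoder forward pass producing the next token id.
    next_token = tokens[-1] + 1
    return step + 1, max_steps, tokens + [next_token]

step, _, tokens = while_loop(cond, body, [0, 3, [101]])
print(step, tokens)  # 3 [101, 102, 103, 104]
```

In the static graph this loop is a single `while` op whose sub-block runs the decoder step, which is why the error logs below show the exception surfacing first in `softmax` and then in `while`.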

A preliminary investigation shows that the error does not occur when paddle.inference.Config().collect_shape_range_info() is disabled, and does occur when it is enabled.
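The toggle in question sits in the Paddle Inference config. A sketch of the setup being compared is below; the model/params paths and pool size are placeholders, and the calls shown (collect_shape_range_info, switch_ir_optim, enable_use_gpu) are from the Paddle Inference Python API as I understand it:

```python
import paddle.inference as paddle_infer

# Placeholder paths; substitute the exported inference model files.
config = paddle_infer.Config("model.pdmodel", "model.pdiparams")
config.enable_use_gpu(1000, 1)   # 1000 MB initial memory pool, GPU device 1
config.switch_ir_optim(False)    # all IR optimizations off

# collect_shape_range_info records min/max/opt tensor shapes to a file for
# later TensorRT dynamic-shape tuning. Per the report, the run crashes with
# this line present and succeeds without it.
config.collect_shape_range_info("shape_range_info.pbtxt")

predictor = paddle_infer.create_predictor(config)
```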

Relevant information follows:

  1. During training, the test model fetches the loop-body results normally, i.e. multi-step decoding similar to GPT:
2022-08-05 10:21:53,426 - INFO - lm task steps: 8 | epoch: 0 | loss: 3.229317 | acc: 0.484578 | strict_acc: 0.279502 | speed: 2.35 s/step | learning rate: 0.00000001
decoding three steps: ['Context<e>', '(<e>', 'container']
decoding three steps: ['public<e>', 'void<e>', 'close<e>']
2022-08-05 10:21:57,864 - INFO - lm task steps: 10 | epoch: 0 | loss: 2.351933 | acc: 0.618152 | strict_acc: 0.375983 | speed: 2.22 s/step | learning rate: 0.00000001
decoding three steps: ['"<_e>', '[<_e>', 'Callback<_e>']
decoding three steps: ['.<_e>', 'lang<_e>', '.<_e>']
2022-08-05 10:22:02,308 - INFO - lm task steps: 12 | epoch: 0 | loss: 2.739078 | acc: 0.525738 | strict_acc: 0.246813 | speed: 2.22 s/step | learning rate: 0.00000001
decoding three steps: ['{<e>', '"<_e>', '1<_e>']
decoding three steps: ['(<e>', 'context<e>', '(<e>']

With collect_shape_range_info() disabled, Paddle Inference also decodes multiple steps normally.

  2. Paddle Inference running without TensorRT or INT8, with paddle.inference.Config().collect_shape_range_info() enabled, fetching the multi-step decoding results:

I0805 02:37:44.866912 17526 analysis_predictor.cc:1007] ======= optimize end =======
I0805 02:37:44.874030 17526 naive_executor.cc:102] ---  skip [feed], feed -> no_mask_len
I0805 02:37:44.874073 17526 naive_executor.cc:102] ---  skip [feed], feed -> tokens
I0805 02:37:44.876395 17526 naive_executor.cc:102] ---  skip [save_infer_model/scale_0.tmp_0], fetch -> fetch
I0805 02:37:44.876420 17526 naive_executor.cc:102] ---  skip [save_infer_model/scale_1.tmp_0], fetch -> fetch
W0805 02:37:44.876917 17526 gpu_context.cc:278] Please NOTE: device: 1, GPU Compute Capability: 8.0, Driver API Version: 11.4, Runtime API Version: 11.2
W0805 02:37:44.879653 17526 gpu_context.cc:306] device: 1, cuDNN Version: 8.2.
W0805 02:37:47.084254 17526 operator.cc:284] softmax raises an exception paddle::PD_Exception, the gpu dnn handle is nullptr.
  [/paddle/paddle/phi/backends/gpu/gpu_context.cc:526]
W0805 02:37:47.085661 17526 operator.cc:284] while raises an exception paddle::PD_Exception, the gpu dnn handle is nullptr.
  [/paddle/paddle/phi/backends/gpu/gpu_context.cc:526]
Traceback (most recent call last):
  File "model_compression/quant/quant_infer_pangu.py", line 364, in <module>
    main()
  File "model_compression/quant/quant_infer_pangu.py", line 360, in main
    predictor.predict()
  File "model_compression/quant/quant_infer_pangu.py", line 330, in predict
    output = self.fake_predict_batch()
  File "model_compression/quant/quant_infer_pangu.py", line 185, in fake_predict_batch
    self.predictor.run()
RuntimeError: the gpu dnn handle is nullptr.
  [/paddle/paddle/phi/backends/gpu/gpu_context.cc:526]

Another variant of the error:


I0805 02:43:35.151588 22047 analysis_predictor.cc:1007] ======= optimize end =======
I0805 02:43:35.158573 22047 naive_executor.cc:102] ---  skip [feed], feed -> no_mask_len
I0805 02:43:35.158604 22047 naive_executor.cc:102] ---  skip [feed], feed -> tokens
I0805 02:43:35.161098 22047 naive_executor.cc:102] ---  skip [save_infer_model/scale_0.tmp_0], fetch -> fetch
I0805 02:43:35.161119 22047 naive_executor.cc:102] ---  skip [save_infer_model/scale_1.tmp_0], fetch -> fetch
W0805 02:43:35.161603 22047 gpu_context.cc:278] Please NOTE: device: 1, GPU Compute Capability: 8.0, Driver API Version: 11.4, Runtime API Version: 11.2
W0805 02:43:35.164307 22047 gpu_context.cc:306] device: 1, cuDNN Version: 8.2.
W0805 02:43:37.264005 22047 operator.cc:284] softmax raises an exception paddle::PD_Exception, the gpu stream is nullptr.
  [/paddle/paddle/phi/backends/gpu/gpu_context.cc:377]
W0805 02:43:37.265502 22047 operator.cc:284] while raises an exception paddle::PD_Exception, the gpu stream is nullptr.
  [/paddle/paddle/phi/backends/gpu/gpu_context.cc:377]
Traceback (most recent call last):
  File "model_compression/quant/quant_infer_pangu.py", line 364, in <module>
    main()
  File "model_compression/quant/quant_infer_pangu.py", line 360, in main
    predictor.predict()
  File "model_compression/quant/quant_infer_pangu.py", line 330, in predict
    output = self.fake_predict_batch()
  File "model_compression/quant/quant_infer_pangu.py", line 185, in fake_predict_batch
    self.predictor.run()
RuntimeError: the gpu stream is nullptr.
  [/paddle/paddle/phi/backends/gpu/gpu_context.cc:377]

HoratioJSY, Aug 05 '22 03:08


Hi! We've received your issue; please be patient while we respond. We will arrange for technicians to answer your questions as soon as possible. Please make sure that you have posted enough information to describe your request: a clear problem description, reproduction code, environment and version, and error messages. You may also check out the API docs, FAQ, Github Issues and the AI community to get an answer. Have a nice day!

paddle-bot[bot], Aug 05 '22 03:08

Does it run normally with all optimizations turned off? config.SwitchIrOptim(false);

jiweibo, Aug 05 '22 03:08

Even with all optimizations off, the error occurs as long as collect_shape_range_info is enabled.

This configuration errors out:


+--------------------------+-------------------------------------------------------------------------------+
| Option                   | Value                                                                         |
+--------------------------+-------------------------------------------------------------------------------+
| model_file               | /ParallelismCodeTransformer/quant_panGu/saved_model_20/QuantPanGu.pdmodel     |
| params_file              | /ParallelismCodeTransformer/quant_panGu/saved_model_20/QuantPanGu.pdiparams   |
+--------------------------+-------------------------------------------------------------------------------+
| cpu_math_thread          | 1                                                                             |
| enable_mkldnn            | false                                                                         |
| mkldnn_cache_capacity    | 10                                                                            |
+--------------------------+-------------------------------------------------------------------------------+
| use_gpu                  | true                                                                          |
| gpu_device_id            | 1                                                                             |
| memory_pool_init_size    | 1000MB                                                                        |
| thread_local_stream      | false                                                                         |
| use_tensorrt             | false                                                                         |
+--------------------------+-------------------------------------------------------------------------------+
| use_xpu                  | false                                                                         |
+--------------------------+-------------------------------------------------------------------------------+
| ir_optim                 | false                                                                         |
| ir_debug                 | false                                                                         |
| memory_optim             | false                                                                         |
| enable_profile           | false                                                                         |
| enable_log               | true                                                                          |
| collect_shape_range_info | /ParallelismCodeTransformer/quant_panGu/saved_model_20/shape_range_info.pbtxt |
+--------------------------+-------------------------------------------------------------------------------+

This configuration works fine:


+--------------------------+-----------------------------------------------------------------------------+
| Option                   | Value                                                                       |
+--------------------------+-----------------------------------------------------------------------------+
| model_file               | /ParallelismCodeTransformer/quant_panGu/saved_model_20/QuantPanGu.pdmodel   |
| params_file              | /ParallelismCodeTransformer/quant_panGu/saved_model_20/QuantPanGu.pdiparams |
+--------------------------+-----------------------------------------------------------------------------+
| cpu_math_thread          | 1                                                                           |
| enable_mkldnn            | false                                                                       |
| mkldnn_cache_capacity    | 10                                                                          |
+--------------------------+-----------------------------------------------------------------------------+
| use_gpu                  | true                                                                        |
| gpu_device_id            | 1                                                                           |
| memory_pool_init_size    | 1000MB                                                                      |
| thread_local_stream      | false                                                                       |
| use_tensorrt             | false                                                                       |
+--------------------------+-----------------------------------------------------------------------------+
| use_xpu                  | false                                                                       |
+--------------------------+-----------------------------------------------------------------------------+
| ir_optim                 | false                                                                       |
| ir_debug                 | false                                                                       |
| memory_optim             | true                                                                        |
| enable_profile           | false                                                                       |
| enable_log               | true                                                                        |
| collect_shape_range_info | false                                                                       |
+--------------------------+-----------------------------------------------------------------------------+

HoratioJSY, Aug 05 '22 03:08

The observation above may not be entirely accurate. Keeping the training and export code unchanged, after re-exporting the model once more, the "gpu dnn handle is nullptr" error always occurs when collect_shape_range_info is enabled. With it disabled, the error appears randomly; sometimes the run succeeds.

With export GLOG_v=10 set at runtime, the following error section is the only place in the whole log where anything resembling an ERROR appears:

W0805 06:52:14.390362 21907 operator.cc:284] softmax raises an exception paddle::PD_Exception, the gpu stream is nullptr.
  [/paddle/paddle/phi/backends/gpu/gpu_context.cc:377]
I0805 06:52:14.390684 21907 executor.cc:67] destroy ExecutorPrepareContext
W0805 06:52:14.392383 21907 operator.cc:284] while raises an exception paddle::PD_Exception, the gpu stream is nullptr.
  [/paddle/paddle/phi/backends/gpu/gpu_context.cc:377]
I0805 06:52:14.401441 21907 imperative.cc:1956] Tracer(0x416be40) set expected place Place(gpu:0)
I0805 06:52:14.401522 21907 global_utils.h:66] Set current tracer for Controller: 0
I0805 06:52:14.401544 21907 tracer.cc:52] Set current tracer: 0
I0805 06:52:14.409641 21907 mmap_allocator.cc:273] PID: 21907, MemoryMapFdSet: set size - 0
Traceback (most recent call last):
  File "model_compression/quant/quant_infer_pangu.py", line 365, in <module>
    main()
  File "model_compression/quant/quant_infer_pangu.py", line 361, in main
    predictor.predict()
  File "model_compression/quant/quant_infer_pangu.py", line 329, in predict
    output = self.fake_predict_batch()
  File "model_compression/quant/quant_infer_pangu.py", line 184, in fake_predict_batch
    self.predictor.run()
RuntimeError: the gpu stream is nullptr.
  [/paddle/paddle/phi/backends/gpu/gpu_context.cc:377]
I0805 06:52:14.456353 21907 stream_safe_cuda_allocator.cc:200] Try free allocation 0x7f8023367600
I0805 06:52:14.456416 21907 stream_safe_cuda_allocator.cc:202] Directly delete allocation
I0805 06:52:14.456430 21907 auto_growth_best_fit_allocator.cc:116] Free 256 bytes, ptr = 0x7f8023367600
I0805 06:52:14.456449 21907 stream_safe_cuda_allocator.cc:200] Try free allocation 0x7f7f4f217e00
I0805 06:52:14.456458 21907 stream_safe_cuda_allocator.cc:202] Directly delete allocation
I0805 06:52:14.456468 21907 auto_growth_best_fit_allocator.cc:116] Free 16640 bytes, ptr = 0x7f7f4f217e00
I0805 06:52:14.456488 21907 naive_best_fit_allocator.cc:101] Free pointer=0x7f7fc0ddb040 on Place(cpu)
I0805 06:52:14.456504 21907 buddy_allocator.cc:141] Free from address 0x7f7fc0ddb000

HoratioJSY, Aug 05 '22 07:08

This looks strange. CollectShape effectively runs with all optimizations turned off, so it should behave the same as simply disabling the optimizations.

Could you share a model that reproduces the error, so I can reproduce the problem locally?

jiweibo, Aug 09 '22 06:08

Sure, one moment. I upgraded Paddle 2.3.0 to the develop branch (and Paddle Inference accordingly), and some of the earlier problems are resolved (mainly the TensorRT-subgraph ones). I will re-test while_loop, and if the behavior is unchanged I will provide a test model.

HoratioJSY, Aug 09 '22 08:08

The problem still occurs. The current behavior: the while op works normally in static-graph training and produces multi-step predictions. But after exporting to an inference model, Paddle Inference randomly hits one of three failures when prediction starts: 1. RuntimeError: the gpu stream is nullptr; 2. Segmentation fault; 3. GPU memory is occupied with no GPU utilization, no error is raised, and the process hangs forever.

The first error is shown above; the second looks like this:


I0812 07:28:39.661653 23546 analysis_predictor.cc:1266] ======= optimize end =======
I0812 07:28:39.668598 23546 naive_executor.cc:110] ---  skip [feed], feed -> no_mask_len
I0812 07:28:39.668627 23546 naive_executor.cc:110] ---  skip [feed], feed -> common_len
I0812 07:28:39.668630 23546 naive_executor.cc:110] ---  skip [feed], feed -> user_id
I0812 07:28:39.668634 23546 naive_executor.cc:110] ---  skip [feed], feed -> tokens
I0812 07:28:39.670888 23546 naive_executor.cc:110] ---  skip [save_infer_model/scale_0.tmp_0], fetch -> fetch
I0812 07:28:39.670915 23546 naive_executor.cc:110] ---  skip [save_infer_model/scale_1.tmp_0], fetch -> fetch
W0812 07:28:39.671312 23546 gpu_resources.cc:61] Please NOTE: device: 1, GPU Compute Capability: 8.0, Driver API Version: 11.4, Runtime API Version: 11.2
W0812 07:28:39.674665 23546 gpu_resources.cc:91] device: 1, cuDNN Version: 8.4.


--------------------------------------
C++ Traceback (most recent call last):
--------------------------------------
0   paddle::AnalysisPredictor::ZeroCopyRun()
1   paddle::framework::NaiveExecutor::Run()
2   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, phi::Place const&)
3   paddle::operators::WhileOp::RunImpl(paddle::framework::Scope const&, phi::Place const&) const
4   paddle::framework::Executor::RunPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, bool, bool, bool)
5   paddle::framework::Executor::RunPartialPreparedContext(paddle::framework::ExecutorPrepareContext*, paddle::framework::Scope*, long, long, bool, bool, bool)
6   paddle::framework::OperatorBase::Run(paddle::framework::Scope const&, phi::Place const&)
7   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, phi::Place const&) const
8   paddle::framework::OperatorWithKernel::RunImpl(paddle::framework::Scope const&, phi::Place const&, paddle::framework::RuntimeContext*) const
9   void phi::SoftmaxForwardCUDAKernelDriver<float, false>(phi::GPUContext const&, phi::DenseTensor const&, int, phi::DenseTensor*)
10  void phi::SwitchWarpSoftmaxForward<float, int2, false>(int, dim3, phi::GPUContext const&, float*, float const*, int, int, int, int)
11  phi::GPUContext::Impl::stream() const

----------------------
Error Message Summary:
----------------------
FatalError: `Segmentation fault` is detected by the operating system.
  [TimeInfo: *** Aborted at 1660289320 (unix time) try "date -d @1660289320" if you are using GNU date ***]
  [SignalInfo: *** SIGSEGV (@0x10) received by PID 23546 (TID 0x7f76c7e6c740) from PID 16 ***]


I can provide a test model if needed, but my current workaround is to stop using the while op. Instead, I create a Parameter tensor inside the model to store the intermediate computation state and reuse that cached state on the next call; this should be only slightly slower than the while op.
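The workaround described above, caching intermediate state in a persistent tensor and driving the loop from the host instead of inside the graph, can be sketched in plain Python. The class names and the toy "decoder" here are illustrative stand-ins, not the author's code:

```python
class SingleStepDecoder:
    """Toy single-step model; self.cache stands in for the persistent
    Parameter tensor that holds past decoder state between calls."""

    def __init__(self):
        self.cache = []

    def step(self, token):
        # One forward pass: consume only the new token plus the cached state.
        self.cache.append(token)
        return self.cache[-1] + 1  # stand-in for the real next-token logic

def decode(decoder, first_token, n_steps):
    # Host-side driver loop replacing the in-graph while op: the graph is
    # invoked once per step, and state survives in the decoder's cache.
    tokens = [first_token]
    for _ in range(n_steps):
        tokens.append(decoder.step(tokens[-1]))
    return tokens

print(decode(SingleStepDecoder(), 101, 3))  # [101, 102, 103, 104]
```

The trade-off is one predictor invocation per decode step instead of one invocation for the whole loop, which adds per-call overhead but avoids the `while` op entirely in the exported inference graph.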

HoratioJSY, Aug 12 '22 07:08