QAT_Engine icon indicating copy to clipboard operation
QAT_Engine copied to clipboard

Nginx worker CPU 100%,endless loop

Open lizj3624 opened this issue 3 years ago • 4 comments

  1. env qat driver qat1.7.l.4.9.0-00008 Nginx 1.16 OpenSSL 1.1.1g QAT_Engine v0.5.44

  2. bug 企业咚咚20210525172606 企业咚咚20210525172625

  3. debug

  1. strace -o output.txt -T -tt -e trace=all -p 25130

cat output.txt

15:05:55.463315 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000030> 15:05:55.463462 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000017> 15:05:55.463510 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000024> 15:05:55.463567 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.463602 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000012> 15:05:55.463638 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.463669 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.463700 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.463731 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000010> 15:05:55.463762 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000010> 15:05:55.463793 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000010> 15:05:55.463824 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.463855 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.463886 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.463917 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.463948 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.463979 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.464010 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000010> 15:05:55.464041 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000010> 15:05:55.464072 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000010> 15:05:55.464104 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.464135 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.464184 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000012> 15:05:55.464217 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.464248 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.464279 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.464310 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.464341 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.464372 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.464403 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.464434 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.464465 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.464496 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.464527 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011> 15:05:55.464558 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000010> 15:05:55.464589 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000010> 15:05:55.464619 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000010> 15:05:55.464650 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000010> 15:05:55.464682 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000010> 15:05:55.464713 read(4297, 0x55bc6459e4b8, 8) = -1 EAGAIN (Resource temporarily unavailable) <0.000011>

  1. perf report 企业咚咚20210526110955

lizj3624 avatar May 26 '21 03:05 lizj3624

As mentioned in Nginx GitHub, Could you please check with latest of all software (Nginx, Qat_Engine, Driver) and see if it still gets reproduced

Yogaraj-Alamenda avatar Jun 03 '21 10:06 Yogaraj-Alamenda

Hi all, I have the quite same problem with the latest software version: async_nginx: 0.4.5 openssl: 1.1.1k qatengine: 0.6.4 qatdriver: 1.7.l.4.13.0.9

the reproduce situation: config values in nginx.conf : default_algorithms CIPHERS qat_poll_mode heuristic

I have debuged async_ningx and found there is a infinite loop. I think this is the reason here.

1 in function ngx_http_do_read_client_request_body(), nginx goin the for(;;)[line:288] loop and never break. as recv()[line:343] always return NGX_AGAIN, and c->read->ready always == 1 go deep in recv(), the NGX_AGAIN is return by func ngx_ssl_handle_recv()::line:2546 because of async job is paused. 2. when async context swapd, an other infinite loop was happend. in function qat_chained_ciphers_do_cipher() line:1554 as the read()[qat_pause_job():line279] always return EAGAIN. 3. As I know qat_crypto_callbackFn() is called by func qat_engine_poll(). I think, this because of the callback function qat_crypto_callbackFn() never have any CPU chance/CPU TIME to be called, then the paused async job never be waked up. then I check the POLL logic in async_nginx. I found point 4 descripte below. 4. In function ngx_ssl_engine_qat_heuristic_poll(), all the values of the six variables(num_*) never grow up, so function qat_engine_poll() have no any chance to execute.

when I change my engine config in nginx.conf, this issue is disappear, and i can work around. the config like below: qat_heuristic_poll_asym_threshold = 0 qat_heuristic_poll_sym_threshold = 0

It seems a logic deadlock here ? nginx want qat to update counters but counters updated need nginx release some CPU time. or, maybe the following code do not consider the long time idle SSL connections ? if (*num_asym_requests_in_flight + *num_kdf_requests_in_flight + *num_cipher_requests_in_flight + *num_asym_mb_items_in_queue + *num_kdf_mb_items_in_queue + *num_sym_mb_items_in_queue >= (int) *ngx_ssl_active) {

Anyone have any idea about this ?

tony-caotong avatar Jun 15 '21 09:06 tony-caotong

It's strange to see that all the num_* recorded in ngx_ssl_engine_qat_heuristic_poll() never increased. Because the number is recorded in QAT-Engine and increased each time when a qat_chained_ciphers_do_cipher() being called. https://github.com/intel/QAT_Engine/blob/378bfeb75d7bbc654991113955d25c8f9f95c2db/qat_hw_ciphers.c#L1536 Heuristic polling is designed to save polling overhead by polling QAT only when in-flight number of requests bigger than the predefined threshold. If the heuristic polling is enabled and the share info runs without error, the heuristic polling should run correctly.

daweiq avatar Jun 29 '21 03:06 daweiq

Yes, must be my description above not very clearly. The num_* are never increased since that nginx worker process fall into this loop code, not breakout: https://github.com/intel/asynch_mode_nginx/blob/7f674ff25c47d7dd5e335bed7a3fd6154d735b2e/src/http/ngx_http_request_body.c#L281-L378 Because of line:336 return EAGAIN, and line:387 ready==1

Any new connection or request can not be processed, so it never increased.

Those days, I review codes agian, and gdb it, have some new update. The async job in QAT_engine is pending in loop: https://github.com/intel/QAT_Engine/blob/378bfeb75d7bbc654991113955d25c8f9f95c2db/qat_hw_ciphers.c#L1539-L1555

Because the qat_wake_job() in qat_chained_callbackFn() have no chance to be called. https://github.com/intel/QAT_Engine/blob/9954c0c0aa5b52126716611ef9248e6de4fbd9ef/qat_ciphers.c#L409

Why qat_chained_callbackFn() is not called? Because the threshold in ngx_ssl_engine_qat_heuristic_poll() called from ngx_ssl_async_process_fds() would never be reached. https://github.com/intel/asynch_mode_nginx/blob/7f674ff25c47d7dd5e335bed7a3fd6154d735b2e/modules/nginx_qat_module/ngx_ssl_engine_qat_module.c#L726

Another chance for qat_engine_poll() is under nginx timer, but the deadloop let the timer invalid too.

according to that, I change the two threshold config (qat_heuristic_poll_asym_threshold, qat_heuristic_poll_sym_threshold) to 0, and it work around.

I wish I explain it clearly this time...

tony-caotong avatar Jun 30 '21 12:06 tony-caotong