asynch_mode_nginx
QAT Engine failed: HEARTBEAT_POLL
What is the problem
Async nginx with QAT configuration starts but is constantly logging the error:
[alert] 25453#0: QAT Engine failed: HEARTBEAT_POLL
System description
- cpu: Xeon Platinum 8480+
- system: Ubuntu 22.04.3
- kernel: 5.15.0-91-generic
- QAT OOT v. 20.L.1.0.50-00003
- OpenSSL version 3.0.2
- Async nginx v. 0.5.1
QAT configuration
[GENERAL]
ServicesEnabled = asym;dc
ConfigVersion = 2
#Default values for number of concurrent requests*/
CyNumConcurrentSymRequests = 512
CyNumConcurrentAsymRequests = 64
#Statistics, valid values: 1,0
statsGeneral = 1
statsDh = 1
statsDrbg = 1
statsDsa = 1
statsEcc = 1
statsKeyGen = 1
statsDc = 1
statsLn = 1
statsPrime = 1
statsRsa = 1
statsSym = 1
# Default heartbeat timer is 1s
HeartbeatTimer = 1000
# This flag is to enable SSF features
StorageEnabled = 0
# Disable public key crypto and prime number
# services by specifying a value of 1 (default is 0)
PkeServiceDisabled = 0
# This flag is to enable device auto reset on heartbeat error
AutoResetOnError = 0
# Default value for power management idle interrupt delay
PmIdleInterruptDelay = 512
# This flag is to enable power management idle support
PmIdleSupport = 1
# This flag is to enable key protection technology
KptEnabled = 1
# Define the maximum SWK count per function can have
# Default value is 1, the maximum value is 128
KptMaxSWKPerFn = 1
# Define the maximum SWK count per pasid can have
# Default value is 1, the maximum value is 128
KptMaxSWKPerPASID = 1
# Define the maximum SWK lifetime in second
# Default value is 0 (eternal of life)
# The maximum value is 31536000 (one year)
KptMaxSWKLifetime = 31536000
# Flag to define whether to allow SWK to be shared among processes
# Default value is 0 (shared mode is off)
KptSWKShared = 0
# Disable AT
ATEnabled = 0
##############################################
# Kernel Instances Section
##############################################
[KERNEL]
NumberCyInstances = 0
NumberDcInstances = 0
##############################################
# ADI Section for Scalable IOV
##############################################
[SIOV]
NumberAdis = 0
##############################################
# User Process Instance Section
##############################################
[SHIM]
NumberCyInstances = 1
NumberDcInstances = 1
NumProcesses = 32
LimitDevAccess = 1
# Crypto - User instance #0
Cy0Name = "UserCY0"
Cy0IsPolled = 1
# List of core affinities
Cy0CoreAffinity = 0
# Crypto - Data compression instance #0
Dc0Name = "UserDC0"
Dc0IsPolled = 1
# List of core affinities
Dc0CoreAffinity = 0
# Crypto - User instance #1
Cy1Name = "UserCY1"
Cy1IsPolled = 1
# List of core affinities
Cy1CoreAffinity = 1
# Crypto - User instance #2
Cy2Name = "UserCY2"
Cy2IsPolled = 1
# List of core affinities
Cy2CoreAffinity = 2
# Crypto - User instance #3
Cy3Name = "UserCY3"
Cy3IsPolled = 1
# List of core affinities
Cy3CoreAffinity = 3
Nginx configuration
worker_processes 224;
# TODO: possibly change workers to non-root
# This setting was made because otherwise `nobody` is worker owner
# and nginx cannot access html file due to lack of access to intermediate directory
# [~] Following line is to adjust settings from repository
user root root;
worker_rlimit_nofile 32000;
load_module modules/ngx_http_qatzip_filter_module.so;
load_module modules/ngx_ssl_engine_qat_module.so;
events {
use epoll;
worker_connections 102400;
accept_mutex off;
}
# Enable QAT engine in heuristic mode.
ssl_engine {
use_engine qatengine;
default_algorithms RSA,EC,DH,DSA;
qat_engine {
qat_offload_mode async;
qat_notify_mode poll;
qat_poll_mode heuristic;
qat_sw_fallback on;
}
}
http {
gzip on;
gzip_min_length 128;
gzip_comp_level 1;
gzip_types text/css text/javascript text/xml text/plain text/x-component application/javascript application/json application/xml application/rss+xml font/truetype font/opentype application/vnd.ms-fontobject image/svg+xml;
gzip_vary on;
gzip_disable "msie6";
gzip_http_version 1.0;
qatzip_sw failover;
qatzip_min_length 128;
qatzip_comp_level 1;
qatzip_buffers 16 8k;
qatzip_types text/css text/javascript text/xml text/plain text/x-component application/javascript application/json application/xml application/rss+xml font/truetype font/opentype application/vnd.ms-fontobject image/svg+xml application/octet-stream image/jpeg;
qatzip_chunk_size 64k;
qatzip_stream_size 256k;
qatzip_sw_threshold 256;
# HTTP server with QATZip enabled.
server {
listen 80;
server_name localhost;
location / {
root html;
index index.html index.htm;
}
}
# HTTPS server with async mode.
server {
# If the QAT Engine is enabled, add `asynch` to the `listen` directive or add `ssl_asynch on;` to the context.
listen 443 ssl asynch;
server_name localhost;
ssl_protocols TLSv1.2;
ssl_certificate cert.pem;
ssl_certificate_key cert.key;
location / {
root html;
index index.html index.htm;
}
}
}
What is working
OpenSSL is sort of working with QAT:
$ openssl engine -t -c -v qatengine
(qatengine) Reference implementation of QAT crypto engine(qat_hw & qat_sw) v1.4.0
[RSA, AES-128-CBC-HMAC-SHA256, AES-256-CBC-HMAC-SHA256, id-aes128-GCM, id-aes192-GCM, id-aes256-GCM, TLS1-PRF, X25519, X448, SM2]
[ available ]
ENABLE_EXTERNAL_POLLING, POLL, SET_INSTANCE_FOR_THREAD,
GET_NUM_OP_RETRIES, SET_MAX_RETRY_COUNT, SET_INTERNAL_POLL_INTERVAL,
GET_EXTERNAL_POLLING_FD, ENABLE_EVENT_DRIVEN_POLLING_MODE,
GET_NUM_CRYPTO_INSTANCES, DISABLE_EVENT_DRIVEN_POLLING_MODE,
SET_EPOLL_TIMEOUT, SET_CRYPTO_SMALL_PACKET_OFFLOAD_THRESHOLD,
ENABLE_INLINE_POLLING, ENABLE_HEURISTIC_POLLING,
GET_NUM_REQUESTS_IN_FLIGHT, INIT_ENGINE, SET_CONFIGURATION_SECTION_NAME,
ENABLE_SW_FALLBACK, HEARTBEAT_POLL, DISABLE_QAT_OFFLOAD, HW_ALGO_BITMAP,
SW_ALGO_BITMAP
803B6038F27F0000:error:1280006A:DSO support routines:dlfcn_bind_func:could not bind to the requested symbol name:../crypto/dso/dso_dlfcn.c:188:symname(EVP_PKEY_base_id): /usr/local/ssl/lib64/engines-3/qatengine.so: undefined symbol: EVP_PKEY_base_id
803B6038F27F0000:error:1280006A:DSO support routines:DSO_bind_func:could not bind to the requested symbol name:../crypto/dso/dso_lib.c:176:
$ openssl speed -engine qatengine -elapsed -async_jobs 72 rsa2048
Engine "qatengine" set.
You have chosen to measure elapsed time instead of user CPU time.
Doing 2048 bits private rsa's for 10s: 311671 2048 bits private RSA's in 10.01s
Doing 2048 bits public rsa's for 10s: 3722703 2048 bits public RSA's in 10.00s
version: 3.0.2
built on: Fri Oct 13 12:02:49 2023 UTC
options: bn(64,64)
compiler: gcc -fPIC -pthread -m64 -Wa,--noexecstack -Wall -Wa,--noexecstack -g -O2 -ffile-prefix-map=/build/openssl-8L8jlV/openssl-3.0.2=. -flto=auto -ffat-lto-objects -flto=auto -ffat-lto-objects -fstack-protector-strong -Wformat -Werror=format-security -DOPENSSL_TLS_SECURITY_LEVEL=2 -DOPENSSL_USE_NODELETE -DL_ENDIAN -DOPENSSL_PIC -DOPENSSL_BUILDING_OPENSSL -DNDEBUG -Wdate-time -D_FORTIFY_SOURCE=2
CPUINFO: OPENSSL_ia32cap=0x7ffef3ffffebffff:0xfb417ffef3bfb7ef
sign verify sign/s verify/s
rsa 2048 bits 0.000032s 0.000003s 31136.0 372270.3
Additional information
I need to run `export OPENSSL_ENGINES=/usr/local/ssl/lib64/engines-3`, otherwise openssl can't find the engine.
I was told that software fallback with heartbeat is not supported in QAT 2.0 driver v. 20.L.1.0.50-00003, so I should turn off `qat_sw_fallback` in nginx.conf.
So I changed that entry:
qat_sw_fallback off;
I did that, but there are still errors. nginx starts fine, but once I send the first request, I start getting the following errors, and they continue even after the request is cancelled:
QAT Engine failed: POLL
It looks like we identified the cause: the number of workers was much greater than the number of available QAT HW instances. As far as I understand, when the pool of QAT HW instances is exhausted, the rest of the nginx workers should receive QAT SW ones. But for some reason, that doesn't happen.
I enabled verbose QAT debug logs by adding `--enable-qat_debug` to `./configure` of the QAT engine. The engine then logged everything to nginx's error.log file, and we were able to spot the lines that explain the POLL error:
[WARN][2332072.319774] PID [179324] Thread [7f0ae03ad740][e_qat.c:742:qat_engine_ctrl()] POLL failed as no instances are available
2024/01/04 16:17:42 [alert] 179324#0: QAT Engine failed: POLL
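To see how many workers are affected, the alert lines in error.log can be tallied per worker PID. The following is only a rough sketch: `count_engine_failures` and `ALERT_RE` are hypothetical names, and the regular expression assumes the alert format matches the two alert lines quoted in this issue.

```python
import re
from collections import Counter

# Assumed format, taken from the alert lines quoted in this issue:
#   [alert] <pid>#<tid>: QAT Engine failed: <COMMAND>
ALERT_RE = re.compile(r"\[alert\] (\d+)#\d+: QAT Engine failed: (\w+)")


def count_engine_failures(lines):
    """Count 'QAT Engine failed' alerts per (worker PID, engine command)."""
    counts = Counter()
    for line in lines:
        match = ALERT_RE.search(line)
        if match:
            pid, command = int(match.group(1)), match.group(2)
            counts[(pid, command)] += 1
    return counts


SAMPLE_LOG = [
    "2024/01/04 16:17:42 [alert] 179324#0: QAT Engine failed: POLL",
    "2024/01/04 16:17:42 [alert] 179324#0: QAT Engine failed: POLL",
    "[alert] 25453#0: QAT Engine failed: HEARTBEAT_POLL",
]

if __name__ == "__main__":
    for (pid, command), n in sorted(count_engine_failures(SAMPLE_LOG).items()):
        print(f"worker {pid}: {n} x {command}")
```

On a real system the input would be the lines of nginx's error.log rather than `SAMPLE_LOG`.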
A temporary solution for now is to lower the number of workers (`worker_processes` in nginx.conf) to match the number of QAT HW instances. In my example, the SHIM section of the QAT driver conf (/etc/4xxx_dev0.conf) has the following lines:
[SHIM]
NumberCyInstances = 1
NumberDcInstances = 1
NumProcesses = 32
LimitDevAccess = 1
`NumberCyInstances` x `NumProcesses` = 32. On a 2-socket system with 2 CPUs and QAT modules, we have 32 x 2 = 64. So 64 is the maximum number of workers for which there are enough QAT HW instances.
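The arithmetic above can be sketched as a small script that reads the SHIM section and multiplies it out. This is a minimal sketch, not part of any QAT tooling: `hw_instance_capacity` and `num_devices` are hypothetical names, and it assumes every QAT device uses an identical SHIM section, as in the 2-device example described here.

```python
import configparser


def hw_instance_capacity(conf_text, num_devices=1, section="SHIM"):
    """Return NumberCyInstances * NumProcesses * num_devices for one conf file.

    Assumption: all devices share the same values in the given section.
    """
    parser = configparser.ConfigParser()
    parser.read_string(conf_text)
    cy_per_process = parser.getint(section, "NumberCyInstances")
    processes = parser.getint(section, "NumProcesses")
    return cy_per_process * processes * num_devices


# The SHIM lines quoted above from /etc/4xxx_dev0.conf.
SAMPLE_CONF = """\
[SHIM]
NumberCyInstances = 1
NumberDcInstances = 1
NumProcesses = 32
LimitDevAccess = 1
"""

if __name__ == "__main__":
    capacity = hw_instance_capacity(SAMPLE_CONF, num_devices=2)
    print(f"max workers with a QAT HW crypto instance: {capacity}")
```

With the values from this issue the result is 64, well below the `worker_processes 224` in the nginx.conf shown earlier.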
@kkurzacz-intel The issue is in qatengine when run with external polling: it tries to poll an instance for heartbeat on behalf of a worker process that has no qat_hw instance and should do qat_sw polling only. We will fix it in qatengine.
That being said, in addition to the workaround you mentioned, here are 2 other alternatives.
- For the external polling mode, set qat_sw_fallback to off or remove the qat_sw_fallback parameter from nginx.conf (it is off by default). This turns off heartbeat polling and disables fallback on device failure, but workers can still fall back to qat_sw when there is no instance.
- Set the polling mode to internal (qat_poll_mode internal;) in nginx.conf. Polling is then handled by the engine, which creates an internal polling thread for the available instances for all the workers. This way you don't have to reduce the number of workers.
Please let us know if that works.
The issue mentioned here is fixed by the commit below in QAT Engine and released in QAT Engine v1.6.0: https://github.com/intel/QAT_Engine/commit/3a1fca3138c96054721bebe19861b0cd6dc449af. Hence closing this.