Memory usage increases every time tls.reload is executed
Description
We are using Kamailio 5.7.4 on Debian 12 (from http://deb.kamailio.org/kamailio57) with rtpengine as an edge proxy for our clients. The instance terminates SIP/TLS (with client certificates) and forwards the SIP traffic to internal systems.
After some days we get errors like this:
tls_complete_init(): tls: ssl bug #1491 workaround: not enough memory for safe operation: shm=7318616 threshold1=8912896
At first we thought Kamailio simply did not have enough memory, so we doubled it.
But after some days the log message (and the user issues) occurred again.
So we monitored the shmmem statistics and found that used and max_used grow constantly until they reach the limit.
We are using client certificates and therefore also the CRL feature. A systemd timer fetches the CRL every hour and runs 'kamcmd tls.reload' when finished.
Our tls.cfg looks like this:
[server:default]
method = TLSv1.2+
private_key = /etc/letsencrypt/live/hostname.de/privkey.pem
certificate = /etc/letsencrypt/live/hostname.de/fullchain.pem
ca_list = /etc/kamailio/ca_list.pem
ca_path = /etc/kamailio/ca_list.pem
crl = /etc/kamailio/combined.crl.pem
verify_certificate = yes
require_certificate = yes
[client:default]
verify_certificate = yes
require_certificate = yes
After some testing we found that every time tls.reload is executed, Kamailio consumes a bit more shared memory. Eventually all of the memory is consumed, which causes issues for our users.
See the following example:
[0][root@edgar-dev:~]# while true ; do /usr/sbin/kamcmd tls.reload ; /usr/sbin/kamcmd core.shmmem ; sleep 1 ; done
Ok. TLS configuration reloaded.
{
total: 268435456
free: 223001520
used: 41352552
real_used: 45433936
max_used: 45445968
fragments: 73
}
Ok. TLS configuration reloaded.
{
total: 268435456
free: 222377960
used: 41975592
real_used: 46057496
max_used: 46069232
fragments: 78
}
Ok. TLS configuration reloaded.
{
total: 268435456
free: 221748664
used: 42604992
real_used: 46686792
max_used: 46698080
fragments: 77
}
Ok. TLS configuration reloaded.
{
total: 268435456
free: 221110832
used: 43242408
real_used: 47324624
max_used: 47335608
fragments: 81
}
^C
[130][root@edgar-dev:~]#
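To put a number on the growth, the used values from output like the above can be turned into per-reload deltas. The following is only a minimal sketch and not part of the original report: the helper names shm_used and shm_deltas are made up for illustration, and it assumes that kamcmd core.shmmem prints one "name: value" pair per line as shown above.

```shell
#!/bin/sh
# Extract the "used" value from `kamcmd core.shmmem` output read on stdin.
# Matching the whole first field skips "real_used:" and "max_used:".
shm_used() {
    awk '$1 == "used:" { print $2 }'
}

# Read one "used" value per line and print the difference between each
# value and the previous one (one delta per line).
shm_deltas() {
    awk 'NR > 1 { print $1 - prev } { prev = $1 }'
}

# Usage sketch (assumes kamcmd is in PATH and can reach the running instance):
#   for i in 1 2 3 4; do
#       kamcmd tls.reload >/dev/null
#       kamcmd core.shmmem | shm_used
#   done | shm_deltas
```

Fed with the four used values from the example above, this gives deltas of 623040, 629400 and 637416 bytes, so roughly 0.6 MB of shared memory is lost per reload on that system.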
Troubleshooting
Reproduction
Every time tls.reload is called, the memory consumption grows.
Debugging Data
If you let me know what would be interesting for tracking this down, I am happy to provide logs/debugging data!
Log Messages
If you let me know what would be interesting for tracking this down, I am happy to provide logs/debugging data!
SIP Traffic
SIP traffic doesn't seem to be relevant here.
Possible Solutions
Call tls.reload less often, or restart Kamailio before the memory is consumed ;)
Additional Information
version: kamailio 5.7.4 (x86_64/linux)
flags: USE_TCP, USE_TLS, USE_SCTP, TLS_HOOKS, USE_RAW_SOCKS, DISABLE_NAGLE, USE_MCAST, DNS_IP_HACK, SHM_MMAP, PKG_MALLOC, MEM_JOIN_FREE, Q_MALLOC, F_MALLOC, TLSF_MALLOC, DBG_SR_MEMORY, USE_FUTEX, FAST_LOCK-ADAPTIVE_WAIT, USE_DNS_CACHE, USE_DNS_FAILOVER, USE_NAPTR, USE_DST_BLOCKLIST, HAVE_RESOLV_RES, TLS_PTHREAD_MUTEX_SHARED
ADAPTIVE_WAIT_LOOPS 1024, MAX_RECV_BUFFER_SIZE 262144, MAX_URI_SIZE 1024, BUF_SIZE 65535, DEFAULT PKG_SIZE 8MB
poll method support: poll, epoll_lt, epoll_et, sigio_rt, select.
id: unknown
compiled with gcc 12.2.0
- Operating System:
* Debian GNU/Linux 12 (bookworm)
* Linux edgar-dev 6.1.0-20-amd64 #1 SMP PREEMPT_DYNAMIC Debian 6.1.85-1 (2024-04-11) x86_64 GNU/Linux
I just realized that I forgot to mention: in addition to the logged error message, our clients start to get connection issues as well, so we have to restart Kamailio as soon as possible in that case.
@denzs do you have a monitoring tool? Prometheus + Grafana graphs?
This part probably has to be reviewed. First, the TLS reload was initially designed to be done rather rarely, when the certificates expire. The CRL feature was also not much in use, at least in my experience so far; most deployments use server-side certificates only.
Furthermore, I am not sure if old certificates can be cleared right away after the restart, existing connections are not closed and there might be some references to their certificates.
Are you doing the reload only if the content of the CRL or certificate files has changed? Or is the reload done regardless?
@sergey-safarov yes we do :)
@miconda at the moment we do the tls.reload unconditionally and quite frequently to ensure the CRLs are up to date. Of course we can check whether the CRL changed, but from my point of view that would only delay the necessary restart of Kamailio.
This Screenshot is from our dev environment (with no tls-clients connected) running:
while true ; do /usr/sbin/kamcmd tls.reload ; /usr/sbin/kamcmd tls.reload ; sleep 0.5 ; done
Watching the core.shmmem output in parallel looks like this:
Ok. TLS configuration reloaded.
{
total: 268435456
free: 1894256
used: 262444424
real_used: 266541200
max_used: 266550968
fragments: 85
}
error: 500 - Error while fixing TLS configuration (consult server log)
{
total: 268435456
free: 1208784
used: 263491296
real_used: 267226672
max_used: 268435208
fragments: 11749
}
Ok. TLS configuration reloaded.
{
total: 268435456
free: -9223372036854776
used: 267589696
real_used: 271686888
max_used: 271696928
fragments: 87
}
Could you compare it with a graph for our server for last 60 days and about 25 WebRTC clients?
Here we use Kamailio 5.7.2 with a Let's Encrypt server certificate. The certificate is reloaded once every two months. We do not use a CRL. To avoid reloading certificates too often, we compare the currently used certificates with the latest ones using commands like:
# Dry run: list the files that differ between source and target.
rsync -l --recursive --info=name --dry-run "${LECRTSDIR}" "${LETARGETDIR}" >"${CHKUPDLOG}"
# Synchronizing certificates.
if [ ! -s "${CHKUPDLOG}" ]; then
    echo "Check updates. No changes required"
    rm -f "${CHKUPDLOG}"
else
    echo "Has new certificates. Start sync"
    rsync -azlcv --recursive --delete --info=name "${LECRTSDIR}" "${LETARGETDIR}" >"${SYNCLOG}"
    rm -f "${CHKUPDLOG}"
fi
The problem actually occurred only after we added the CRL some weeks ago; without the CRL there was no such behaviour. Of course there are a lot of options to mitigate the issue, or rather to decrease its probability, by doing fewer reloads: reducing the reload frequency and/or checking whether the CRL changed at all.
Anyhow, I thought raising an issue makes sense, because from my point of view there is definitely some memory leaking when using tls.reload in combination with a CRL.
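For the "check whether the CRL changed at all" mitigation, one possible shape is to fetch the CRL to a temporary file and only replace the active file (and reload) when the content differs. This is a minimal sketch under assumptions: the function name install_crl_if_changed and the fetched-file path in the usage comment are hypothetical, and the fetch step itself is left out. As noted above, this only reduces how often the leak is triggered; it does not fix it.

```shell
#!/bin/sh
# Compare a freshly fetched CRL with the one currently in use.
# Returns 0 if the active file was created or replaced,
# 1 if the content is identical and nothing was changed.
install_crl_if_changed() {
    fetched=$1
    active=$2
    if [ -f "$active" ] && cmp -s "$fetched" "$active"; then
        return 1
    fi
    # Copy next to the target first, then rename, so a reader never
    # sees a half-written CRL file.
    cp "$fetched" "$active.tmp" && mv "$active.tmp" "$active"
}

# Usage sketch (the path of the fetched file is a placeholder):
#   if install_crl_if_changed /tmp/fetched.crl.pem /etc/kamailio/combined.crl.pem; then
#       kamcmd tls.reload
#   fi
```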
If it happens only with adding a CRL, it looks indeed like an issue in this code path. In the end using CRL is probably quite rare.
After some time debugging, I could replicate this issue of memory increase when using a CRL and tls.reload.
With while true ; do /usr/sbin/kamcmd tls.reload ; /usr/sbin/kamcmd tls.reload ; sleep 0.5 ; done running, the frequently printed memory statistics point to one possible culprit:
INFO: qm_sums: qm_sums(): count= 5288 size= 183440 bytes from tls: tls_init.c: ser_realloc(372)
INFO: qm_sums: qm_sums(): count= 17378 size= 1275712 bytes from tls: tls_init.c: ser_malloc(364)
---
INFO: qm_sums: qm_sums(): count= 5341 size= 242768 bytes from tls: tls_init.c: ser_realloc(372)
INFO: qm_sums: qm_sums(): count= 17325 size= 1381936 bytes from tls: tls_init.c: ser_malloc(364)
---
INFO: qm_sums: qm_sums(): count= 5331 size= 248544 bytes from tls: tls_init.c: ser_realloc(372)
INFO: qm_sums: qm_sums(): count= 17335 size= 1422112 bytes from tls: tls_init.c: ser_malloc(364)
---
INFO: qm_sums: qm_sums(): count= 5360 size= 290560 bytes from tls: tls_init.c: ser_realloc(372)
INFO: qm_sums: qm_sums(): count= 17306 size= 1466000 bytes from tls: tls_init.c: ser_malloc(364)
Memory here increases until we exhaust the shared memory max allocation and then tls.reload fails.
Some notes: When using tls.reload without a CRL, I didn't see any notable increase in memory usage. The above-noted allocations are steady around
count= 9415 size= 948432 bytes from tls: tls_init.c: ser_malloc(364)
count= 1011 size= 151408 bytes from tls: tls_init.c: ser_realloc(372)
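To follow these two allocation sites over time, the size= values of the tls_init.c lines in one statistics dump can be added up. The following is a minimal sketch, assuming the log lines keep the "size= <bytes>" layout shown above; the function name tls_init_bytes is made up for illustration.

```shell
#!/bin/sh
# Read memory statistics lines on stdin and print the total number of bytes
# attributed to tls_init.c (ser_malloc and ser_realloc together).
tls_init_bytes() {
    awk '/tls_init\.c/ {
        # The byte count is the field that follows the literal "size=".
        for (i = 1; i < NF; i++)
            if ($i == "size=") total += $(i + 1)
    } END { print total + 0 }'
}

# Usage sketch: save one statistics dump to a file and run
#   tls_init_bytes < one_dump.log
```

Applied to the first and the last dump quoted above, this gives 1459152 and 1756560 bytes, compared with 1099840 bytes for the steady state without a CRL.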
This issue is stale because it has been open 6 weeks with no activity. Remove stale label or comment or this will be closed in 2 weeks.
Although it is quite easy to monitor and work around this issue, I still think it is a valid bug :)
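As an illustration of such monitoring, a check along the following lines could warn before free shared memory drops below the threshold1 value from the log message in the description (8912896 bytes). This is only a sketch: the function name shm_free_below is hypothetical, and the right threshold and the reaction (alerting, planned restart) depend on the deployment.

```shell
#!/bin/sh
# Read `kamcmd core.shmmem` output on stdin and succeed (exit status 0)
# if the reported "free" value is below the given limit in bytes.
# Fails if free is at or above the limit, or if no "free:" line is found.
shm_free_below() {
    awk -v limit="$1" '
        $1 == "free:" { found = 1; low = ($2 + 0 < limit + 0) }
        END { if (found && low) exit 0; exit 1 }
    '
}

# Usage sketch (assumes kamcmd is in PATH and can reach the running instance):
#   if kamcmd core.shmmem | shm_free_below 8912896; then
#       echo "kamailio: free shared memory below TLS safety threshold" >&2
#   fi
```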
Just for reference, this was discussed on the developer list, thread: https://lists.kamailio.org/mailman3/hyperkitty/list/[email protected]/message/AJMGLWJNQGA6C7SKLVQEXI5RFRRRWBN2/
Are there any news/intentions on merging the branch from xkaraman? :)
Hey @denzs,
it's been some time since I last checked this, sorry.
There was a discussion about introducing a parameter for this change. I will try to implement it asap, so I can create a PR for this and restart the discussion!
Thanks for your patience, Xenofon
@xkaraman thank you so much! I did not want to rush you, I just wanted to prevent this issue from being auto-closed :)
Hey @denzs,
I have just created https://github.com/kamailio/kamailio/pull/3972 for this.
Can you maybe check whether kamailio still functions as intended (other than the tls.reload) with the new shared context stuff?
After applying the patch, set the new tls parameter enable_shared_ctx to 1 in the config file and you are good to go.
Any feedback is welcome!
@xkaraman thank you so much for taking care of this! :)
I tested your branch on our dev instance, the normal functions are doing fine so far :+1:
But switching enable_shared_ctx from 0 to 1 only seems to delay the memory leak:
The first 5 minutes are with enable_shared_ctx=0 and the rest with enable_shared_ctx=1.
During the last 5 minutes I stopped the tls.reload to see if memory consumption would decrease again, but that is not the case.
Tested with: while true ; do /usr/sbin/kamcmd tls.reload ; sleep 0.5 ; done
Hey @denzs,
Thanks for testing this out.
As we discussed on the mailing list and as noted in the PR, this patch is indeed not adequate to fix the actual problem. I was trying to lower the memory usage and hoped that the increase would no longer be noticeable (clearly not the case from your report).
The problem seems to be in SSL_CTX_load_verify_locations and its usage in load_crl(). I will keep digging and see if there is something to be done to actually free the memory.
Just for reference, which OpenSSL version are you testing this with?
@xkaraman thanks for your feedback :) It is a Debian 12 system with:
ii libssl-dev:amd64 3.0.14-1~deb12u2 amd64 Secure Sockets Layer toolkit - development files
ii libssl3:amd64 3.0.14-1~deb12u2 amd64 Secure Sockets Layer toolkit - shared libraries
ii openssl 3.0.14-1~deb12u2 amd64 Secure Sockets Layer toolkit - cryptographic utility
Just a 'ping' to prevent the bot from closing the issue.. :)