
Title: Memory leak after ~48h when using duplicate-cn with TLS renegotiation (reneg-sec) enabled

Open yvanitou-amical opened this issue 2 months ago • 9 comments

Describe the bug

The OpenVPN server process memory usage increases continuously (trend consistent with exponential growth) over approximately 48 hours while the number of connected clients remains stable (~80 clients). The server process becomes unresponsive (the management interface stops responding) and must be restarted. No explicit errors appear in the logs prior to the loss of responsiveness.


To Reproduce

  1. Build OpenVPN 2.6.14 from the official release tarball and run it on an Alpine Linux 3.21 virtual machine (musl libc), with configure options similar to:

    ./configure \
      --enable-lzo \
      --enable-lz4 \
      --enable-plugins \
      --enable-port-share \
      --enable-iproute2
    

    Runtime library versions (per environment banners): OpenSSL 3.3.5, LZO 2.10, LZ4 1.9.x.

  2. Server configuration includes (abridged):

    port 1194
    proto udp
    dev tun
    server 172.27.224.0 255.255.240.0
    topology subnet
    
    duplicate-cn
    ifconfig-pool-persist /var/lib/openvpn/ipp.txt
    keepalive 10 120
    reneg-sec 3600
    
    mssfix 1420
    client-to-client
    data-ciphers AES-256-GCM:AES-256-CBC:CHACHA20-POLY1305:AES-128-GCM
    data-ciphers-fallback AES-256-GCM
    auth SHA256
    
    tls-version-min 1.2
    tls-version-max 1.3
    tls-crypt /etc/openvpn/pki/tls-crypt.key
    
    sndbuf 0
    rcvbuf 0
    fast-io
    
    user openvpn
    group openvpn
    persist-key
    persist-tun
    
    verb 3
    management 0.0.0.0 7505
    

    (Startup warning observed: --ifconfig-pool-persist will not work with --duplicate-cn.)

  3. Maintain approximately 80 clients connected continuously using the same client certificate (so duplicate-cn is in effect).

  4. Monitor the server process memory over time using standard host tools (a minimal sampling sketch follows this list):

    ps -o pid,comm,rss | grep openvpn   # BusyBox ps; RSS in KB
    

    or equivalent host-level monitoring.

  5. After ~24–48 hours, observe continuous memory growth (trend consistent with exponential increase) while client count remains steady. The server process becomes unresponsive and requires a restart.
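For step 4, one possible periodic sampler (a sketch only: it assumes the daemon runs as PID 1, as in the /proc/1 excerpts later in this thread, and the log path is just an example):

    # Sample VmRSS/VmData from /proc every 5 minutes with a UTC timestamp
    while true; do
      echo "$(date -u '+%Y-%m-%d %H:%M:%S') $(awk '/^Vm(RSS|Data)/ {printf "%s %s %s  ", $1, $2, $3}' /proc/1/status)" >> /var/log/openvpn-mem.log
      sleep 300
    done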


Expected behavior

Server memory usage should remain stable over long uptimes with a steady number of connected clients. Normal client activity (connect/disconnect, TLS handshakes) should not cause unbounded memory growth or loss of responsiveness.


Version information (please complete the following information):

  • OS: Alpine Linux 3.21 (musl) running on a virtual machine
  • Architecture: aarch64
  • OpenVPN version: 2.6.14 (built from official release tarball)
  • OpenSSL: 3.3.5
  • Peer version (if relevant): OpenVPN 2.6.14
  • Transport: UDP

Additional context

  • The number of connected clients is stable (~80) during the observation window; the increase in memory is not due to added clients.
  • Logs show regular client activity and repeated MULTI: Learn: entries for existing sessions; no explicit errors precede the loss of responsiveness.
  • The management interface (bound to 0.0.0.0:7505) becomes non-responsive before the restart.
  • The issue reproduces consistently under the conditions described.

yvanitou-amical avatar Nov 04 '25 19:11 yvanitou-amical

This is hard to diagnose without additional details. We have similar setups that run for days and weeks without becoming unresponsive.

Also, you seem to have two issues that might or might not be related: the memory leak, and the management interface becoming unresponsive. And what kind of memory leak/growth are we talking about here?

What exactly are the symptoms of the management interface no longer being responsive? Do you stop getting notifications from OpenVPN, or does the management interface stop responding to your commands?

Providing some of the logs might help us at least make some guesses.

repeated MULTI: Learn: entries for existing sessions

Could you give some more details on what the repeated log entries are? Having them repeat for an ongoing session is rather unexpected.

data-ciphers-fallback AES-256-GCM

This is nonsense. You are telling OpenVPN to fall back to a cipher that the clients that do not support cipher negotiation do not support either.
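(For illustration, a hedged sketch of how the directive is meant to be used: data-ciphers-fallback only matters for peers that cannot negotiate a cipher, so it should name whatever those legacy clients actually have configured, for example:)

    # Negotiated ciphers for NCP-capable clients
    data-ciphers AES-256-GCM:CHACHA20-POLY1305:AES-128-GCM
    # Fallback applies only to clients without cipher negotiation; it must
    # match the legacy client's configured --cipher (historically often AES-256-CBC)
    data-ciphers-fallback AES-256-CBC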

schwabe avatar Nov 05 '25 11:11 schwabe

My gut feeling is that something in duplicate-cn is not being cleaned up properly on reconnecting with the same cert over and over.

One of these edge cases that should have never been implemented in the first place...

cron2 avatar Nov 05 '25 12:11 cron2

It might also be something like a lot of connects/disconnects from the management interface that cause the issue. We need more information to understand what is going on.

schwabe avatar Nov 05 '25 13:11 schwabe

Update – consolidated data (IPs redacted, no configuration changes)

Keeping the same process running to preserve state; no restart or config change has been made. The next crash normally happens after ~48h of uptime; when it does, I'll capture the same datasets right before and after the freeze to document how the management interface behaves.


🖥️ Host / build context

Linux ... 6.12.46 #1 SMP Fri Oct  3 18:06:14 UTC 2025 aarch64 Linux
OpenVPN 2.6.14 aarch64-unknown-linux-musl [SSL (OpenSSL)] [LZO] [LZ4] [EPOLL] [MH/PKTINFO] [AEAD]
library versions: OpenSSL 3.3.5 30 Sep 2025, LZO 2.10
Compile-time options include (abridged): enable_lz4=yes, enable_lzo=yes, enable_management=yes, enable_iproute2=yes
Alpine Linux 3.21.5 (musl 1.2.5)

📈 Process memory and runtime (~19h uptime, same process)

ps -o pid,comm,rss,vsz,etime | grep openvpn
1 openvpn 30m 32m 19h33
/proc/1/status (abridged)
VmSize:   33688 kB
VmRSS:    31408 kB
VmData:   26516 kB
Threads:  1
/proc/1/limits (abridged)
Max resident set          unlimited
Max processes             unlimited
Max open files            65536 / 1048576

🧵 Threads / LWPs snapshot

/proc/1/status → Threads: 1
ps -T | wc -l → 5973

(Shows a single thread reported in /proc, but nearly 6000 LWPs visible via ps -T.)
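A quick way to check where those LWPs actually live (a sketch; PID 1 assumed as in the excerpts above):

    # Authoritative per-process thread count
    ls /proc/1/task | wc -l
    # BusyBox `ps -T` with no filter lists every LWP on the host, so piping
    # it to wc -l counts system-wide threads, not just OpenVPN's
    ps -T | wc -l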


⚙️ Management interface (IPs and real addresses redacted)

status
Updated: 2025-11-05 14:07:44
GLOBAL STATS
Max bcast/mcast queue length,98
END
load-stats
SUCCESS: nclients=93, bytesin=484940838, bytesout=244827065
  • Client count stable (~93).
  • Routing table and connection times unchanged since the previous day.
  • When the issue occurs (around 48h mark), the management socket typically stops replying to status or load-stats commands — I will confirm this on the next occurrence.

🔁 Recent TLS activity (redacted for IPs)

During this period, many peers show periodic TLS: soft reset events followed by VERIFY OK and full revalidation:

2025-11-05 14:01:54 [CN=grpXXXX] TLS: soft reset sec=3262/3262
2025-11-05 14:02:24 [CN=grpXXXX] TLS: soft reset sec=3510/3510
2025-11-05 14:02:29 [CN=grpXXXX] TLS: soft reset sec=3344/3343
2025-11-05 14:02:37 [CN=grpXXXX] TLS: soft reset sec=3320/3319
2025-11-05 14:02:40 [CN=grpXXXX] TLS: soft reset sec=3511/3510
...

Each reset is followed by VERIFY OK and Control Channel: TLSv1.3, cipher TLS_AES_256_GCM_SHA384.


🔁 Repeated MULTI: Learn entries (sanitized examples)

Several peers show recurring IP learn messages for sessions that stay connected:

2025-11-05 13:36:xx [CN=grpXXXX] MULTI: primary virtual IP ...: 172.27.224.xx  [real addr redacted]
2025-11-05 14:07:xx [CN=grpXXXX] MULTI: primary virtual IP ...: 172.27.224.xx  [real addr redacted]

These repeat for the same CN and internal IP even though no explicit reconnect is logged.
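One hedged way to quantify the repeats per CN from the server log (the log path is an assumption; adjust it to the real destination, and note the [CN=...] tag reflects the sanitized format shown above):

    # Count "MULTI: primary virtual IP" lines per CN; more than one per
    # connection lifetime would confirm the unexpected repetition
    grep 'MULTI: primary virtual IP' /var/log/openvpn/server.log \
      | sed 's/.*\[\(CN=[^]]*\)\].*/\1/' \
      | sort | uniq -c | sort -rn | head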


Summary

  • Memory and thread count continue to grow while the number of connected clients remains constant (~93).
  • Management interface currently responds normally; typically becomes unresponsive after ~48h.
  • Multiple periodic TLS soft resets and repeated MULTI: Learn entries for existing sessions observed.
  • I’ll provide updated metrics and management output at the next unresponsive event.

yvanitou-amical avatar Nov 05 '25 14:11 yvanitou-amical

thread count continue to grow

This is very peculiar. OpenVPN itself is single-threaded. And by the LWP thread count, you mean something like ps -auxT | grep openvpn, right? I wonder what the 6k threads you have on your system are.

2025-11-05 13:36:xx [CN=grpXXXX] MULTI: primary virtual IP ...: 172.27.224.xx [real addr redacted]

There is only one occurrence of that string in OpenVPN ("MULTI: primary virtual IP for %s: %s") and that is in multi_client_connect_late_setup, which gets executed only once per client connect.

So this should not be repeated more than once per client.

Could you show the full server config and logs?

Are you using any plugins in OpenVPN?

schwabe avatar Nov 05 '25 14:11 schwabe

Thanks @schwabe for the clarification regarding data-ciphers-fallback. You're absolutely right — I had originally added it a while ago when testing a few older client devices, but all current clients now support GCM.

I’ve removed the fallback directive from the configuration and will redeploy it after the next crash occurs (to keep the current process running until then). Appreciate the recommendation!

yvanitou-amical avatar Nov 05 '25 14:11 yvanitou-amical

Thanks, @schwabe — here is a complete, sanitized follow-up with the concrete data you asked for. No config changes were made; I’m keeping the current process running to preserve state.


1) Thread count clarification (per-process vs system-wide)

My earlier “~6k threads” came from a system-wide BusyBox ps -T (my bad). Per-process LWP/thread counts for the OpenVPN daemon are:

/proc/1/status → Threads: 1
ls -1 /proc/1/task | wc -l → 1

So OpenVPN itself is single-threaded here, as expected.


2) Server configuration (abridged, sanitized)

No plugins and no hook scripts are used.

port 1194
proto udp
dev tun

ca /etc/openvpn/pki/ca.crt
cert /etc/openvpn/pki/issued/server.crt
key /etc/openvpn/pki/private/server.key
dh none

server 172.27.224.0 255.255.240.0
ifconfig-pool-persist /var/lib/openvpn/ipp.txt
topology subnet
push "topology subnet"
mssfix 1420
push "dhcp-option DNS 172.20.0.10"
push "route 172.20.0.0 255.255.0.0"
push "route 10.10.0.0 255.255.0.0"
client-to-client
duplicate-cn

keepalive 10 120
allow-compression no
tls-crypt /etc/openvpn/pki/tls-crypt.key

data-ciphers AES-256-GCM:AES-256-CBC:CHACHA20-POLY1305:AES-128-GCM
# data-ciphers-fallback was present earlier for legacy tests; removed and will redeploy later
auth SHA256

tls-version-min 1.2
tls-version-max 1.3
tls-groups secp384r1:secp521r1:secp256r1:x25519
reneg-sec 3600

tls-cert-profile preferred
tls-cipher TLS-ECDHE-ECDSA-WITH-AES-256-GCM-SHA384:TLS-ECDHE-RSA-WITH-AES-256-GCM-SHA384:TLS-ECDHE-ECDSA-WITH-CHACHA20-POLY1305-SHA256:TLS-ECDHE-RSA-WITH-CHACHA20-POLY1305-SHA256
remote-cert-tls client

sndbuf 0
rcvbuf 0
fast-io

user openvpn
group openvpn
persist-key
persist-tun

management 0.0.0.0 7505
verb 3
explicit-exit-notify 1

Plugins/scripts: none (no --plugin, no client-connect, client-disconnect, learn-address, etc.).


3) Build / OS context

Linux … 6.12.46 aarch64
OpenVPN 2.6.14 aarch64-unknown-linux-musl
library versions: OpenSSL 3.3.5, LZO 2.10
Alpine Linux 3.21.5 (musl 1.2.5)

4) Management snapshots (status 3 + load-stats, sanitized)

(Real/remote addresses redacted; counts/fields kept.)

TITLE  OpenVPN 2.6.14 aarch64-unknown-linux-musl
TIME   2025-11-05 14:51:59

# CLIENT_LIST excerpt (CN, Virtual Address, Bytes, Connected Since, Cipher)
CLIENT_LIST  grp2601w-prod  172.27.224.xx  …  2025-11-04/05  …  AES-128-GCM
… (93 clients total; all show Data Channel Cipher = AES-128-GCM)

# ROUTING_TABLE excerpt (virtual + CN only)
ROUTING_TABLE  172.27.224.xx  grp2601w-prod   [redacted]  2025-11-05 14:51:xx
… (entries present for all connected clients)

GLOBAL_STATS
Max bcast/mcast queue length  98
dco_enabled                   0
END

load-stats
SUCCESS: nclients=93, bytesin=499927261, bytesout=255713146
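(For periodic collection without an interactive session, the management channel can be scripted; a sketch using BusyBox netcat against the management port from the config, with the caveat that exact handshake behavior may vary by nc implementation:)

    # Ask the management interface for load-stats, then close cleanly;
    # the short sleep gives the daemon time to answer before quit
    { printf 'load-stats\n'; sleep 1; printf 'quit\n'; } | nc 127.0.0.1 7505 | grep '^SUCCESS'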

5) Recent logs (sanitized)

TLS soft resets: I see periodic TLS: soft reset events for multiple peers, each followed by full VERIFY OK and TLS 1.3 control channel lines:

2025-11-05 14:43:37 [CN=grpXXXX] TLS: soft reset sec=3309/3309
2025-11-05 14:43:37 [CN=grpXXXX] VERIFY OK … ; Control Channel: TLSv1.3 TLS_AES_256_GCM_SHA384

2025-11-05 14:43:47 [CN=grpXXXX] TLS: soft reset sec=3424/3424
2025-11-05 14:43:55 [CN=grpXXXX] TLS: soft reset sec=3304/3304
2025-11-05 14:44:55 [CN=grpXXXX] TLS: soft reset sec=3307/3307
2025-11-05 14:45:09 [CN=grpXXXX] TLS: soft reset sec=3465/3465
… (similar pattern within the same window)

MULTI around a reconnect: A clear reconnect shows the expected sequence once per connect:

2025-11-05 14:48:27 [CN=grpXXXX] tls_multi_process: initial untrusted session promoted to trusted
2025-11-05 14:48:27 [CN=grpXXXX] MULTI_sva: pool returned IPv4=172.27.224.xx
2025-11-05 14:48:27 [CN=grpXXXX] MULTI: Learn: 172.27.224.xx -> [redacted]
2025-11-05 14:48:27 [CN=grpXXXX] MULTI: primary virtual IP for [CN]: 172.27.224.xx

Regarding your note that "MULTI: primary virtual IP …" is only logged once per client connect: understood. In this recent sample, I only see it once for that reconnect. I will keep watching to see if the same CN ever logs another primary virtual IP line without an intervening reconnect indication and will post that if it occurs.

yvanitou-amical avatar Nov 05 '25 15:11 yvanitou-amical

Additional factual observations (logs + metrics + periodicity)

  • Precise timestamp (UTC / local): At 2025-11-10 03:06:25 UTC (local Nov 09 22:06:26), a burst of consecutive events occurred:

    2025-11-10 03:06:25 SENT CONTROL [grp2601w-prod]: 'RESTART' (status=1)
    2025-11-10 03:06:25 SENT CONTROL [grp2601w-prod]: 'RESTART' (status=1)
    2025-11-10 03:06:25 SENT CONTROL [grp2601w-prod]: 'RESTART' (status=1)
    2025-11-10 03:06:25 SENT CONTROL [grp2601w-prod]: 'RESTART' (status=1)
    2025-11-10 03:06:25 event_wait : Interrupted system call (fd=-1,code=4)
    

    immediately preceded (per the timestamps, listed newest first) by multiple UDP connection refusals:

    2025-11-10 03:06:21 read UDPv4 [ECONNREFUSED]: Connection refused (fd=6,code=111)
    2025-11-10 03:06:11 read UDPv4 [ECONNREFUSED]: Connection refused (fd=6,code=111)
    2025-11-10 03:06:01 read UDPv4 [ECONNREFUSED]: Connection refused (fd=6,code=111)
    2025-11-10 03:05:51 read UDPv4 [ECONNREFUSED]: Connection refused (fd=6,code=111)
    2025-11-10 03:05:41 read UDPv4 [ECONNREFUSED]: Connection refused (fd=6,code=111)
    2025-11-10 03:05:30 read UDPv4 [ECONNREFUSED]: Connection refused (fd=6,code=111)
    
  • Metrics behavior (2-day window)

    • Memory usage starts at around 27 MiB and increases gradually up to 168 MiB before resetting every 24 hours.
    • CPU usage also increases, following an exponential curve, with peaks becoming more frequent right before each restart.
    • Other metrics (storage, network) remain stable and show no anomalies at the time of reset.
  • Periodicity: This pattern now occurs every 24 hours with the current load (~80 devices connected). Earlier in the summer, when fewer devices were connected, the interval was approximately 48 hours.

I’ve also shared the detailed metrics and graphs privately (CPU/memory curves and raw data) in case they help with further analysis.

yvanitou-amical avatar Nov 10 '25 15:11 yvanitou-amical

The "TLS: soft reset" messages happen because you ordered them - reneg-sec 3600 does exactly this: start a TLS renegotiation every 3600 seconds (+/- a bit of randomness).

The "burst of identical events" (RESTART) is basically explict-exit-notify 1 triggering, telling the clients that the server is going to exit now - once per client. Why this happens is in the few lines above that you did not include. So with these incomplete logs, we can't really do much of an analysis.

cron2 avatar Nov 10 '25 23:11 cron2