pkt_testdriver segmentation fault on Debian Unstable arm64 on AWS Graviton 7

Open mattock opened this issue 11 months ago • 10 comments
Describe the bug

The pkt_testdriver test fails with a segmentation fault when built from the latest OpenVPN Git "master" on Debian Unstable arm64. Debian Unstable on amd64 does not seem to be affected.

To Reproduce

This should be reproducible with the latest Debian Unstable container images from Docker Hub simply by running the usual build procedure followed by "make check". Alternatively, updating the Debian Unstable arm64 base image in Buildbot to the latest upstream (debian) version should trigger this behavior.

Expected behavior

The pkt_testdriver test should pass and not segfault.

Version information (please complete the following information):

  • OS: Debian Unstable arm64
  • OpenVPN version: Git master

Additional context

stdio.txt

mattock avatar Dec 04 '24 07:12 mattock

Can you run the testdriver from gdb & gather a backtrace ("where") when it crashes?

$ cd tests/unit_tests/openvpn
$ gdb pkt_testdriver
...
(gdb) run
...
(gdb) where

cron2 avatar Dec 04 '24 07:12 cron2

Sure, I will.

mattock avatar Dec 04 '24 10:12 mattock

As discussed in today's community meeting I'll test this immediately again and rebuild the Debian system. If the problem still exists I shall wait for a week or two. This could just be a result of a broken compiler or such which might be fixed or get a fix shortly - the packages in Debian unstable get updated very frequently.

mattock avatar Dec 04 '24 13:12 mattock

Rebuilding the Debian unstable arm64 container image from scratch did not help. I shall retry later.

mattock avatar Dec 05 '24 05:12 mattock

The failure is still present after a forced rebuild of the Debian unstable arm64 image. I'll go the gdb route.

mattock avatar Dec 10 '24 07:12 mattock

gdb logs:

(gdb) run
Starting program: /tmp/openvpn/tests/unit_tests/openvpn/pkt_testdriver 
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
[==========] pkt tests: Running 10 test(s).
[ RUN      ] test_tls_decrypt_lite_none
[       OK ] test_tls_decrypt_lite_none
[ RUN      ] test_tls_decrypt_lite_auth
[       OK ] test_tls_decrypt_lite_auth
[ RUN      ] test_tls_decrypt_lite_crypt

Program received signal SIGSEGV, Segmentation fault.
0x0000b5a4e11745c8 in tls_pre_decrypt_lite (tas=tas@entry=0xffffd7ab2670, state=state@entry=0xffffd7ab2420, from=from@entry=0xffffd7ab23f0, buf=buf@entry=0xffffd7ab23d8) at ../../../src/openvpn/ssl_pkt.c:323
323         uint8_t pkt_firstbyte = *BPTR(buf);
(gdb) where
#0  0x0000b5a4e11745c8 in tls_pre_decrypt_lite (tas=tas@entry=0xffffd7ab2670, state=state@entry=0xffffd7ab2420, from=from@entry=0xffffd7ab23f0, buf=buf@entry=0xffffd7ab23d8) at ../../../src/openvpn/ssl_pkt.c:323
#1  0x0000b5a4e11653d8 in test_tls_decrypt_lite_crypt (ut_state=<optimized out>) at test_pkt.c:257
#2  0x0000e043d5ca58dc in ?? () from /lib/aarch64-linux-gnu/libcmocka.so.0
#3  0x0000e043d5ca5ed8 [PAC] in _cmocka_run_group_tests () from /lib/aarch64-linux-gnu/libcmocka.so.0
#4  0x0000b5a4e11638b4 [PAC] in main () at test_pkt.c:679

mattock avatar Dec 10 '24 08:12 mattock

As suggested by @cron2 I compiled with -O0 and then things magically started working:

(gdb) run
Starting program: /tmp/openvpn/tests/unit_tests/openvpn/pkt_testdriver 
warning: Error disabling address space randomization: Operation not permitted
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/aarch64-linux-gnu/libthread_db.so.1".
[==========] pkt tests: Running 10 test(s).
[ RUN      ] test_tls_decrypt_lite_none
[       OK ] test_tls_decrypt_lite_none
[ RUN      ] test_tls_decrypt_lite_auth
[       OK ] test_tls_decrypt_lite_auth
[ RUN      ] test_tls_decrypt_lite_crypt
[       OK ] test_tls_decrypt_lite_crypt
[ RUN      ] test_parse_ack
[       OK ] test_parse_ack
[ RUN      ] test_calc_session_id_hmac_static
[       OK ] test_calc_session_id_hmac_static
[ RUN      ] test_verify_hmac_none
[       OK ] test_verify_hmac_none
[ RUN      ] test_verify_hmac_tls_auth
[       OK ] test_verify_hmac_tls_auth
[ RUN      ] test_generate_reset_packet_plain
[       OK ] test_generate_reset_packet_plain
[ RUN      ] test_generate_reset_packet_tls_auth
[       OK ] test_generate_reset_packet_tls_auth
[ RUN      ] test_extract_control_message
[       OK ] test_extract_control_message
[==========] pkt tests: 10 test(s) run.
[  PASSED  ] 10 test(s).
[Inferior 1 (process 26706) exited normally]

mattock avatar Dec 10 '24 09:12 mattock

Tests on different AWS instances and on Macs suggest that this crash only happens on 7th generation Graviton instances. We have seen it on m7g and c7g AWS instances, but not on any other ARM machines, e.g. not on c6g instances.

flichtenheld avatar Dec 11 '24 13:12 flichtenheld

(gdb) print buf
$19 = {capacity = 1024, offset = 1450110702, len = 54, 
  data = 0xbadf1298c5e0 "8\364#\313\022\321\371\344\217"}

flichtenheld avatar Dec 11 '24 14:12 flichtenheld

pkt_testdriver-test_pkt.txt

Assembler

flichtenheld avatar Dec 11 '24 15:12 flichtenheld

Is this still a thing? Aka, is it still crashing, is anyone working on it, or shall we just close the issue and claim "it magically fixed itself"?

cron2 avatar Sep 08 '25 15:09 cron2