mysql2 icon indicating copy to clipboard operation
mysql2 copied to clipboard

Segmentation faults when running GraphQL tests (Rails, Minitest)

Open jcat4 opened this issue 1 year ago • 3 comments

Overview

We have some GraphQL mutation tests in a rails application that have been inconsistently failing with malloc/segfault errors. The more tests we run, the more likely the issues are to arise.

The hangs and freezes have been occurring locally on an M1 Pro MacBook Pro. These issues were also reproducible in our CI pipeline (docker containers, unsure which distro) and in a remote development container (Ubuntu 21.10, Kernel version 5.10.68)

We ran the tests through Valgrind on the Ubuntu container mentioned above and were able to capture a log of what happened.

Here is a notable snippet from the log:

==716395== Invalid read of size 4
==716395==    at 0xCD504A6: BIO_clear_flags (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==716395==    by 0xCD6056B: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==716395==    by 0xCD4E409: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==716395==    by 0xCD534B3: ??? (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==716395==    by 0xCD53596: BIO_write (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==716395==    by 0xCC46E80: ??? (in /usr/lib/x86_64-linux-gnu/libssl.so.1.1)
==716395==    by 0xCC4D688: ??? (in /usr/lib/x86_64-linux-gnu/libssl.so.1.1)
==716395==    by 0xCC4D954: ??? (in /usr/lib/x86_64-linux-gnu/libssl.so.1.1)
==716395==    by 0xCC5C4A6: SSL_write (in /usr/lib/x86_64-linux-gnu/libssl.so.1.1)
==716395==    by 0x11E560E5: ??? (in /usr/lib/x86_64-linux-gnu/libperconaserverclient.so.20.3.26)
==716395==    by 0x11E332D9: ??? (in /usr/lib/x86_64-linux-gnu/libperconaserverclient.so.20.3.26)
==716395==    by 0x11E334A6: ??? (in /usr/lib/x86_64-linux-gnu/libperconaserverclient.so.20.3.26)
==716395==    by 0x11E33ADF: ??? (in /usr/lib/x86_64-linux-gnu/libperconaserverclient.so.20.3.26)
==716395==    by 0x11E2E5EC: ??? (in /usr/lib/x86_64-linux-gnu/libperconaserverclient.so.20.3.26)
==716395==    by 0x11E2A9A5: mysql_send_query (in /usr/lib/x86_64-linux-gnu/libperconaserverclient.so.20.3.26)
==716395==    by 0x11DEBB57: nogvl_send_query (client.c:536)
==716395==    by 0x4AD3A54: rb_nogvl (thread.c:1687)
==716395==    by 0x11DEC060: do_send_query (client.c:544)
==716395==    by 0x49333DD: rb_vrescue2 (eval.c:903)
==716395==    by 0x493366D: rb_rescue2 (eval.c:884)
==716395==    by 0x11DEEA69: rb_mysql_query (client.c:812)
==716395==    by 0x4B08D26: vm_call_cfunc_with_frame (vm_insnhelper.c:3037)

==716395==  Address 0x1a1cca28 is 40 bytes inside a block of size 120 free'd
==716395==    at 0x484621F: free (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==716395==    by 0xCD594AC: BIO_free (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==716395==    by 0xCD5950B: BIO_free_all (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==716395==    by 0xCC64C6F: SSL_free (in /usr/lib/x86_64-linux-gnu/libssl.so.1.1)
==716395==    by 0x11E56220: ??? (in /usr/lib/x86_64-linux-gnu/libperconaserverclient.so.20.3.26)
==716395==    by 0x11E2785D: ??? (in /usr/lib/x86_64-linux-gnu/libperconaserverclient.so.20.3.26)
==716395==    by 0x11E2E60D: ??? (in /usr/lib/x86_64-linux-gnu/libperconaserverclient.so.20.3.26)
==716395==    by 0x11E213C9: mysql_ping (in /usr/lib/x86_64-linux-gnu/libperconaserverclient.so.20.3.26)
==716395==    by 0x11DEBCDC: nogvl_ping (client.c:1149)
==716395==    by 0x4AD3A54: rb_nogvl (thread.c:1687)
==716395==    by 0x4B08D26: vm_call_cfunc_with_frame (vm_insnhelper.c:3037)

==716395==  Block was alloc'd at
==716395==    at 0x4843839: malloc (in /usr/libexec/valgrind/vgpreload_memcheck-amd64-linux.so)
==716395==    by 0xCDF4B3D: CRYPTO_zalloc (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==716395==    by 0xCD596A9: BIO_new (in /usr/lib/x86_64-linux-gnu/libcrypto.so.1.1)
==716395==    by 0xCC5B99B: SSL_set_fd (in /usr/lib/x86_64-linux-gnu/libssl.so.1.1)
==716395==    by 0x11E55E6E: ??? (in /usr/lib/x86_64-linux-gnu/libperconaserverclient.so.20.3.26)
==716395==    by 0x11E2D4D9: mysql_real_connect (in /usr/lib/x86_64-linux-gnu/libperconaserverclient.so.20.3.26)
==716395==    by 0x11DEBB99: nogvl_connect (client.c:232)
==716395==    by 0x4AD3A54: rb_nogvl (thread.c:1687)
==716395==    by 0x11DEDA79: rb_mysql_connect (client.c:467)

App Versions

Notable gem versions:

  • rails: 7.0.3.1
  • mysql2: 0.5.4
  • graphql: 2.0.14
  • minitest: 5.16.3

Ruby version: 3.1.2

I’m running MySQL on the container with Percona version 5.7.39-42

MySQL Config

Here is the contents of my contain’s /etc/my.cnf:

[mysql]

[mysqldump]

quick
quote-names
max_allowed_packet = 16M
set-gtid-purged    = OFF

[mysqld]

# GENERAL #
default-storage-engine         = InnoDB
character-set-server           = utf8mb4
collation-server               = utf8mb4_general_ci

# MyISAM #
key-buffer-size                = 32M

# SAFETY #
max-allowed-packet             = 16M
max-connect-errors             = 1000000

# CACHES AND LIMITS #
tmp-table-size                 = 32M
max-heap-table-size            = 32M
query-cache-type               = 0
query-cache-size               = 0
max-connections                = 50
thread-cache-size              = 8
open-files-limit               = 10000
table-definition-cache         = 2048
table-open-cache               = 2048

# INNODB #
innodb-flush-method            = O_DSYNC
innodb-log-files-in-group      = 2
innodb-flush-log-at-trx-commit = 2
innodb_flush_log_at_timeout    = 2700
innodb_doublewrite             = 0
innodb-buffer-pool-size        = 64M
innodb-checksum-algorithm      = none

# LOGGING #
log-queries-not-using-indexes  = 0
slow-query-log                 = 0

# REPLICATION
log_bin
server-id=1
binlog_format=ROW
log_bin_trust_function_creators = 1
max_binlog_files = 5
max_binlog_size = 10485760
binlog_rows_query_log_events = ON

# some other variables are set as arguments here: /containers/isospin/rootfs/etc/systemd/system/[email protected]

...And here is the contents of the [email protected] file mentioned in the previous config file:

[Unit]
Description=[Spin] MySQL {%i}
After=network.target
Requires=svcport@mysql-%i.service
After=svcport@mysql-%i.service

PartOf=spin-svc-units@%i.target
Before=spin-svc-units@%i.target

[Service]
Type=simple
TimeoutSec=30
Restart=on-failure
LimitNOFILE=10000
EnvironmentFile=/run/ports/%i/svc/.mysql.env
RuntimeDirectory=mysqld-%i
RuntimeDirectoryMode=755
StateDirectory=mysqld-%i
LogsDirectory=mysqld-%i
ExecStartPre=-/usr/sbin/mysqld \
  --initialize-insecure \
  --log-error=${LOGS_DIRECTORY}/error.log \
  --pid-file=${RUNTIME_DIRECTORY}/mysqld.pid \
  --socket=${RUNTIME_DIRECTORY}/mysqld.sock \
  --datadir=${STATE_DIRECTORY} \
  --sql_mode=NO_ENGINE_SUBSTITUTION \
  --show_compatibility_56=ON \
  --enforce-gtid-consistency=ON \
  --gtid-mode=ON \
  --innodb-file-per-table=0 \
  --innodb-log-file-size=4194304 \
  --explicit_defaults_for_timestamp=true \
  --port=${PORT} \
  --init-file=/opt/spin/lib/mysql/any-init.sql \
  --max-allowed-packet=16777216
ExecStart=/usr/sbin/mysqld \
  --log-error=${LOGS_DIRECTORY}/error.log \
  --pid-file=${RUNTIME_DIRECTORY}/mysqld.pid \
  --socket=${RUNTIME_DIRECTORY}/mysqld.sock \
  --datadir=${STATE_DIRECTORY} \
  --sql_mode=NO_ENGINE_SUBSTITUTION \
  --show_compatibility_56=ON \
  --enforce-gtid-consistency=ON \
  --gtid-mode=ON \
  --innodb-file-per-table=0 \
  --innodb-log-file-size=4194304 \
  --explicit_defaults_for_timestamp=true \
  --port=${PORT} \
  --max-allowed-packet=16777216

# Set enviroment variable MYSQLD_PARENT_PID. This is required for restart.
Environment=MYSQLD_PARENT_PID=1

# TODO(burke): Find a way to not have to give up CAP_SYS_ADMIN, so that we can
# continue to have this nice safeguard.
User=mysql
Group=mysql
; DynamicUser=true
; PrivateTmp=true
; ReadOnlyDirectories=/
; ReadWritePaths=-%L/mysqld-%i
; ReadWritePaths=-%S/mysqld-%i
; ReadWritePaths=-%t/mysqld-%i
; NoNewPrivileges=true
; CapabilityBoundingSet=CAP_SETGID CAP_SETUID CAP_SYS_RESOURCE
; MemoryDenyWriteExecute=true
; ProtectKernelModules=true
; ProtectKernelTunables=true
; ProtectControlGroups=true
; RestrictRealtime=true
; RestrictNamespaces=true
; RestrictAddressFamilies=AF_INET AF_INET6 AF_UNIX
; ProtectSystem=full

jcat4 avatar Sep 15 '22 18:09 jcat4

Looking at the malloc/free situation that Valgrind is showing, I wonder if you are running multithreaded code where one thread might be closing a connection while the other thread is expecting to continue using it?

Given that you're using the libperconaserverclient library, are you using community edition or do you have a support contract with them? That would be my first suggestion to see if you can get someone on the hook to figure out if their library is behaving differently than the upstream MySQL client library, or if they're linking to a different / newer version of OpenSSL or something along those lines.

sodabrew avatar Sep 15 '22 21:09 sodabrew

Thanks for the response!

I wonder if you are running multithreaded code

I tried to ensure the tests were not running in parallel. I added parallelize(workers: 1) to our test_helper and exported PARALLEL_WORKERS=1. After running the tests for a bit, I was still able to hit the segfault exception again. I don't think there would be any other threaded code running?

Given that you're using the libperconaserverclient library, are you using community edition or do you have a support contract with them?

I'm honestly not sure. I am struggling to find anything definitive myself, but I posted my question to some infra people. Hopefully I'll hear back soon.

jcat4 avatar Sep 16 '22 01:09 jcat4

We do not have a support contract, but we do have a lot of in-house experts that I'm talking to now. I'll try to update here if I hear anything new!

jcat4 avatar Sep 16 '22 16:09 jcat4