DIRAC icon indicating copy to clipboard operation
DIRAC copied to clipboard

Core dump on long running queries

Open chaen opened this issue 2 years ago • 2 comments

When running long lasting queries against a DB on a loaded server, we get a core dump of python. The cut is exactly at 600 seconds. When running just that query on an isolated server, it works fine. I did not have time to dig deep into yet, but LHCb is impacted in production

The client just receives a blunt

"Error in _read: (104, 'Connection reset by peer') SSLError(104, 'Connection reset by peer')"

The full stacktrace is attached, but here are the relevant info

*** Error in `/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11': double free or corruption (out): 0x00007fcbfc0acba0 ***
*** Error in `/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11': corrupted double-linked list: 0x00007fcbfc093ce0 ***
======= Backtrace: =========
======= Backtrace: =========
/lib64/libc.so.6(+0x81329)[0x7fcc3e95f329]
/lib64/libc.so.6(+0x8097f)[0x7fcc3e95e97f]
/lib64/libc.so.6(+0x8120e)[0x7fcc3e95f20e]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/lib/python3.11/site-packages/MySQLdb/../../../libmysqlclient.so.21(+0x8d904)[0x7fcc2fa13904]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/lib/python3.11/site-packages/MySQLdb/../../../libmysqlclient.so.21(+0x304bb)[0x7fcc2f9b64bb]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/lib/python3.11/site-packages/MySQLdb/../../../libmysqlclient.so.21(+0x3551c)[0x7fcc2f9bb51c]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/lib/python3.11/site-packages/M2Crypto/../../../libssl.so.3(SSL_free+0x14e)[0x7fcc36d9780e]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/lib/python3.11/site-packages/MySQLdb/../../../libmysqlclient.so.21(+0x87ca9)[0x7fcc2fa0dca9]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/lib/python3.11/site-packages/MySQLdb/../../../libmysqlclient.so.21(+0x334ab)[0x7fcc2f9b94ab]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/lib/python3.11/site-packages/MySQLdb/../../../libmysqlclient.so.21(+0x33766)[0x7fcc2f9b9766]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/lib/python3.11/site-packages/MySQLdb/../../../libmysqlclient.so.21(+0x3de70)[0x7fcc2f9c3e70]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/lib/python3.11/site-packages/MySQLdb/../../../libmysqlclient.so.21(mysql_ping+0x26)[0x7fcc2f9b2216]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/lib/python3.11/site-packages/MySQLdb/_mysql.cpython-311-x86_64-linux-gnu.so(+0x754b)[0x7fcc37b0454b]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(+0x200d68)[0x55bd8eb0ed68]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(_PyObject_MakeTpCall+0x263)[0x55bd8eaeee13]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(_PyEval_EvalFrameDefault+0x81d)[0x55bd8eafb85d]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/lib/python3.11/site-packages/MySQLdb/../../../libmysqlclient.so.21(mysql_store_result+0xb0)[0x7fcc2f9c0590]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/lib/python3.11/site-packages/MySQLdb/_mysql.cpython-311-x86_64-linux-gnu.so(+0x6ae4)[0x7fcc37b03ae4]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/lib/python3.11/site-packages/MySQLdb/_mysql.cpython-311-x86_64-linux-gnu.so(+0x70dc)[0x7fcc37b040dc]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(+0x1e474e)[0x55bd8eaf274e]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(PyObject_Vectorcall+0x2c)[0x55bd8eb084cc]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(+0x230a74)[0x55bd8eb3ea74]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(+0x2307b5)[0x55bd8eb3e7b5]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(_PyEval_EvalFrameDefault+0x438f)[0x55bd8eaff3cf]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(+0x230a74)[0x55bd8eb3ea74]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(_PyEval_EvalFrameDefault+0x81d)[0x55bd8eafb85d]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(+0x2307b5)[0x55bd8eb3e7b5]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(_PyEval_EvalFrameDefault+0x438f)[0x55bd8eaff3cf]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(+0x230a74)[0x55bd8eb3ea74]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(+0x2307b5)[0x55bd8eb3e7b5]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(_PyEval_EvalFrameDefault+0x438f)[0x55bd8eaff3cf]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(_PyFunction_Vectorcall+0x181)[0x55bd8eb1e021]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(_PyEval_EvalFrameDefault+0x438f)[0x55bd8eaff3cf]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(+0x230a74)[0x55bd8eb3ea74]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(+0x230838)[0x55bd8eb3e838]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(+0x30547b)[0x55bd8ec1347b]
/opt/dirac/versions/v11.0.18-1691646408/Linux-x86_64/bin/python3.11(+0x2d0a04)[0x55bd8ebdea04]
/lib64/libpthread.so.0(+0x7ea5)[0x7fcc3f5c4ea5]
/lib64/libc.so.6(clone+0x6d)[0x7fcc3e9dcb0d]

A quick attempt at an strace output shows puzzling things: two different threads trying to shutdown and close the very same socket.

28802 15:00:40.758524 <... read resumed>"\362\320Lx\325", 5) = 5 <0.298853>
28994 15:00:40.758583 write(12<TCP:[137.138.150.133:44330->10.30.136.35:5506]>, "\27\3\3\0\23\344\215h\253\224\342\255\213\35 \326Do\246\214_\310\345>", 24 <unfinished ...>
28802 15:00:40.758616 shutdown(12<TCP:[137.138.150.133:44330->10.30.136.35:5506]>, SHUT_RDWR <unfinished ...>
28994 15:00:40.758697 <... write resumed>) = 24 <0.000087>
28802 15:00:40.758719 <... shutdown resumed>) = 0 <0.000083>
28994 15:00:40.758751 shutdown(12<TCP:[137.138.150.133:44330->10.30.136.35:5506]>, SHUT_RDWR <unfinished ...>
28802 15:00:40.758774 close(12<TCP:[137.138.150.133:44330->10.30.136.35:5506]> <unfinished ...>
28994 15:00:40.758795 <... shutdown resumed>) = 0 <0.000025>
28994 15:00:40.758873 close(12 <unfinished ...>
28802 15:00:40.758905 <... close resumed>) = 0 <0.000115>
28994 15:00:40.758923 <... close resumed>) = -1 EBADF (Bad file descriptor) <0.000024>

My gut feeling is that the bug has always been there, but python 3.11 optimizations exhibit it. stack_trace.log

CTA (cc @arrabito ) observes a similar behavior. Btw @arrabito , do you observe it on python3.11 or 3.9 ?

chaen avatar Aug 30 '23 08:08 chaen

I found a reproducer and probably a fix: https://github.com/DIRACGrid/DIRAC/pull/7342

import MySQLdb
import random
from concurrent.futures import ThreadPoolExecutor
host="db.cern.ch"
port=5506
user="BabyShark"
passwd="Tududududu"
conn = MySQLdb.connect(host=host, port=port, user=user, passwd=passwd)


def execute(rnd):
    cur = conn.cursor()
    cur.execute(f"SELECT {rnd}")
    print(cur.fetchall())
    cur.close()

with ThreadPoolExecutor(max_workers=10) as executor:
    executor.map(execute, [random.randint(0,10) for _ in range(20)])
 python /tmp/boom.py
((3,),)
((5,),)
((2,),)
((4,),)
((4,),)
((4,),)
((3,),)
((9,),)
((10,),)
((0,),)
((0,),)
*** Error in `python': double free or corruption (fasttop): 0x000055cc22fd8200 ***
Segmentation fault

chaen avatar Dec 05 '23 16:12 chaen

@chaen I don't remember if it was python 3.9 or 3.11. Anyway thank you for the fix. I confirm that we didn't observe the issue again recently. Now we are running 8.0.40 with python 3.11. I will let you if we observe the issue again.

arrabito avatar Apr 02 '24 09:04 arrabito