timescaledb
[Bug]: DEADLOCK when multiple telegraf clients are connected at once
What type of bug is this?
Crash, Data corruption, Locking issue, Unexpected error
What subsystems and features are affected?
Data node
What happened?
We have ~400 hosts on which we install Telegraf with the PostgreSQL output. Starting from an empty PostgreSQL database with TimescaleDB installed, spawning more than one Telegraf instance at once (on the client machines) causes a deadlock in the PostgreSQL database, after which everything stops working.
TimescaleDB version affected
2.6.0
PostgreSQL version used
13.6
What operating system did you use?
FreeBSD 13.0
What installation method did you use?
Source
What platform did you run on?
On prem/Self-hosted
Relevant log output and stack trace
2022-03-07 11:43:24.298 CET [63934] ERROR: deadlock detected
2022-03-07 11:43:24.298 CET [63934] DETAIL: Process 63934 waits for AccessExclusiveLock on relation 57025 of database 56403; blocked by process 63868.
Process 63868 waits for AccessShareLock on relation 57044 of database 56403; blocked by process 63934.
Process 63934: ALTER TABLE "public"."nstat" ADD COLUMN IF NOT EXISTS "IcmpInParmProbs" bigint, ADD COLUMN IF NOT EXISTS "IcmpMsgInType12" bigint, ADD COLUMN IF NOT EXISTS "IpReasmOKs" bigint, ADD COLUMN IF NOT EXISTS "IpReasmReqds" bigint, ADD COLUMN IF NOT EXISTS "TcpExtLockDroppedIcm
ps" bigint, ADD COLUMN IF NOT EXISTS "TcpExtSyncookiesFailed" bigint, ADD COLUMN IF NOT EXISTS "TcpExtTCPRenoFailures" bigint
Process 63868: select "time", "tag_id", "err_in", "err_out", "drop_in", "drop_out", "bytes_sent", "bytes_recv", "packets_sent", "packets_recv", "icmp_inaddrmasks", "icmp_outtimeexcds", "icmp_inechos", "icmpmsg_intype0", "tcp_inerrs", "tcp_outrsts", "tcp_rtoalgorithm", "ip_reasmoks", "i
cmp_insrcquenchs", "icmp_intimeexcds", "icmpmsg_outtype8", "icmpmsg_outtype14", "tcp_currestab", "udp_noports", "udplite_sndbuferrors", "ip_reasmtimeout", "udplite_incsumerrors", "icmp_outechoreps", "icmp_inparmprobs", "icmp_outparmprobs", "icmpmsg_intype3", "tcp_maxconn", "udp_sndbuferrors",
"udp_inerrors", "ip_fragcreates", "udplite_noports", "icmpmsg_outtype0", "tcp_attemptfails", "icmp_outsrcquenchs", "icmp_outtimestampreps", "icmpmsg_intype8", "tcp_outsegs", "tcp_retranssegs", "tcp_rtomax", "udp_outdatagrams", "udplite_indatagrams", "icmp_inredirects", "ip_fragoks", "icmp_outa
ddrmasks", "icmp_outredirects", "icmp_inerrors", "ip_indiscards", "tcp_activeopens", "ip_fragfails", "ip_reasmfails", "ip_inaddrerrors", "icmp_inaddrmaskreps", "icm
2022-03-07 11:43:24.298 CET [63934] HINT: See server log for query details.
2022-03-07 11:43:24.298 CET [63934] STATEMENT: ALTER TABLE "public"."nstat" ADD COLUMN IF NOT EXISTS "IcmpInParmProbs" bigint, ADD COLUMN IF NOT EXISTS "IcmpMsgInType12" bigint, ADD COLUMN IF NOT EXISTS "IpReasmOKs" bigint, ADD COLUMN IF NOT EXISTS "IpReasmReqds" bigint, ADD COLUMN IF NOT EXI
STS "TcpExtLockDroppedIcmps" bigint, ADD COLUMN IF NOT EXISTS "TcpExtSyncookiesFailed" bigint, ADD COLUMN IF NOT EXISTS "TcpExtTCPRenoFailures" bigint
2022-03-07 11:43:29.281 CET [63934] ERROR: column "icmpmsg_intype12" does not exist at character 1306
How can we reproduce the bug?
Just start more than one Telegraf instance with the PostgreSQL output at the same time.
We worked around this issue by adding a "sleep 60" to our Telegraf initialization script, but with 400+ hosts that's not a sustainable approach.
We used Telegraf from the postgres branch, i.e. 1.22.0-e27281f3.
It deadlocks like this on a lot of inputs; here is the MySQL input:
2022-03-07 13:30:57.386 CET [77673] ERROR: deadlock detected
2022-03-07 13:30:57.386 CET [77673] DETAIL: Process 77673 waits for AccessExclusiveLock on relation 211972 of database 211344; blocked by process 77541.
Process 77541 waits for ExclusiveLock on advisory lock [211344,1328403803,908883257,1]; blocked by process 77673.
Process 77673: ALTER TABLE "public"."mysql" ADD COLUMN IF NOT EXISTS "aborted_connects_preauth" bigint, ADD COLUMN IF NOT EXISTS "acl_package_body_grants" bigint, ADD COLUMN IF NOT EXISTS "acl_package_spec_grants" bigint, ADD COLUMN IF NOT EXISTS "column_compressions" bigint, ADD COLUMN IF NOT EXISTS "column_decompressions" bigint, ADD COLUMN IF NOT EXISTS "com_alter_sequence" bigint, ADD COLUMN IF NOT EXISTS "com_alter_user" bigint, ADD COLUMN IF NOT EXISTS "com_backup" bigint, ADD COLUMN IF NOT EXISTS "com_backup_lock" bigint, ADD COLUMN IF NOT EXISTS "com_create_package" bigint, ADD COLUMN IF NOT EXISTS "com_create_package_body" bigint, ADD COLUMN IF NOT EXISTS "com_create_sequence" bigint, ADD COLUMN IF NOT EXISTS "com_drop_package" bigint, ADD COLUMN IF NOT EXISTS "com_drop_package_body" bigint, ADD COLUMN IF NOT EXISTS "com_drop_sequence" bigint, ADD COLUMN IF NOT EXISTS "com_execute_immediate" bigint, ADD COLUMN IF NOT EXISTS "com_multi" bigint, ADD COLUMN IF NOT EXISTS "com_show_binlog_status" bigint, ADD COLUMN IF
Process 77541: SELECT pg_advisory_xact_lock($1)
2022-03-07 13:30:57.386 CET [77673] HINT: See server log for query details.
It's super easy to reproduce…
Here it is on inputs.interrupts:
2022-03-07 13:24:09.525 CET [76407] ERROR: deadlock detected
2022-03-07 13:24:09.525 CET [76407] DETAIL: Process 76407 waits for AccessExclusiveLock on relation 205806 of database 205076; blocked by process 76450.
Process 76450 waits for AccessShareLock on relation 205698 of database 205076; blocked by process 76407.
Process 76407: ALTER TABLE "public"."interrupts" ADD COLUMN IF NOT EXISTS "CPU2" bigint, ADD COLUMN IF NOT EXISTS "CPU3" bigint
Process 76450: select "time", "device", "host", "irq", "type", "total", "CPU0", "CPU1" from "public"."soft_interrupts"
and
2022-03-07 13:23:21.798 CET [76225] ERROR: deadlock detected
2022-03-07 13:23:21.798 CET [76225] DETAIL: Process 76225 waits for AccessExclusiveLock on relation 205698 of database 205076; blocked by process 76412.
Process 76412 waits for ExclusiveLock on advisory lock [205076,1328403803,908883257,1]; blocked by process 76225.
Process 76225: ALTER TABLE "public"."soft_interrupts" ADD COLUMN IF NOT EXISTS "CPU2" bigint, ADD COLUMN IF NOT EXISTS "CPU3" bigint, ADD COLUMN IF NOT EXISTS "CPU4" bigint, ADD COLUMN IF NOT EXISTS "CPU5" bigint
Process 76412: SELECT pg_advisory_xact_lock($1)
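The interleaving these logs point at, in simplified form (two psql sessions; the table name and the advisory lock key 42 are only illustrative, not the ones Telegraf actually uses, and the exact statement order inside the plugin may differ):

-- Session A (one Telegraf writer)
BEGIN;
SELECT pg_advisory_xact_lock(42);          -- takes the "schema change" advisory lock
ALTER TABLE public.soft_interrupts
  ADD COLUMN IF NOT EXISTS "CPU2" bigint;  -- now waits for AccessExclusiveLock on the table

-- Session B (another Telegraf writer), interleaved with A
BEGIN;
SELECT "time", "CPU0", "CPU1"
  FROM public.soft_interrupts;             -- holds AccessShareLock on the table
SELECT pg_advisory_xact_lock(42);          -- waits for the advisory lock held by A
-- A waits for B's table lock, B waits for A's advisory lock => deadlock detected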
Also, if you have a really big, long-running database and you gather MySQL inputs from it, it may end with an error like this:
ERROR: COPY from stdin failed: 18446744073709551615 is greater than maximum value for Int8
Workaround:
Put this into the [[inputs.mysql]] section of telegraf.conf:
fielddrop = ["wsrep_local_index","wsrep_cluster_conf_id","tokudb_flusher_cleaner_min_buffer_size","tokudb_flusher_cleaner_min_buffer_workdone"]
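For context on that Int8 error: 18446744073709551615 is 2^64 - 1, the largest unsigned 64-bit value, while PostgreSQL's bigint (int8) is a signed 64-bit type capped at 9223372036854775807, so such counters cannot be stored in a bigint column. A quick illustration in plain SQL (independent of Telegraf):

SELECT 18446744073709551615::bigint;   -- fails: value is out of range for bigint
SELECT 18446744073709551615::numeric;  -- works: numeric has arbitrary precision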
Deadlocks are generally client issues.
Since you're using Telegraf, and I see you recently posted on https://github.com/influxdata/telegraf/pull/8651, can you confirm that this is the plugin you are using? The issues you're mentioning should not occur with that plugin (but may be possible with the old, deprecated plugin). The #8651 plugin grabs a global lock before performing any schema modifications, which should prevent deadlocks. And the "is greater than maximum value for Int8" issue should be mitigated, as that plugin uses the "numeric" data type for unsigned 64-bit integers. This is why I'd like to confirm which plugin you are using. If that is the plugin you are using, then this is really interesting. If there's anything unusual about your setup that you can think of (besides just a large number of processes), it would be great if you could share it (and share it over on #8651, not here).
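At the SQL level, the schema-change serialization described above looks roughly like this (a sketch only; the actual advisory lock key, table, and statement order are whatever the plugin implements):

BEGIN;
SELECT pg_advisory_xact_lock(42);               -- global lock serializing schema changes
ALTER TABLE public.mysql
  ADD COLUMN IF NOT EXISTS "com_multi" bigint;  -- only one writer alters the table at a time
COMMIT;                                         -- transaction-scoped advisory lock released here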
Thanks
@dmilith Hey, this does not seem to be related to TimescaleDB directly. Do you see these deadlocks on a regular PostgreSQL instance (without TimescaleDB installed) too?
Hi @dmilith, just checking in on this issue. Have you been able to try the above suggestion and check whether the deadlock also occurs there? Thanks!
Dear Author,
This issue has been automatically marked as stale due to lack of activity. With only the issue description that is currently provided, we do not have enough information to take action. If you have or find the answers we would need, please reach out. Otherwise, this issue will be closed in 30 days. Thank you!
Dear Author,
We are closing this issue due to lack of activity. Feel free to add a comment to this issue if you can provide more information and we will re-open it. Thank you!