timescaledb
[Bug]: DEADLOCK when multiple telegraf clients are connected at once
What type of bug is this?
Crash, Data corruption, Locking issue, Unexpected error
What subsystems and features are affected?
Data node
What happened?
We have ~400 hosts on which we install Telegraf with the PostgreSQL output. Starting from an empty PostgreSQL database with TimescaleDB installed, spawning more than one Telegraf instance at once (on the client machines) causes a deadlock in the PostgreSQL database, after which everything stops working.
TimescaleDB version affected
2.6.0
PostgreSQL version used
13.6
What operating system did you use?
FreeBSD 13.0
What installation method did you use?
Source
What platform did you run on?
On prem/Self-hosted
Relevant log output and stack trace
2022-03-07 11:43:24.298 CET [63934] ERROR: deadlock detected
2022-03-07 11:43:24.298 CET [63934] DETAIL: Process 63934 waits for AccessExclusiveLock on relation 57025 of database 56403; blocked by process 63868.
Process 63868 waits for AccessShareLock on relation 57044 of database 56403; blocked by process 63934.
Process 63934: ALTER TABLE "public"."nstat" ADD COLUMN IF NOT EXISTS "IcmpInParmProbs" bigint, ADD COLUMN IF NOT EXISTS "IcmpMsgInType12" bigint, ADD COLUMN IF NOT EXISTS "IpReasmOKs" bigint, ADD COLUMN IF NOT EXISTS "IpReasmReqds" bigint, ADD COLUMN IF NOT EXISTS "TcpExtLockDroppedIcm
ps" bigint, ADD COLUMN IF NOT EXISTS "TcpExtSyncookiesFailed" bigint, ADD COLUMN IF NOT EXISTS "TcpExtTCPRenoFailures" bigint
Process 63868: select "time", "tag_id", "err_in", "err_out", "drop_in", "drop_out", "bytes_sent", "bytes_recv", "packets_sent", "packets_recv", "icmp_inaddrmasks", "icmp_outtimeexcds", "icmp_inechos", "icmpmsg_intype0", "tcp_inerrs", "tcp_outrsts", "tcp_rtoalgorithm", "ip_reasmoks", "i
cmp_insrcquenchs", "icmp_intimeexcds", "icmpmsg_outtype8", "icmpmsg_outtype14", "tcp_currestab", "udp_noports", "udplite_sndbuferrors", "ip_reasmtimeout", "udplite_incsumerrors", "icmp_outechoreps", "icmp_inparmprobs", "icmp_outparmprobs", "icmpmsg_intype3", "tcp_maxconn", "udp_sndbuferrors",
"udp_inerrors", "ip_fragcreates", "udplite_noports", "icmpmsg_outtype0", "tcp_attemptfails", "icmp_outsrcquenchs", "icmp_outtimestampreps", "icmpmsg_intype8", "tcp_outsegs", "tcp_retranssegs", "tcp_rtomax", "udp_outdatagrams", "udplite_indatagrams", "icmp_inredirects", "ip_fragoks", "icmp_outa
ddrmasks", "icmp_outredirects", "icmp_inerrors", "ip_indiscards", "tcp_activeopens", "ip_fragfails", "ip_reasmfails", "ip_inaddrerrors", "icmp_inaddrmaskreps", "icm
2022-03-07 11:43:24.298 CET [63934] HINT: See server log for query details.
2022-03-07 11:43:24.298 CET [63934] STATEMENT: ALTER TABLE "public"."nstat" ADD COLUMN IF NOT EXISTS "IcmpInParmProbs" bigint, ADD COLUMN IF NOT EXISTS "IcmpMsgInType12" bigint, ADD COLUMN IF NOT EXISTS "IpReasmOKs" bigint, ADD COLUMN IF NOT EXISTS "IpReasmReqds" bigint, ADD COLUMN IF NOT EXI
STS "TcpExtLockDroppedIcmps" bigint, ADD COLUMN IF NOT EXISTS "TcpExtSyncookiesFailed" bigint, ADD COLUMN IF NOT EXISTS "TcpExtTCPRenoFailures" bigint
2022-03-07 11:43:29.281 CET [63934] ERROR: column "icmpmsg_intype12" does not exist at character 1306
How can we reproduce the bug?
Just start more than one Telegraf instance with the PostgreSQL output at the same time.
We worked around this issue by adding a "sleep 60" to our Telegraf initialization script, but with 400+ hosts that's not a sustainable approach.
We used Telegraf from the postgres branch, i.e. 1.22.0-e27281f3.
It deadlocks like this on a lot of inputs; here is the MySQL input:
2022-03-07 13:30:57.386 CET [77673] ERROR: deadlock detected
2022-03-07 13:30:57.386 CET [77673] DETAIL: Process 77673 waits for AccessExclusiveLock on relation 211972 of database 211344; blocked by process 77541.
Process 77541 waits for ExclusiveLock on advisory lock [211344,1328403803,908883257,1]; blocked by process 77673.
Process 77673: ALTER TABLE "public"."mysql" ADD COLUMN IF NOT EXISTS "aborted_connects_preauth" bigint, ADD COLUMN IF NOT EXISTS "acl_package_body_grants" bigint, ADD COLUMN IF NOT EXISTS "acl_package_spec_grants" bigint, ADD COLUMN IF NOT EXISTS "column_compressions" bigint, ADD COLUMN IF NOT EXISTS "column_decompressions" bigint, ADD COLUMN IF NOT EXISTS "com_alter_sequence" bigint, ADD COLUMN IF NOT EXISTS "com_alter_user" bigint, ADD COLUMN IF NOT EXISTS "com_backup" bigint, ADD COLUMN IF NOT EXISTS "com_backup_lock" bigint, ADD COLUMN IF NOT EXISTS "com_create_package" bigint, ADD COLUMN IF NOT EXISTS "com_create_package_body" bigint, ADD COLUMN IF NOT EXISTS "com_create_sequence" bigint, ADD COLUMN IF NOT EXISTS "com_drop_package" bigint, ADD COLUMN IF NOT EXISTS "com_drop_package_body" bigint, ADD COLUMN IF NOT EXISTS "com_drop_sequence" bigint, ADD COLUMN IF NOT EXISTS "com_execute_immediate" bigint, ADD COLUMN IF NOT EXISTS "com_multi" bigint, ADD COLUMN IF NOT EXISTS "com_show_binlog_status" bigint, ADD COLUMN IF
Process 77541: SELECT pg_advisory_xact_lock($1)
2022-03-07 13:30:57.386 CET [77673] HINT: See server log for query details.
It's super easy to reproduce…
Here it is on inputs.interrupts:
2022-03-07 13:24:09.525 CET [76407] ERROR: deadlock detected
2022-03-07 13:24:09.525 CET [76407] DETAIL: Process 76407 waits for AccessExclusiveLock on relation 205806 of database 205076; blocked by process 76450.
Process 76450 waits for AccessShareLock on relation 205698 of database 205076; blocked by process 76407.
Process 76407: ALTER TABLE "public"."interrupts" ADD COLUMN IF NOT EXISTS "CPU2" bigint, ADD COLUMN IF NOT EXISTS "CPU3" bigint
Process 76450: select "time", "device", "host", "irq", "type", "total", "CPU0", "CPU1" from "public"."soft_interrupts"
and
2022-03-07 13:23:21.798 CET [76225] ERROR: deadlock detected
2022-03-07 13:23:21.798 CET [76225] DETAIL: Process 76225 waits for AccessExclusiveLock on relation 205698 of database 205076; blocked by process 76412.
Process 76412 waits for ExclusiveLock on advisory lock [205076,1328403803,908883257,1]; blocked by process 76225.
Process 76225: ALTER TABLE "public"."soft_interrupts" ADD COLUMN IF NOT EXISTS "CPU2" bigint, ADD COLUMN IF NOT EXISTS "CPU3" bigint, ADD COLUMN IF NOT EXISTS "CPU4" bigint, ADD COLUMN IF NOT EXISTS "CPU5" bigint
Process 76412: SELECT pg_advisory_xact_lock($1)
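The interleaving these logs point at, in simplified form (two psql sessions; the table name and the advisory lock key 42 are only illustrative, not the ones Telegraf actually uses, and the exact statement order inside the plugin may differ):

-- Session A (one Telegraf writer)
BEGIN;
SELECT pg_advisory_xact_lock(42);          -- takes the "schema change" advisory lock
ALTER TABLE public.soft_interrupts
  ADD COLUMN IF NOT EXISTS "CPU2" bigint;  -- now waits for AccessExclusiveLock on the table

-- Session B (another Telegraf writer), interleaved with A
BEGIN;
SELECT "time", "CPU0", "CPU1"
  FROM public.soft_interrupts;             -- holds AccessShareLock on the table
SELECT pg_advisory_xact_lock(42);          -- waits for the advisory lock held by A
-- A waits for B's table lock, B waits for A's advisory lock => deadlock detected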
Also, if you have a really big, long-running database and you gather MySQL inputs from it, it may end with an error like this:
ERROR: COPY from stdin failed: 18446744073709551615 is greater than maximum value for Int8
Workaround:
Put this into the [[inputs.mysql]] section of telegraf.conf:
fielddrop = ["wsrep_local_index","wsrep_cluster_conf_id","tokudb_flusher_cleaner_min_buffer_size","tokudb_flusher_cleaner_min_buffer_workdone"]
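For context on that Int8 error: 18446744073709551615 is 2^64 - 1, the largest unsigned 64-bit value, while PostgreSQL's bigint (int8) is a signed 64-bit type capped at 9223372036854775807, so such counters cannot be stored in a bigint column. A quick illustration in plain SQL (independent of Telegraf):

SELECT 18446744073709551615::bigint;   -- fails: value is out of range for bigint
SELECT 18446744073709551615::numeric;  -- works: numeric has arbitrary precision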
Deadlocks are generally client issues.
Since you're using Telegraf, and I see you recently posted on https://github.com/influxdata/telegraf/pull/8651, can you confirm that this is the plugin you are using? The issues you're mentioning should not occur with that plugin (but may be possible with the old, deprecated plugin). The #8651 plugin grabs a global lock before performing any schema modifications, which should prevent deadlocks. And the "is greater than maximum value for Int8" issue should be mitigated, as that plugin uses the "numeric" data type for unsigned 64-bit integers. This is why I'd like to confirm which plugin you are using. If that is the plugin you are using, then this is really interesting. If there's anything unusual about your setup that you can think of (besides just a large number of processes), it would be great if you could share it (and share it over on #8651, not here).
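At the SQL level, the schema-change serialization described above looks roughly like this (a sketch only; the actual advisory lock key, table, and statement order are whatever the plugin implements):

BEGIN;
SELECT pg_advisory_xact_lock(42);               -- global lock serializing schema changes
ALTER TABLE public.mysql
  ADD COLUMN IF NOT EXISTS "com_multi" bigint;  -- only one writer alters the table at a time
COMMIT;                                         -- transaction-scoped advisory lock released here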
Thanks
@dmilith Hey, this does not seem to be related to TimescaleDB directly. Do you see these deadlocks on a regular PostgreSQL instance (without TimescaleDB installed) too?
Hi @dmilith, just checking in on this issue. Have you been able to try the above suggestion and check whether the deadlock also occurs there? Thanks!
Dear Author,
This issue has been automatically marked as stale due to lack of activity. With only the issue description that is currently provided, we do not have enough information to take action. If you have or find the answers we would need, please reach out. Otherwise, this issue will be closed in 30 days. Thank you!
Dear Author,
We are closing this issue due to lack of activity. Feel free to add a comment to this issue if you can provide more information and we will re-open it. Thank you!