
"context canceled" error after 2.7.6 to 2.7.7 upgrade

Open max0x7ba opened this issue 1 year ago • 6 comments

After upgrading from influxdb2-2.7.6 to influxdb2-2.7.7, read queries time out and influxdb logs a lvl=warn msg="internal error not returned to client" handler=error_logger error="context canceled" warning.

Downgrading back to influxdb2-2.7.6 fixes the problem.

Steps to reproduce:

Execute a read query that returns 3,000,000+ rows of 7 columns using influxdb_client:

from(bucket:"B")
 |> range(start: 1609459200, stop: 1612137600)
 |> filter(fn: (r) => r._measurement == "M" and (r._field == "F1" or r._field == "F2" or r._field == "F3" or r._field == "F4" or r._field == "F5" or r._field == "F6"))
 |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
 |> keep(columns: ["_time","F1","F2","F3","F4","F5","F6"])
 |> map(fn: (r) => ({r with _time: uint(v: r._time)}))
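
For reference, a minimal reproduction sketch using the Python influxdb-client (1.44.0, as listed under Environment info below); the URL, token, and org are placeholders, not values from this report:

import time
from influxdb_client import InfluxDBClient

FLUX = '''
from(bucket:"B")
 |> range(start: 1609459200, stop: 1612137600)
 |> filter(fn: (r) => r._measurement == "M" and (r._field == "F1" or r._field == "F2" or r._field == "F3" or r._field == "F4" or r._field == "F5" or r._field == "F6"))
 |> pivot(rowKey: ["_time"], columnKey: ["_field"], valueColumn: "_value")
 |> keep(columns: ["_time","F1","F2","F3","F4","F5","F6"])
 |> map(fn: (r) => ({r with _time: uint(v: r._time)}))
'''

# URL, token, and org are placeholders; substitute your own deployment details.
with InfluxDBClient(url="http://localhost:8086", token="TOKEN", org="ORG") as client:
    start = time.monotonic()
    # query_stream() yields FluxRecord objects one at a time, keeping memory
    # bounded even for multi-million-row results.
    count = sum(1 for _ in client.query_api().query_stream(FLUX))
    print(f"{count} records in {time.monotonic() - start:.1f}s")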

Expected behaviour:

With influxdb2-2.7.6 the query returns data in 47 seconds. influxdb2-2.7.6 logs:

Jul 14 03:38:07 influxd-systemd-start.sh[19220]: ts=2024-07-14T02:38:07.041797Z lvl=info msg="Executed Flux query" log_id=0qNmSaHl000 compiler_type=flux response_size=237485590 query="..." stat_total_duration=50491.140ms stat_compile_duration=0.293ms stat_execute_duration=50490.799ms stat_max_allocated=576243031 stat_total_allocated=0

Actual behaviour:

With influxdb2-2.7.7 the query doesn't return anything and the client times out after 180 seconds. influxdb2-2.7.7 logs:

Jul 14 02:41:30 supernova influxd-systemd-start.sh[4208]: ts=2024-07-14T01:41:30.316178Z lvl=warn msg="internal error not returned to client" log_id=0qNdPszW000 handler=error_logger error="context canceled"
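
To check whether 2.7.7 eventually finishes or genuinely hangs, one option (a sketch assuming the Python influxdb-client listed below; connection details are placeholders) is to raise the client-side timeout well past 180 seconds:

from influxdb_client import InfluxDBClient

# The timeout argument of the Python influxdb-client is in milliseconds;
# 600_000 ms = 10 minutes. URL, token, and org are placeholders.
client = InfluxDBClient(url="http://localhost:8086", token="TOKEN", org="ORG",
                        timeout=600_000)
# Run the same Flux query as above via client.query_api(), then close.
client.close()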

Environment info:

  • Ubuntu 22.04 LTS, Linux kernel 6.5.0-42-lowlatency x86_64
  • influxdb-client-1.44.0
  • influxdb2-2.7.7-1

Config:

query-concurrency = 32
query-queue-size = 32
storage-tsm-use-madv-willneed = true
flux-log-enabled = true
storage-cache-snapshot-write-cold-duration = "24h0m0s"
storage-compact-full-write-cold-duration = "24h0m0s"
storage-retention-check-interval = "24h0m0s"
storage-shard-precreator-check-interval = "24h0m0s"
storage-max-concurrent-compactions = 1
storage-series-file-max-concurrent-snapshot-compactions = 1

max0x7ba avatar Jul 14 '24 03:07 max0x7ba

Can confirm this behaviour; what's worse is that influxdb keeps hogging the CPU indefinitely after this happens. Downgrading to 2.7.6 fixed the problem.

wollew avatar Jul 14 '24 06:07 wollew

Confirming here. Downgrading from 2.7.8 back to 2.7.6 fixed the problem for me.

desillusion avatar Jul 29 '24 10:07 desillusion

Confirming here. Downgrading back to 2.7.6 fixed the problem for me, as well as other issues such as high CPU usage.

SDpower avatar Aug 03 '24 16:08 SDpower

May be related to https://github.com/influxdata/influxdb/issues/25226

davidby-influx avatar Aug 08 '24 18:08 davidby-influx

> May be related to #25226

I wish an InfluxDB developer would comment on what is going on here and diagnose the root cause in one mental step 🤷🏼‍♂️😬

The "many eyeballs make bugs shallow" free community testing and QA work has been completed above.

It is high time the principal developers of InfluxDB picked up the ball, added the missing unit tests reproducing the problem, and fixed this bug. The unit tests would make sure this bug doesn't recur.

I don't mean to sound overbearing or like a test-driven-development bootcamp.

max0x7ba avatar Aug 08 '24 18:08 max0x7ba

2.7.10 seems to have fixed the problems, at least for me.

wollew avatar Aug 21 '24 07:08 wollew

I'm facing a similar issue on 2.7.4, with a query that uses a reduce():

ts=2024-11-05T11:43:22.710906Z lvl=warn msg="internal error not returned to client" log_id=0sebx8mG000 handler=error_logger error="context canceled"

On the influxdb script executor I get "Failed to execute Flux query".

rfontes17 avatar Nov 06 '24 06:11 rfontes17

@rfontes17 - please upgrade to 2.7.10 and test. We had some fixes for Flux performance in 2.7.10.

davidby-influx avatar Nov 06 '24 16:11 davidby-influx

I have been running 2.7.10-1 for a week; this issue no longer occurs in this version.

Thank you.

max0x7ba avatar Nov 06 '24 20:11 max0x7ba

Exact same problem here (same warning messages as OP described).

It was running fine for many months with 2.7.3 (via Docker, used by OpenHAB as its persistence service). From one day to the next this error started occurring, but only for some queries. Upgrading to 2.7.10 has not fixed the issue either.

fmeili1 avatar Nov 16 '24 18:11 fmeili1

> Exact same problem here (same warning messages as OP described).
>
> It was running fine for many months with 2.7.3 (via Docker, used by OpenHAB as its persistence service). From one day to the next this error started occurring, but only for some queries. Upgrading to 2.7.10 has not fixed the issue either.

Currently using InfluxDB v2.7.10. Same for me: something with map() is still slow compared to the older versions, 2.7.3 or 2.7.6.

|> map(fn: (r) => ({r with value: float(v: r._value)}))
This takes more than 30 seconds.
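
One way to quantify this (a sketch assuming the Python influxdb-client; the bucket, measurement, range, and connection details are placeholders, not taken from this comment) is to time the same range with and without the conversion step:

import time
from influxdb_client import InfluxDBClient

# Placeholder bucket "B" and measurement "M"; substitute real names.
BASE = 'from(bucket:"B") |> range(start: -30d) |> filter(fn: (r) => r._measurement == "M")'
WITH_MAP = BASE + ' |> map(fn: (r) => ({r with value: float(v: r._value)}))'

# URL, token, and org are placeholders.
with InfluxDBClient(url="http://localhost:8086", token="TOKEN", org="ORG") as client:
    for label, flux in (("without map()", BASE), ("with map()", WITH_MAP)):
        start = time.monotonic()
        n = sum(1 for _ in client.query_api().query_stream(flux))
        print(f"{label}: {n} records in {time.monotonic() - start:.1f}s")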

@max0x7ba @davidby-influx

Best Regards

SDpower avatar Nov 21 '24 15:11 SDpower