
[Bug]: Segmentation fault when INNER JOINing two hypertables

Open boxhock opened this issue 2 years ago • 6 comments

What type of bug is this?

Crash, Unexpected error

What subsystems and features are affected?

Query executor

What happened?

I am trying to run a query on a hypertable that INNER JOINs it with another hypertable, but this causes a segmentation fault. The expected behavior is for the query to complete without crashing.

Changing the query to use a LEFT JOIN works as expected.
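
For reference, the working variant is the same query as shown in the log further down, with only the join type changed; roughly:

SELECT time_bucket_gapfill('1d', tbl1.timestamp) as time,
       count(*) as count
FROM table1 tbl1
         LEFT JOIN table2 tbl2    -- only change from the crashing query: INNER JOIN replaced by LEFT JOIN
                   ON tbl1.tbl2_hash = tbl2.hash
                       AND tbl1.timestamp = tbl2.timestamp
WHERE tbl1.timestamp BETWEEN '2022-01-25T21:42:21.182Z' AND '2022-06-01T21:42:21.182Z'
GROUP BY "time"
ORDER BY "time"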

TimescaleDB version affected

2.7.0

PostgreSQL version used

14.3

What operating system did you use?

Ubuntu 20.04.4 x86-64

What installation method did you use?

Deb/Apt

What platform did you run on?

On prem/Self-hosted

Relevant log output and stack trace

2022-06-05 15:34:25.631 UTC [21278] LOG:  server process (PID 146466) was terminated by signal 11: Segmentation fault
2022-06-05 15:34:25.631 UTC [21278] DETAIL:  Failed process was running: SELECT time_bucket_gapfill('1d', tbl1.timestamp) as time,
	       count(*) as count
	FROM table1 tbl1
	         INNER JOIN table2 tbl2
	                    ON tbl1.tbl2_hash = tbl2.hash
	                        AND tbl1.timestamp = tbl2.timestamp
	WHERE tbl1.timestamp BETWEEN '2022-01-25T21:42:21.182Z' AND '2022-06-01T21:42:21.182Z'
	GROUP BY "time"
	ORDER BY "time"
2022-06-05 15:34:25.631 UTC [21278] LOG:  terminating any other active server processes
2022-06-05 15:34:25.831 UTC [21278] LOG:  all server processes terminated; reinitializing
2022-06-05 15:34:26.209 UTC [146469] LOG:  database system was interrupted; last known up at 2022-06-05 15:34:11 UTC
2022-06-05 15:34:26.209 UTC [146470] [unknown]@[unknown] LOG:  PID 146259 in cancel request did not match any process
2022-06-05 15:34:26.209 UTC [146471] [unknown]@[unknown] LOG:  PID 146261 in cancel request did not match any process
2022-06-05 15:34:26.259 UTC [146469] LOG:  database system was not properly shut down; automatic recovery in progress
2022-06-05 15:34:26.261 UTC [146469] LOG:  redo starts at 9285/BF1C36D8
2022-06-05 15:34:26.833 UTC [146472] [unknown]@[unknown] LOG:  PID 146260 in cancel request did not match any process
2022-06-05 15:34:26.833 UTC [146473] [unknown]@[unknown] LOG:  PID 146262 in cancel request did not match any process
2022-06-05 15:34:26.834 UTC [146474] [unknown]@[unknown] LOG:  PID 146263 in cancel request did not match any process
2022-06-05 15:34:27.483 UTC [146469] LOG:  redo done at 9285/F3CA1028 system usage: CPU: user: 0.49 s, system: 0.72 s, elapsed: 1.22 s

How can we reproduce the bug?

I'm not entirely sure of the exact steps to reproduce, but I've reproduced this on multiple instances with roughly the following steps (a hypothetical SQL sketch of such a setup follows the list):

1. Create two hypertables
2. Populate each hypertable with 1,000,000+ rows
  a) Each row in table1 references a row in table2
  b) table2 contains more rows than table1
3. Run a query that INNER JOINs table1 and table2
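
To make the steps concrete, here is a hypothetical sketch of the kind of setup I mean. The table and column names (table1, table2, timestamp, hash, tbl2_hash) mirror the crashing query in the log above, but the column types, row counts, and generated data are assumptions rather than my actual schema:

-- Hypothetical schema; column types are guesses based on the query in the log.
CREATE TABLE table2 (
    "timestamp" timestamptz NOT NULL,
    hash        bigint      NOT NULL
);
CREATE TABLE table1 (
    "timestamp" timestamptz NOT NULL,
    tbl2_hash   bigint      NOT NULL
);
SELECT create_hypertable('table2', 'timestamp');
SELECT create_hypertable('table1', 'timestamp');

-- Step 2: populate with 1,000,000+ rows; every table1 row references a table2 row (2a),
-- and table2 holds more rows than table1 (2b).
INSERT INTO table2 ("timestamp", hash)
SELECT timestamptz '2022-01-25' + i * interval '1 second', i
FROM generate_series(1, 2000000) AS i;

INSERT INTO table1 ("timestamp", tbl2_hash)
SELECT timestamptz '2022-01-25' + i * interval '1 second', i
FROM generate_series(1, 1000000) AS i;

-- Step 3: the gapfilled INNER JOIN aggregate that crashed for me (same shape as the log above).
SELECT time_bucket_gapfill('1d', tbl1."timestamp") AS time,
       count(*) AS count
FROM table1 tbl1
         INNER JOIN table2 tbl2
                    ON tbl1.tbl2_hash = tbl2.hash
                        AND tbl1."timestamp" = tbl2."timestamp"
WHERE tbl1."timestamp" BETWEEN '2022-01-25' AND '2022-06-01'
GROUP BY "time"
ORDER BY "time";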

boxhock avatar Jun 05 '22 15:06 boxhock

@boxhock can you please share the stack trace of the crashing backend? You can use the instructions below to get a stack trace on Ubuntu:

https://wiki.postgresql.org/wiki/Getting_a_stack_trace_of_a_running_PostgreSQL_backend_on_Linux/BSD#On_Ubuntu
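
The first step in that guide is finding the PID of the backend that will run the crashing query; you can get it from the same session before running the query, for example:

-- Run in the session that will execute the crashing query,
-- then attach gdb to the reported PID as described in the wiki page.
SELECT pg_backend_pid();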

nikkhils avatar Jun 06 '22 08:06 nikkhils

Hey @boxhock ... did you have a chance to get a stack trace, as Nikhil asked above?

fabriziomello avatar Jun 13 '22 14:06 fabriziomello

Hey @boxhock, any chance you could provide a stack trace for us? Also, can you please share the exact schemas of the tables you used when reproducing the issue, and maybe some sample data? That would help us reproduce the issue consistently and find the root cause. Thank you.

horzsolt avatar Jun 29 '22 07:06 horzsolt

This issue has been automatically marked as stale due to lack of activity. You can remove the stale label or comment. Otherwise, this issue will be closed in 30 days. Thank you!

github-actions[bot] avatar Aug 29 '22 02:08 github-actions[bot]

I have the same problem on PG 14.5 with TSDB 2.8.0, getting a segfault when using time_bucket_gapfill with an INNER JOIN between two hypertables. Considering going back to v1.7.5 😞

I'm running Postgres in Docker, and I've no clue how to get a stack trace out of it - if someone can advise, I'd be happy to try?

cocowalla avatar Sep 21 '22 11:09 cocowalla

Hello @cocowalla,

Thank you very much for reaching out. Creating stack traces in a Docker container can be laborious. Do you have a self-contained example that reproduces the crash that you can share? I would then try to reproduce the crash on my system and generate a stack trace.

jnidzwetzki avatar Sep 21 '22 13:09 jnidzwetzki

@jnidzwetzki I was literally just putting something together, and I saw a notification that it's fixed!

What's the easiest way for me to test it before it hits v2.8.1 please?

cocowalla avatar Sep 28 '22 17:09 cocowalla

The easiest way is the nightly Docker image: timescaledev/timescaledb:nightly-pg14. You might have to wait until tomorrow for it to pick up the bugfix.

svenklemm avatar Sep 28 '22 17:09 svenklemm

The latest timescaledev build is 20 days old (seems odd for a Nightly tag 😕!), so I built it myself instead. Now, the segfault is fixed, which is great!

But this issue only seems to be partially resolved: if I do an INNER JOIN between the tables, or a LEFT JOIN that references the joined table in the WHERE clause, I get:

ORDER/GROUP BY expression not found in targetlist
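
To illustrate, the shape I'm hitting looks roughly like the following; the table and column names are borrowed from the original report here, not my actual schema:

-- Hypothetical shape only: a LEFT JOIN on its own is fine, but referencing the
-- joined table in the WHERE clause (or using an INNER JOIN) produces the error above.
SELECT time_bucket_gapfill('1d', t1."timestamp") AS time,
       count(*) AS count
FROM table1 t1
         LEFT JOIN table2 t2
                   ON t1.tbl2_hash = t2.hash
                       AND t1."timestamp" = t2."timestamp"
WHERE t1."timestamp" BETWEEN '2022-01-25' AND '2022-06-01'
  AND t2.hash IS NOT NULL  -- the WHERE reference to the joined table
GROUP BY "time"
ORDER BY "time";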

I'll prepare a repro and create a new issue.

cocowalla avatar Sep 29 '22 13:09 cocowalla

@jnidzwetzki I just opened issue #4774 in relation to this

cocowalla avatar Sep 29 '22 14:09 cocowalla