postgres icon indicating copy to clipboard operation
postgres copied to clipboard

server process (PID 229) was terminated by signal 7: Bus error

Open dylan2intel opened this issue 2 years ago • 4 comments

postgres 14 deployed in kubernetes with 64Gi huge page on running TPC-C benchmarking with 800 warehouses(~80GB data size)

$ free -g
              total        used        free      shared  buff/cache   available
Mem:            503         134         256           0         113         366
Swap:             0           0           0

kubectl version

Client Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.4", GitCommit:"e6c093d87ea4cbb530a7b2ae91e54c0842d8308a", GitTreeState:"clean", BuildDate:"2022-02-16T12:38:05Z", GoVersion:"go1.17.7", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"23", GitVersion:"v1.23.5", GitCommit:"c285e781331a3785a7f436042c65c5641ce8a9e9", GitTreeState:"clean", BuildDate:"2022-03-16T15:52:18Z", GoVersion:"go1.17.8", Compiler:"gc", Platform:"linux/amd64"}

Detailed errors:

kubectl logs -n sfdev-test-static-hammerdb-tpcc-postgresql-disk-hugepage-on-128   postgresql-7df8c8b6c4-q9jz2
....
.....
2022-04-05 11:55:28.034 GMT [1] LOG:  database system is ready to accept connections
**2022-04-05 12:06:05.719 GMT [1] LOG:  server process (PID 229) was terminated by signal 7: Bus error
2022-04-05 12:06:05.719 GMT [1] DETAIL:  Failed process was running: insert into order_line (ol_o_id, ol_d_id, ol_w_id, ol_number, ol_i_id, ol_supply_w_id, ol_quantity, ol_amount, ol_dist_info, ol_delivery_d) values ('2601','6','798','1','78438','798','5','0.0','WYsLGpch9MMPlrLtOvBAcfPY',null),('2601','6','798','2','78366','798','5','0.0','dkbbzWuT2fYGEA46ysZT4Soq',null),('2601','6','798','3','83817','798','5','0.0','akE67RK2vhlovKtmTT9h4OR6',null),('2601','6','798','4','36375','798','5','0.0','ayTlJV0KL3FbYB4w57kes8VY',null),('2601','6','798','5','40293','798','5','0.0','ZXl83xwLLJnEjeRTwHpayPfy',null),('2601','6','798','6','5079','798','5','0.0','S3zeBuSPobwm9mjZj6tnuShV',null),('2601','6','798','7','73194','798','5','0.0','teE2ELm3J5a81TK2McP10ojr',null),('2601','6','798','8','74758','798','5','0.0','yaykpdVaqKrKZdtSujY2WpNx',null),('2602','6','798','1','25330','798','5','0.0','atSwfYVJbCjI5gfj6S2tjxPE',null),('2602','6','798','2','74285','798','5','0.0','KqMS78XCkxexHrRxSXsQfMT6',null),('2602','6','798','3','10538','798','5','0.0','Foz8BVQA1qsu8lH80MZWXiC6',null),('2602',
2022-04-05 12:06:05.719 GMT [1] LOG:  terminating any other active server processes**
2022-04-05 12:06:05.775 GMT [1] LOG:  all server processes terminated; reinitializing
2022-04-05 12:06:06.024 GMT [249] LOG:  database system was interrupted; last known up at 2022-04-05 12:05:35 GMT
2022-04-05 12:06:08.718 GMT [249] LOG:  database system was not properly shut down; automatic recovery in progress
2022-04-05 12:06:08.718 GMT [249] LOG:  redo starts at 9/F358258
2022-04-05 12:07:19.931 GMT [258] FATAL:  the database system is in recovery mode
2022-04-05 12:07:39.762 GMT [261] FATAL:  the database system is in recovery mode
2022-04-05 12:07:44.540 GMT [263] FATAL:  the database system is in recovery mode
2022-04-05 12:07:56.819 GMT [265] FATAL:  the database system is in recovery mode
2022-04-05 12:08:47.695 GMT [249] LOG:  redo done at 14/74451FB0 system usage: CPU: user: 133.00 s, system: 25.88 s, elapsed: 158.97 s
2022-04-05 12:08:55.073 GMT [1] LOG:  issuing SIGKILL to recalcitrant children
2022-04-05 12:08:55.085 GMT [1] LOG:  all server processes terminated; reinitializing
2022-04-05 12:08:55.314 GMT [273] LOG:  database system shutdown was interrupted; last known up at 2022-04-05 12:08:47 GMT
2022-04-05 12:08:56.072 GMT [273] LOG:  database system was not properly shut down; automatic recovery in progress
2022-04-05 12:08:56.073 GMT [273] LOG:  redo starts at 9/F358258
2022-04-05 12:08:58.177 GMT [275] FATAL:  the database system is in recovery mode
2022-04-05 12:09:02.147 GMT [277] FATAL:  the database system is in recovery mode
2022-04-05 12:09:04.930 GMT [279] FATAL:  the database system is in recovery mode
2022-04-05 12:11:29.808 GMT [273] LOG:  successfully skipped missing contrecord at 14/74451FF8, overwritten at 2022-04-05 12:08:47.699097+00
2022-04-05 12:11:29.808 GMT [273] CONTEXT:  WAL redo at 14/74452018 for XLOG/OVERWRITE_CONTRECORD: lsn 14/74451FF8; time 2022-04-05 12:08:47.699097+00
2022-04-05 12:11:29.808 GMT [273] LOG:  invalid record length at 14/74452048: wanted 24, got 0
2022-04-05 12:11:29.808 GMT [273] LOG:  redo done at 14/74452018 system usage: CPU: user: 131.11 s, system: 22.60 s, elapsed: 153.73 s
2022-04-05 12:12:29.781 GMT [1] LOG:  database system is ready to accept connections


dylan2intel avatar Apr 05 '22 12:04 dylan2intel

https://github.com/kubernetes/kubernetes/issues/71233#issuecomment-447472125 It seems that huge pages don't work with k8s

wglambert avatar Apr 05 '22 16:04 wglambert

Perhaps also https://github.com/docker-library/postgres/issues/451#issuecomment-447472044 and other comments there.

yosifkit avatar Apr 05 '22 16:04 yosifkit

kubernetes/kubernetes#71233 (comment) It seems that huge pages don't work with k8s

NOT AGREE WITH YOU, ACTUALLY k8s works fine with mysql with hugepage on

dylan2intel avatar Apr 20 '22 02:04 dylan2intel

Perhaps also #451 (comment) and other comments there.

The background is that i have an APPLE-TO-APPLE workload comparison with hugepage on/off, so i cannot set it to off.

dylan2intel avatar Apr 20 '22 03:04 dylan2intel

Closing in favor of https://github.com/docker-library/docs/issues/2355 (we're not going to change the image / default configuration for this because we try to follow upstream's defaults faithfully, but better documentation would absolutely be appropriate :heart:).

tianon avatar Dec 20 '23 20:12 tianon