roachtest: restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8 failed
roachtest.restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8 failed with artifacts on master @ 42f40f59cae3c0fd8842e194d6991c951ab4382f:
(monitor.go:149).Wait: monitor failure: Workload context was not cancelled. Error returned by workload cmd: full command output in run_120009.795339269_n5_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 134
test artifacts and logs in: /artifacts/restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8/run_1
Parameters:
- ROACHTEST_arch=amd64
- ROACHTEST_cloud=gce
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=8
- ROACHTEST_encrypted=false
- ROACHTEST_fs=ext4
- ROACHTEST_localSSD=false
- ROACHTEST_runtimeAssertionsBuild=false
- ROACHTEST_ssd=0
Jira issue: CRDB-43300
roachtest.restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8 failed with artifacts on master @ 833dadd212fa4b12b1442ae8e00e85ee80a8cdce:
(monitor.go:149).Wait: monitor failure: Workload context was not cancelled. Error returned by workload cmd: full command output in run_123042.518159577_n5_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8/cpu_arch=arm64/run_1
Parameters:
- ROACHTEST_arch=arm64
- ROACHTEST_cloud=gce
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=8
- ROACHTEST_encrypted=false
- ROACHTEST_fs=ext4
- ROACHTEST_localSSD=false
- ROACHTEST_runtimeAssertionsBuild=false
- ROACHTEST_ssd=0
This looks concerning.
If I stage a build with
roachprod stage CLUSTER-n5cpu8 cockroach 49ca24cedb042579e9645c206640d59975805d12
and then run
RESTORE tpcc.customer FROM '/2024/05/16-150432.87' IN 'gs://cockroach-fixtures-us-east1/backups/tpc-c/v24.1/db/warehouses=5k?AUTH=implicit' WITH OPTIONS (experimental deferred copy, skip_missing_foreign_keys, into_db='defaultdb')
I see the row count I expect:
select count(*) from customer;
count
-------------
150000000
But on the next successfully built nightly, the same process yields:
select count(*) from customer;
count
------------
19715873
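As a sanity check on the two counts above, here is a minimal Python sketch of the arithmetic. It assumes the standard TPC-C sizing of 10 districts per warehouse and 3,000 customers per district (taken from the TPC-C specification, not from this thread); the function names are illustrative only.

```python
# TPC-C sizing constants (assumption: standard TPC-C scaling rules).
DISTRICTS_PER_WAREHOUSE = 10
CUSTOMERS_PER_DISTRICT = 3000


def expected_customer_rows(warehouses: int) -> int:
    """Number of customer rows a TPC-C dataset of this size should contain."""
    return warehouses * DISTRICTS_PER_WAREHOUSE * CUSTOMERS_PER_DISTRICT


def missing_rows(expected: int, actual: int) -> int:
    """How many rows are absent relative to the expected count."""
    return expected - actual


if __name__ == "__main__":
    expected = expected_customer_rows(5000)  # the warehouses=5k fixture
    actual = 19715873                        # count seen on the later nightly
    print(f"expected={expected} actual={actual}")
    print(f"missing={missing_rows(expected, actual)}")
    print(f"fraction restored={actual / expected:.1%}")
```

Under those assumptions the good build's 150000000 matches the full fixture, while the later nightly returns only a small fraction of the table.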
Bisecting in this range is a little tedious, since it straddles the Go 1.23 revert and switching across that causes very slow rebuilds. But there are only a few go.mod changes in this period:
git log --oneline 49ca24cedb042579e9645c206640d59975805d12..42f40f59cae3 -- go.mod
88faab92fdf Merge #132150 #132580 #132682
cbead8492b5 Merge #132776
d56ed5c01b3 storage: integrate columnar blocks, disabled by default
86189a031fe build: Revert "build: upgrade to Go 1.23.2"
7f2a743594d Merge #132703 #132761 #132771
5f709684056 go.mod: bump Pebble to 8b6d64f23a33
1529845e685 Merge #132552
84d12ed794b changefeedccl: bump franz-go dependency to fix deadlock
7734dbbce50 go.mod: bump Pebble to 8079611f00bc
Okay, this also repros on 86189a031f and on 5f70968405, so that halves the range.
Narrowed this down to 7734dbbce; it doesn't reproduce on 30dbb173d0.
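The narrowing above is a binary search over the go.mod-touching commits. A minimal Python sketch of that search follows. Several things here are assumptions rather than facts from the thread: that 30dbb173d0 sits immediately before 7734dbbce50 in history, that the failure is monotone (every commit after the first bad one is also bad), and the `stand_in_is_bad` predicate, which only mimics the reported outcome. A real predicate would stage each build, run the RESTORE, and compare the row count.

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], is_bad: Callable[[str], bool]) -> str:
    """Return the earliest bad commit.

    Assumes commits are ordered oldest to newest, the newest is bad,
    and badness is monotone (once bad, always bad).
    """
    lo, hi = 0, len(commits) - 1
    while lo < hi:
        mid = (lo + hi) // 2
        if is_bad(commits[mid]):
            hi = mid        # first bad commit is at mid or earlier
        else:
            lo = mid + 1    # first bad commit is after mid
    return commits[lo]


# Oldest to newest: the reported-good commit (placement assumed), followed
# by the go.mod-touching commits from the log output in reverse order.
CANDIDATES = [
    "30dbb173d0",
    "7734dbbce50",
    "84d12ed794b",
    "1529845e685",
    "5f709684056",
    "7f2a743594d",
    "86189a031fe",
    "d56ed5c01b3",
    "cbead8492b5",
    "88faab92fdf",
]


def stand_in_is_bad(commit: str) -> bool:
    """Hypothetical stand-in: treats 7734dbbce50 and everything after it as bad."""
    return CANDIDATES.index(commit) >= CANDIDATES.index("7734dbbce50")


if __name__ == "__main__":
    print(first_bad(CANDIDATES, stand_in_is_bad))
```

With ten candidates the search needs at most four builds, which matters when each switch across the Go revert forces a slow rebuild.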
roachtest.restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8 failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:
(monitor.go:149).Wait: monitor failure: Workload context was not cancelled. Error returned by workload cmd: full command output in run_120517.236362649_n5_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8/run_1
Parameters:
- ROACHTEST_arch=amd64
- ROACHTEST_cloud=gce
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=8
- ROACHTEST_encrypted=false
- ROACHTEST_fs=ext4
- ROACHTEST_localSSD=false
- ROACHTEST_runtimeAssertionsBuild=false
- ROACHTEST_ssd=0
roachtest.restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8 failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:
(monitor.go:149).Wait: monitor failure: Workload context was not cancelled. Error returned by workload cmd: full command output in run_120950.430372455_n5_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8/cpu_arch=arm64/run_1
Parameters:
- ROACHTEST_arch=arm64
- ROACHTEST_cloud=gce
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=8
- ROACHTEST_encrypted=false
- ROACHTEST_fs=ext4
- ROACHTEST_localSSD=false
- ROACHTEST_runtimeAssertionsBuild=false
- ROACHTEST_ssd=0
Same failure on other branches
- #133005 roachtest: restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8 failed [A-disaster-recovery C-test-failure O-roachtest O-robot T-disaster-recovery branch-release-24.3 release-blocker]
roachtest.restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8 failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:
(monitor.go:149).Wait: monitor failure: Workload context was not cancelled. Error returned by workload cmd: full command output in run_123157.006906650_n5_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8/cpu_arch=arm64/run_1
Parameters:
- ROACHTEST_arch=arm64
- ROACHTEST_cloud=gce
- ROACHTEST_coverageBuild=false
- ROACHTEST_cpu=8
- ROACHTEST_encrypted=false
- ROACHTEST_fs=ext4
- ROACHTEST_localSSD=false
- ROACHTEST_runtimeAssertionsBuild=false
- ROACHTEST_ssd=0
Same failure on other branches
- #133005 roachtest: restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8 failed [A-disaster-recovery C-test-failure O-roachtest O-robot T-disaster-recovery branch-release-24.3 release-blocker]
Fixed in https://github.com/cockroachdb/pebble/pull/4077, which should be included in https://github.com/cockroachdb/cockroach/pull/133012