roachtest: restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8 failed

Open cockroach-teamcity opened this issue 1 year ago • 6 comments

roachtest.restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8 failed with artifacts on master @ 42f40f59cae3c0fd8842e194d6991c951ab4382f:

(monitor.go:149).Wait: monitor failure: Workload context was not cancelled. Error returned by workload cmd: full command output in run_120009.795339269_n5_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 134
test artifacts and logs in: /artifacts/restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0
Help

See: roachtest README

See: How To Investigate (internal)

See: Grafana

/cc @cockroachdb/disaster-recovery

This test on roachdash | Improve this report!

Jira issue: CRDB-43300

cockroach-teamcity, Oct 17 '24 12:10

roachtest.restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8 failed with artifacts on master @ 833dadd212fa4b12b1442ae8e00e85ee80a8cdce:

(monitor.go:149).Wait: monitor failure: Workload context was not cancelled. Error returned by workload cmd: full command output in run_123042.518159577_n5_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0

cockroach-teamcity, Oct 18 '24 12:10

This looks concerning.

If I roachprod stage CLUSTER-n5cpu8 cockroach 49ca24cedb042579e9645c206640d59975805d12 and then run RESTORE tpcc.customer FROM '/2024/05/16-150432.87' IN 'gs://cockroach-fixtures-us-east1/backups/tpc-c/v24.1/db/warehouses=5k?AUTH=implicit' WITH OPTIONS (experimental deferred copy, skip_missing_foreign_keys, into_db='defaultdb') I see, as I expect:

select count(*) from customer;
    count
-------------
  150000000

But on the next successfully built nightly, the same process yields:

select count(*) from customer;
   count
------------
  19715873
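
For scale: the expected count follows from the TPC-C schema, which defines 10 districts per warehouse and 3,000 customers per district, so the warehouses=5k fixture should hold 150,000,000 customer rows. A minimal Python sketch of that arithmetic (the helper names are illustrative and not part of any CockroachDB tool) shows what share of the table the second restore returned:

```python
# TPC-C scaling constants from the benchmark specification.
DISTRICTS_PER_WAREHOUSE = 10
CUSTOMERS_PER_DISTRICT = 3_000


def expected_customer_rows(warehouses: int) -> int:
    """Rows the TPC-C customer table should hold for a given warehouse count."""
    return warehouses * DISTRICTS_PER_WAREHOUSE * CUSTOMERS_PER_DISTRICT


def visible_fraction(observed: int, warehouses: int) -> float:
    """Fraction of the expected customer rows that a count(*) returned."""
    return observed / expected_customer_rows(warehouses)


if __name__ == "__main__":
    warehouses = 5_000  # from warehouses=5k in the fixture URI
    for observed in (150_000_000, 19_715_873):  # the two counts reported above
        share = visible_fraction(observed, warehouses)
        print(f"{observed:>11,} rows = {share:.1%} of expected")
```

By this measure the second restore exposes only about 13% of the expected rows, so most of the table is missing rather than a handful of rows.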

Bisecting this range is a little tedious since it straddles the Go 1.23 revert, and switching across that revert causes very slow rebuilds, but there are only a few go.mod changes in this period:

g log --oneline 49ca24cedb042579e9645c206640d59975805d12..42f40f59cae3 -- go.mod
88faab92fdf Merge #132150 #132580 #132682
cbead8492b5 Merge #132776
d56ed5c01b3 storage: integrate columnar blocks, disabled by default
86189a031fe build: Revert "build: upgrade to Go 1.23.2"
7f2a743594d Merge #132703 #132761 #132771
5f709684056 go.mod: bump Pebble to 8b6d64f23a33
1529845e685 Merge #132552
84d12ed794b changefeedccl: bump franz-go dependency to fix deadlock
7734dbbce50 go.mod: bump Pebble to 8079611f00bc

dt, Oct 18 '24 21:10

Okay, this also repros on 86189a031f, as does 5f70968405, so that halves the range.

dt, Oct 18 '24 21:10

Narrowed this down to 7734dbbce; it doesn't reproduce on 30dbb173d0.
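
The narrowing in this thread amounts to a bisection over an ordered list of commits. A minimal Python sketch of that search follows. It assumes a hypothetical `reproduces(sha)` callback (standing in for staging a build and checking the row count; it is not a real roachprod or git API), assumes the commits are ordered oldest to newest with 30dbb173d0 preceding 7734dbbce, and assumes the bug persists in every commit after the one that introduced it:

```python
from typing import Callable, Sequence


def first_bad(commits: Sequence[str], reproduces: Callable[[str], bool]) -> str:
    """Return the earliest commit for which reproduces() is true.

    commits must be ordered oldest to newest, the last commit must reproduce,
    and every commit after the first bad one must reproduce as well.
    """
    lo, hi = 0, len(commits) - 1  # invariant: commits[hi] is known bad
    while lo < hi:
        mid = (lo + hi) // 2
        if reproduces(commits[mid]):
            hi = mid        # the first bad commit is mid or earlier
        else:
            lo = mid + 1    # the first bad commit is after mid
    return commits[lo]


if __name__ == "__main__":
    # Commits named in this thread, oldest first (assumed ordering).
    commits = ["30dbb173d0", "7734dbbce", "5f70968405", "86189a031f"]

    # Stub encoding the outcomes reported in the thread; a real callback
    # would stage the build and run the restore plus the count(*) query.
    def reproduces(sha: str) -> bool:
        return sha != "30dbb173d0"

    print(first_bad(commits, reproduces))  # prints 7734dbbce
```

Each probe halves the remaining range, which matters here because every probe means a slow rebuild and a restore.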

dt, Oct 18 '24 22:10

roachtest.restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8 failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:

(monitor.go:149).Wait: monitor failure: Workload context was not cancelled. Error returned by workload cmd: full command output in run_120517.236362649_n5_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8/run_1

Parameters:

  • ROACHTEST_arch=amd64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0

cockroach-teamcity, Oct 19 '24 12:10

roachtest.restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8 failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:

(monitor.go:149).Wait: monitor failure: Workload context was not cancelled. Error returned by workload cmd: full command output in run_120950.430372455_n5_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0

Same failure on other branches

  • #133005 roachtest: restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8 failed [A-disaster-recovery C-test-failure O-roachtest O-robot T-disaster-recovery branch-release-24.3 release-blocker]

cockroach-teamcity, Oct 20 '24 12:10

roachtest.restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8 failed with artifacts on master @ 472ea07a5232c98536293d13bb46cca59f9f2cd0:

(monitor.go:149).Wait: monitor failure: Workload context was not cancelled. Error returned by workload cmd: full command output in run_123157.006906650_n5_cockroach-workload-r.log: COMMAND_PROBLEM: exit status 1
test artifacts and logs in: /artifacts/restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8/cpu_arch=arm64/run_1

Parameters:

  • ROACHTEST_arch=arm64
  • ROACHTEST_cloud=gce
  • ROACHTEST_coverageBuild=false
  • ROACHTEST_cpu=8
  • ROACHTEST_encrypted=false
  • ROACHTEST_fs=ext4
  • ROACHTEST_localSSD=false
  • ROACHTEST_runtimeAssertionsBuild=false
  • ROACHTEST_ssd=0

Same failure on other branches

  • #133005 roachtest: restore/online/workload=true/tpcc/350GB/gce/inc-count=0/nodes=4/cpus=8 failed [A-disaster-recovery C-test-failure O-roachtest O-robot T-disaster-recovery branch-release-24.3 release-blocker]

cockroach-teamcity, Oct 21 '24 12:10

Fixed in https://github.com/cockroachdb/pebble/pull/4077 which should be included in https://github.com/cockroachdb/cockroach/pull/133012

itsbilal, Oct 21 '24 14:10