scylla-bench

panic: runtime error: invalid memory address or nil pointer dereference

Open yarongilor opened this issue 2 years ago • 7 comments

Installation details

Kernel Version: 5.15.0-1019-aws

Scylla version (or git commit hash): 5.0.3-20220907.b9a61c8e9 with build-id 7be266d2954825cdf843c744de04a0443a8f156c

Relocatable Package: http://downloads.scylladb.com/downloads/scylla/relocatable/scylladb-5.0/scylla-x86_64-package-5.0.3.0.20220907.b9a61c8e9.tar.gz

Cluster size: 4 nodes (i3en.3xlarge)

Scylla Nodes used in this run:

  • longevity-large-partitions-4d-scyll-db-node-e83f2364-4 (13.51.47.122 | 10.0.3.49) (shards: 12)
  • longevity-large-partitions-4d-scyll-db-node-e83f2364-3 (16.16.25.148 | 10.0.3.35) (shards: 12)
  • longevity-large-partitions-4d-scyll-db-node-e83f2364-2 (16.170.221.203 | 10.0.2.203) (shards: 12)
  • longevity-large-partitions-4d-scyll-db-node-e83f2364-1 (13.49.225.5 | 10.0.0.73) (shards: 12)

OS / Image: ami-03fc0de751a0b3314 (aws: eu-north-1)

Test: longevity-large-partition-4days-test-rq

Test id: e83f2364-eaf5-4e99-8c28-433f76c2a24e

Test name: scylla-staging/Longevity_yaron/longevity-large-partition-4days-test-rq

Test config file(s):

Issue description

>>>>>>>

  1. Started 3 read stress commands (ASC, DESC, and ASC/DESC/None) that ran "ok" for 10 hours.
  2. Throughput was relatively low: 5k.
  3. After ~8.5 hours, the stress commands started getting quorum timeouts.
  4. After 10 hours, one of the stress commands (ASC) failed because of these timeouts, and the loader also panicked with "invalid memory address or nil pointer dereference".
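For context on the quorum timeouts in step 3: assuming a single-datacenter keyspace, a QUORUM read needs floor(RF/2) + 1 replica responses, which is why the later log lines say "received only 1 responses from 2 CL=QUORUM". The minimal Go sketch below only illustrates that arithmetic. The test's replication factor is not stated in this report; RF=2 and RF=3 both give the 2 required responses.

```go
package main

import "fmt"

// quorum returns how many replica responses a QUORUM request needs for a
// given replication factor, assuming a single-datacenter keyspace:
// floor(rf/2) + 1 (integer division already floors for positive ints).
func quorum(rf int) int {
	return rf/2 + 1
}

func main() {
	for _, rf := range []int{1, 2, 3, 5} {
		fmt.Printf("RF=%d -> QUORUM needs %d responses\n", rf, quorum(rf))
	}
}
```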

Error from the scylla-bench (s-b) log:

yarongilor@yarongilor:~/Downloads/logs/loader-set-e83f2364$ tail scylla-bench-l0-34379347-f65c-4327-808c-554732ff449c.log -n 40
  
10h32m19.7s    7718   77180      0 1.9s   1.5s   706ms  495ms  260ms  3.8ms  65ms   
panic: runtime error: invalid memory address or nil pointer dereference
10h32m20.8s    7616   76160      0 1.9s   1.6s   703ms  502ms  287ms  3.9ms  67ms   
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x5fb33c]

goroutine 1 [running]:
github.com/HdrHistogram/hdrhistogram-go.(*iterator).next(0xc000171448)
	/go/pkg/mod/github.com/!hdr!histogram/hdrhistogram-go@<version>/hdr.go:670 +0x1c
github.com/HdrHistogram/hdrhistogram-go.(*rIterator).next(...)
	/go/pkg/mod/github.com/!hdr!histogram/hdrhistogram-go@<version>/hdr.go:683
github.com/HdrHistogram/hdrhistogram-go.(*Histogram).Merge(0xf0000000e?, 0x4000000000a?)
	/go/pkg/mod/github.com/!hdr!histogram/hdrhistogram-go@<version>/hdr.go:177 +0x8d
github.com/scylladb/scylla-bench/pkg/results.(*MergedResult).AddResult(0xc2abffef60, {0x0, 0x0, 0x0, 0x0, 0x0, {0x0, 0x0, 0x0}, 0x0, ...})
	/go/scylla-bench-0.1.11/pkg/results/merged_result.go:53 +0x1b0
github.com/scylladb/scylla-bench/pkg/results.(*TestResults).GetResultsFromThreadsAndMerge(0xc000691380)
	/go/scylla-bench-0.1.11/pkg/results/thread_results.go:60 +0x89
github.com/scylladb/scylla-bench/pkg/results.(*TestResults).GetTotalResults(0xc000691380)
	/go/scylla-bench-0.1.11/pkg/results/thread_results.go:82 +0xcc
main.main()
	/go/scylla-bench-0.1.11/main.go:596 +0x355d
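The trace shows AddResult being called with an all-zero result ({0x0, 0x0, ...}). One plausible way to get such a value in Go, offered here as an assumption rather than a confirmed diagnosis, is a receive from a closed channel: it returns the element type's zero value immediately, and any pointer field in it (such as a histogram) is nil. The sketch below uses hypothetical types and helpers (result, histogram, collect), not scylla-bench's real ones, to show how the comma-ok receive tells a real value apart from a closed channel.

```go
package main

import "fmt"

// histogram is a hypothetical stand-in for *hdrhistogram.Histogram.
type histogram struct{ counts []int64 }

// result is a hypothetical, simplified per-thread result; it is NOT
// scylla-bench's real results type.
type result struct {
	ops     int
	latency *histogram // nil in the zero value
}

// collect receives up to n results from ch and stops early if ch is
// closed, instead of passing zero values on to a merge step.
func collect(ch <-chan result, n int) []result {
	var out []result
	for i := 0; i < n; i++ {
		r, ok := <-ch
		if !ok {
			// Closed channel: r is the zero value, so r.latency is nil.
			break
		}
		out = append(out, r)
	}
	return out
}

func main() {
	ch := make(chan result, 2)
	ch <- result{ops: 10, latency: &histogram{}}
	close(ch) // the second expected result never arrives

	got := collect(ch, 2)
	fmt.Println(len(got)) // 1

	// A plain receive on the now-empty, closed channel yields the zero value.
	r := <-ch
	fmt.Println(r.ops, r.latency == nil) // 0 true
}
```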

[Screenshot from 2022-09-28 13-19-26]

<<<<<<<

  • Restore Monitor Stack command: $ hydra investigate show-monitor e83f2364-eaf5-4e99-8c28-433f76c2a24e
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs e83f2364-eaf5-4e99-8c28-433f76c2a24e

Logs:

No logs captured during this run.

Jenkins job URL

yarongilor, Sep 28 '22

@vponomaryov do you see what's causing this issue?

roydahan, Oct 03 '22

> @vponomaryov do you see what's causing this issue?

General info: such errors appear when we try to use an uninitialized Go object (for example, a nil pointer). This may be caused either by a race condition or by an unhandled error.

I haven't investigated it yet, so I don't know the cause.
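To illustrate the general point above: in Go, dereferencing a nil pointer (for example, a pointer field that was never initialized) raises exactly this runtime panic. The sketch uses a hypothetical histogram type and helpers (merge, safeMerge, panics), not the real hdrhistogram-go API, and shows one kind of nil guard a fix could add. It is an illustration only, not the fix discussed in this thread.

```go
package main

import (
	"errors"
	"fmt"
)

// histogram is a hypothetical stand-in for *hdrhistogram.Histogram.
type histogram struct{ total int64 }

// merge adds other's total into h. It dereferences other, so a nil
// argument triggers "invalid memory address or nil pointer dereference".
func (h *histogram) merge(other *histogram) {
	h.total += other.total
}

// safeMerge is a hypothetical guarded variant that returns an error for a
// nil source histogram instead of panicking.
func safeMerge(dst, src *histogram) error {
	if src == nil {
		return errors.New("skipping nil histogram")
	}
	dst.merge(src)
	return nil
}

// panics runs f and reports whether it panicked, with the panic message.
func panics(f func()) (msg string, ok bool) {
	defer func() {
		if r := recover(); r != nil {
			msg, ok = fmt.Sprint(r), true
		}
	}()
	f()
	return "", false
}

func main() {
	total := &histogram{}
	var uninitialized *histogram // zero value of a pointer: nil

	msg, ok := panics(func() { total.merge(uninitialized) })
	fmt.Println(ok, msg)
	// true runtime error: invalid memory address or nil pointer dereference

	fmt.Println(safeMerge(total, uninitialized))
	// skipping nil histogram
}
```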

vponomaryov, Oct 03 '22

It seems that when using the reverse-query feature, we hit this issue after a few hours of running.

roydahan, Oct 03 '22

@yarongilor I created a PR with a possible fix here: https://github.com/scylladb/scylla-bench/pull/109 It may fix this issue, but that is not guaranteed; it needs testing.

Update: I created a Docker image with it: vponomarovatscylladb/hydra-loaders:scylla-bench-v0.1.12--fix-issue-107-candidate So just update your configuration to use it.

vponomaryov, Oct 04 '22

Hit it once again using the same scylla-bench config file (changed to some extent since then).

Installation details

Kernel Version: 5.15.0-1030-aws

Scylla version (or git commit hash): 5.2.0~rc2-20230228.908a82bea064 with build-id 2d8e1ab089ec69c36323037d66b1a72accfae399

Cluster size: 4 nodes (is4gen.4xlarge)

Scylla Nodes used in this run:

  • longevity-large-partitions-4d-5-2-db-node-c3260702-4 (34.245.121.241 | 10.4.1.65) (shards: 15)
  • longevity-large-partitions-4d-5-2-db-node-c3260702-3 (52.211.199.21 | 10.4.3.139) (shards: 15)
  • longevity-large-partitions-4d-5-2-db-node-c3260702-2 (52.48.75.186 | 10.4.0.214) (shards: 15)
  • longevity-large-partitions-4d-5-2-db-node-c3260702-1 (34.244.173.7 | 10.4.2.232) (shards: 15)

OS / Image: ami-074d26a74b8f73dba (aws: eu-west-1)

Test: longevity-large-partition-4days-arm-test

Test id: c3260702-5b50-4389-8303-7464c8d5e384

Test name: scylla-5.2/longevity/longevity-large-partition-4days-arm-test

Test config file(s):

Details:

It had 3 loaders. Pre-load finished without errors. Then the main read stress commands failed on 2 of the 3 loaders. One of the loader failures is the same as in this bug report:

2023/03/02 22:13:31 Operation timed out for scylla_bench.test - received only 1 responses from 2 CL=QUORUM.
2023/03/02 22:13:31 Operation timed out for scylla_bench.test - received only 1 responses from 2 CL=QUORUM.
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x40 pc=0x5fc7dc]

goroutine 1 [running]:
github.com/HdrHistogram/hdrhistogram-go.(*iterator).next(0xc000119038)
	/go/pkg/mod/github.com/!hdr!histogram/hdrhistogram-go@<version>/hdr.go:670 +0x1c
github.com/HdrHistogram/hdrhistogram-go.(*rIterator).next(...)
	/go/pkg/mod/github.com/!hdr!histogram/hdrhistogram-go@<version>/hdr.go:683
github.com/HdrHistogram/hdrhistogram-go.(*Histogram).Merge(0xf0000000e?, 0x4000000000a?)
	/go/pkg/mod/github.com/!hdr!histogram/hdrhistogram-go@<version>/hdr.go:177 +0x8d
github.com/scylladb/scylla-bench/pkg/results.(*MergedResult).AddResult(0xc1e0defb60, {0x0, 0x0, 0x0, 0x0, 0x0, {0x0, 0x0, 0x0}, 0x0, ...})
	/go/scylla-bench-0.1.15/pkg/results/merged_result.go:53 +0x1b0
github.com/scylladb/scylla-bench/pkg/results.(*TestResults).GetResultsFromThreadsAndMerge(0xc000413b80)
	/go/scylla-bench-0.1.15/pkg/results/thread_results.go:60 +0x89
github.com/scylladb/scylla-bench/pkg/results.(*TestResults).GetTotalResults(0xc000413b80)
	/go/scylla-bench-0.1.15/pkg/results/thread_results.go:82 +0xcc
main.main()
	/go/scylla-bench-0.1.15/main.go:631 +0x39bd

It failed after 35 minutes of running.

  • Restore Monitor Stack command: $ hydra investigate show-monitor c3260702-5b50-4389-8303-7464c8d5e384
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs c3260702-5b50-4389-8303-7464c8d5e384

Logs:

Jenkins job URL

vponomaryov, Mar 03 '23

@roydahan @fgelcer @fruch JFYI: the proposed fix in https://github.com/scylladb/scylla-bench/pull/109 hasn't received any attention since October 2022.

vponomaryov, Mar 03 '23

> @roydahan @fgelcer @fruch JFYI: the proposed fix in https://github.com/scylladb/scylla-bench/pull/109 hasn't received any attention since October 2022.

I assumed it was a side effect of running out of memory, but we can get it merged, regardless.

fruch, Mar 03 '23