gemini icon indicating copy to clipboard operation
gemini copied to clipboard

runtime: marked free object in span

Open soyacz opened this issue 2 years ago • 4 comments

Issue description

Gemini failed with some error:

2023-03-07 11:53:03.119: (GeminiStressEvent Severity.ERROR) period_type=end event_id=26aa615b-3e24-4278-8755-eed0f8d5961a duration=3h56m16s: node=Node gemini-with-nemesis-3h-normal-5-0-loader-node-c2cc17cd-1 [3.215.134.19 | 10.0.0.197] (seed: False) gemini_cmd=/$HOME/gemini -d --duration 3h --warmup 30m -c 50 -m mixed -f --non-interactive --cql-features normal --max-mutation-retries 5 --max-mutation-retries-backoff 500ms --async-objects-stabilization-attempts 5 --async-objects-stabilization-backoff 500ms --replication-strategy "{'class': 'SimpleStrategy', 'replication_factor': '3'}" --oracle-replication-strategy "{'class': 'SimpleStrategy', 'replication_factor': '1'}" --test-cluster=10.0.3.199,10.0.0.101,10.0.2.18 --outfile /home/centos/gemini_result_d63e4ea6-aa6f-4671-965b-7d0174f6f080.log --seed 49 --request-timeout 180s --connect-timeout 120s --oracle-cluster=10.0.2.248
result=Exit code: 2
Command output: ['}', '{"L":"INFO","T":"2023-03-07T07:56:52.237Z","N":"generator","M":"starting partition key generation loop"}']
errors=['Command error: runtime: marked free object in span 0x7fb8c6f1f2a0, elemsize=16 freeindex=0 (bad use of unsafe.Pointer? try -d=checkptr)\n0xc0c6b06000 free  unmarked\n0xc0c6b06010 free  unmarked\n0xc0c6b06020 free  unmarked\n0xc0c6b06030 free  unmarked\n0xc0c6b06040 free  unmarked\n0xc0c6b06050 free  unmarked\n0xc0c6b06060 free  unmarked\n0xc0c6b06070 free  unmarked\n0xc0c6b06080 free  unmarked\n0xc0c6b06090 free  unmarked\n0xc0c6b060a0 free  unmarked\n0xc0c6b060b0 free  unmarked\n0xc0c6b060c0 free  unmarked\n0xc0c6b060d0 free  unmarked\n0xc0c6b060e0 free  unmarked\n0xc0c6b060f0 free  unmarked\n0xc0c6b06100 free  unmarked\n0xc0c6b06110 free  unmarked\n0xc0c6b06120 free  unmarked\n0xc0c6b06130 free  
[...] - long stack

Installation details

Kernel Version: 5.15.0-1030-aws Scylla version (or git commit hash): 5.0.11-20230223.019d5cde1 with build-id 0dbc6fc31785d7f3f3ac86bb876d008fe86503b9

Cluster size: 3 nodes (i3.large)

Scylla Nodes used in this run:

  • gemini-with-nemesis-3h-normal-5-0-oracle-db-node-c2cc17cd-1 (3.238.153.121 | 10.0.2.248) (shards: 2)
  • gemini-with-nemesis-3h-normal-5-0-db-node-c2cc17cd-4 (3.236.168.242 | 10.0.3.231) (shards: 2)
  • gemini-with-nemesis-3h-normal-5-0-db-node-c2cc17cd-3 (3.234.141.113 | 10.0.2.18) (shards: 2)
  • gemini-with-nemesis-3h-normal-5-0-db-node-c2cc17cd-2 (34.205.81.229 | 10.0.0.101) (shards: 2)
  • gemini-with-nemesis-3h-normal-5-0-db-node-c2cc17cd-1 (100.27.7.28 | 10.0.3.199) (shards: 2)

OS / Image: ami-0311e217a4fa530cc (aws: us-east-1)

Test: gemini-3h-with-nemesis-test Test id: c2cc17cd-fd2d-4624-b61d-27308f0efc30 Test name: scylla-5.0/gemini-/gemini-3h-with-nemesis-test Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor c2cc17cd-fd2d-4624-b61d-27308f0efc30
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs c2cc17cd-fd2d-4624-b61d-27308f0efc30

Logs:

Jenkins job URL

soyacz avatar Mar 07 '23 15:03 soyacz

@dkropachev possibly the same as the other one I asked you about.

roydahan avatar Apr 23 '23 13:04 roydahan

@roydahan , this is different, there is a place where gemini uses unsafe.Pointer to address part of object that is being deleted in parallel, i have looked over gemini and found exactly one place where unsafe.Pointer is used and it is not it. So, this issue comes from dependancies, which makes it really hard to investigate, but I'am working on it

dkropachev avatar Apr 24 '23 17:04 dkropachev

This issue could be triggered in three ways:

  1. Unsafe use of unsafe.Pointer() by libraries that are used by gemini. Gemini itself is not doing anything that could cause it
  2. Bug in compiler that leads to having pointer to an memory of block that was freed by GC
  3. Bug in GC that makes it overlook pointer to an object on one cycle and picking up pointer on next cycle

None of them are related to the gemini it self. There was similar bug in Golang v1.17,v1.16 and v1.15. But I beileve that the environment where this panic occured had golang with this bug fixed.

Unfortunatelly, only helpfull information from the stack is that sizeclass of the object is 16 and most-likely there is an list of them.

Recomendation - let's update golang to the topmost, which I bielive was already done and see how it goes.

dkropachev avatar Apr 30 '23 23:04 dkropachev

happened again this week:

Installation details

Kernel Version: 5.15.0-1035-aws Scylla version (or git commit hash): 5.3.0~dev-20230425.124153d43915 with build-id c084d51a610cd0c9956abf5bed57fbd54301731c

Cluster size: 3 nodes (i3.4xlarge)

Scylla Nodes used in this run:

  • gemini-1tb-10h-master-oracle-db-node-79ca751e-1 (34.235.125.229 | 10.12.8.35) (shards: 14)
  • gemini-1tb-10h-master-db-node-79ca751e-4 (3.208.23.71 | 10.12.10.98) (shards: 14)
  • gemini-1tb-10h-master-db-node-79ca751e-3 (184.73.115.186 | 10.12.9.124) (shards: 14)
  • gemini-1tb-10h-master-db-node-79ca751e-2 (35.172.212.11 | 10.12.10.207) (shards: 14)
  • gemini-1tb-10h-master-db-node-79ca751e-1 (54.227.120.111 | 10.12.10.220) (shards: 14)

OS / Image: ami-0594293f3fa584dd4 (aws: us-east-1)

Test: gemini-1tb-10h Test id: 79ca751e-dbbd-428e-8ce6-f7b00447eb7d Test name: scylla-master/gemini-/gemini-1tb-10h Test config file(s):

Logs and commands
  • Restore Monitor Stack command: $ hydra investigate show-monitor 79ca751e-dbbd-428e-8ce6-f7b00447eb7d
  • Restore monitor on AWS instance using Jenkins job
  • Show all stored logs command: $ hydra investigate show-logs 79ca751e-dbbd-428e-8ce6-f7b00447eb7d

Logs:

Jenkins job URL

fruch avatar May 02 '23 08:05 fruch