scylla-cluster-tests
SCT raised a CoreDumpEvent, but the coredump itself was not found anywhere
Installation details
Kernel version: 5.4.0-1035-aws
Scylla version (or git commit hash): 4.6.dev-0.20210613.846f0bd16e4 with build-id 77ebbc518e4fd9560d3993067706780031d4ee26
Cluster size: 4 nodes (i3.4xlarge)
Scylla running with shards number (live nodes):
longevity-200gb-48h-verify-limited--db-node-eadde21f-1 (13.51.156.141 | 10.0.1.212): 14 shards
longevity-200gb-48h-verify-limited--db-node-eadde21f-2 (13.51.159.161 | 10.0.3.56): 14 shards
longevity-200gb-48h-verify-limited--db-node-eadde21f-5 (13.49.65.153 | 10.0.1.23): 14 shards
longevity-200gb-48h-verify-limited--db-node-eadde21f-7 (13.51.48.88 | 10.0.1.0): 14 shards
Scylla running with shards number (terminated nodes):
longevity-200gb-48h-verify-limited--db-node-eadde21f-4 (13.48.24.246 | 10.0.1.125): 14 shards
longevity-200gb-48h-verify-limited--db-node-eadde21f-3 (13.51.55.116 | 10.0.3.249): 14 shards
longevity-200gb-48h-verify-limited--db-node-eadde21f-6 (13.53.182.66 | 10.0.1.68): 14 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0efd9637b9940c9b5 (aws: eu-north-1)
Test: longevity-200gb-48h
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Issue description
There is an event:
2021-06-19 08:05:29.654: (CoreDumpEvent Severity.ERROR) period_type=not-set event_id=3a31784f-c58c-4340-85b4-73e75293ae8b node=Node longevity-200gb-48h-verify-limited--db-node-eadde21f-2 [13.51.159.161 | 10.0.3.56] (seed: False)
There is a matching entry in the node's coredumps.info:
PID: 96645 (scylla)
UID: 113 (scylla)
GID: 119 (scylla)
Signal: 11 (SEGV)
Timestamp: Sat 2021-06-19 08:05:03 UTC (20h ago)
Command Line: /usr/bin/scylla --blocked-reactor-notify-ms 100 --abort-on-lsa-bad-alloc 1 --abort-on-seastar-bad-alloc --abort-on-internal-error 1 --abort-on-ebadf 1 --enable-sstable-key-validation 1 --log-to-syslog 1 --log-to-stdout 0 --default-log-level info --network-stack posix --io-properties-file=/etc/scylla.d/io_properties.yaml --cpuset 1-7,9-15 --lock-memory=1
Executable: /opt/scylladb/libexec/scylla
Control Group: /scylla.slice/scylla-server.slice/scylla-server.service
Unit: scylla-server.service
Slice: scylla-server.slice
Boot ID: 1836c9b98e11461094c09b1fe93491d2
Machine ID: 0d278baa2bee456599166e7a3d1d8f38
Hostname: longevity-200gb-48h-verify-limited--db-node-eadde21f-2
Storage: none
Message: Process 96645 (scylla) of user 113 dumped core.
But where is the core? And why did it happen? I see that the node's log timestamps jumped back in time:
2021-06-19T08:03:55+00:00 longevity-200gb-48h-verify-limited--db-node-eadde21f-2 !INFO | sshd[286855]: pam_unix(sshd:session): session opened for user scyllaadm by (uid=0)
2021-06-19T07:55:32+00:00 longevity-200gb-48h-verify-limited--db-node-eadde21f-2 !INFO | systemd[1]: session-3795.scope: Succeeded.
then forward:
2021-06-19T07:57:33+00:00 longevity-200gb-48h-verify-limited--db-node-eadde21f-2 !INFO | scylla: [shard 8] compaction - [Compact keyspace1.standard1 0128fe40-d0d4-11eb-9874-457e9b14cbb1] Compacting [/var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-349770-big-Data.db:level=0:origin=memtable, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348216-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348118-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348230-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348160-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348202-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348132-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348104-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348188-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348174-big-Data.db:level=1:origin=compaction, /var/lib/scylla/data/keyspace1/standard1-3d0c3510cfe611eb96c70f5316c4eada/md-348146-big-Data.db:level=1:origin=compaction, ]
2021-06-19T08:05:41+00:00 longevity-200gb-48h-verify-limited--db-node-eadde21f-2 !INFO | systemd-logind[674]: Removed session 3889.
and then backward again:
2021-06-19T08:21:56+00:00 longevity-200gb-48h-verify-limited--db-node-eadde21f-2 !INFO | systemd[1]: Started Session 4047 of user scyllaadm.
2021-06-19T08:03:55+00:00 longevity-200gb-48h-verify-limited--db-node-eadde21f-2 !INFO | systemd-logind[674]: New session 3872 of user scyllaadm.
Restore Monitor Stack command: $ hydra investigate show-monitor eadde21f-ad93-476f-a546-842a4fea2708
Show all stored logs command: $ hydra investigate show-logs eadde21f-ad93-476f-a546-842a4fea2708
Test id: eadde21f-ad93-476f-a546-842a4fea2708
Logs:
- grafana - https://cloudius-jenkins-test.s3.amazonaws.com/eadde21f-ad93-476f-a546-842a4fea2708/20210620_042523/grafana-screenshot-longevity-200gb-48h-scylla-per-server-metrics-nemesis-20210620_042932-longevity-200gb-48h-verify-limited--monitor-node-eadde21f-1.png
- db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/eadde21f-ad93-476f-a546-842a4fea2708/20210620_043349/db-cluster-eadde21f.zip
- loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/eadde21f-ad93-476f-a546-842a4fea2708/20210620_043349/loader-set-eadde21f.zip
- monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/eadde21f-ad93-476f-a546-842a4fea2708/20210620_043349/monitor-set-eadde21f.zip
- sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/eadde21f-ad93-476f-a546-842a4fea2708/20210620_043349/sct-runner-eadde21f.zip
It happened during the Enospc nemesis:
2021-06-19 08:04:45.492: (DisruptionEvent Severity.NORMAL) period_type=not-set event_id=461b9da3-238b-4fe2-ac36-8d1d9a2ecc68: type=Enospc subtype=start target_node=Node longevity-200gb-48h-verify-limited--db-node-eadde21f-2 [13.51.159.161 | 10.0.3.56] (seed: False) duration=None
2021-06-19 08:05:29.654: (CoreDumpEvent Severity.ERROR) period_type=not-set event_id=3a31784f-c58c-4340-85b4-73e75293ae8b node=Node longevity-200gb-48h-verify-limited--db-node-eadde21f-2 [13.51.159.161 | 10.0.3.56] (seed: False)
2021-06-19 08:07:17.676: (PrometheusAlertManagerEvent Severity.WARNING) period_type=not-set event_id=50170968-a93c-4f54-880a-a5c36244b0ff: alert_name=InstanceDown type=start start=2021-06-19T08:07:09.591Z end=2021-06-19T08:11:09.591Z description=10.0.3.56 has been down for more than 30 seconds. updated=2021-06-19T08:07:09.647Z state=active fingerprint=6aa989b420871186 labels={'alertname': 'InstanceDown', 'instance': '10.0.3.56', 'job': 'scylla', 'monitor': 'scylla-monitor', 'severity': '2'}
2021-06-19 08:07:17.677: (PrometheusAlertManagerEvent Severity.WARNING) period_type=not-set event_id=24103eb8-cea3-41cc-aeb2-392142cdd329: alert_name=DiskFull type=start start=2021-06-19T08:07:09.591Z end=2021-06-19T08:11:09.591Z description=10.0.3.56 has less than 1% free disk space. updated=2021-06-19T08:07:09.651Z state=active fingerprint=d1f83e9ba67c51f7 labels={'alertname': 'DiskFull', 'device': '/dev/md0', 'fstype': 'xfs', 'instance': '10.0.3.56', 'job': 'node_exporter', 'monitor': 'scylla-monitor', 'mountpoint': '/var/lib/scylla', 'severity': '4'}
2021-06-19 08:08:30.866: (FullScanEvent Severity.WARNING) period_type=not-set event_id=aab6773a-9c85-4ad2-9e00-f6b26041a268: type=finish select_from=keyspace1.standard1 on db_node=10.0.1.212 message=Error from server: code=1200 [Coordinator node timed out waiting for replica nodes' responses] message="Operation timed out for keyspace1.standard1 - received only 0 responses from 1 CL=ONE." info={'consistency': 'ONE', 'required_responses': 1, 'received_responses': 0}
2021-06-18 22:22:19.000: (DatabaseLogEvent Severity.WARNING) period_type=one-time event_id=3d8b19f1-02b3-42e8-828d-86bc7c5dc737: type=CLIENT_DISCONNECT regex=\!INFO.*cql_server - exception while processing connection:.* line_number=293999 node=Node longevity-200gb-48h-verify-limited--db-node-eadde21f-1 [13.51.156.141 | 10.0.1.212] (seed: True)
2021-06-18T22:22:19+00:00 longevity-200gb-48h-verify-limited--db-node-eadde21f-1 !INFO | scylla: [shard 4] cql_server - exception while processing connection: std::system_error (error GnuTLS:-10, The specified session has been invalidated for some reason.)
2021-06-19 08:11:13.645: (PrometheusAlertManagerEvent Severity.WARNING) period_type=not-set event_id=95910ea2-d816-482e-ae23-383c2cdc4332: alert_name=InstanceDown type=end start=2021-06-19T08:07:09.591Z end=2021-06-19T08:13:09.591Z description=10.0.3.56 has been down for more than 30 seconds. updated=2021-06-19T08:09:09.644Z state=active fingerprint=6aa989b420871186 labels={'alertname': 'InstanceDown', 'instance': '10.0.3.56', 'job': 'scylla', 'monitor': 'scylla-monitor', 'severity': '2'}
2021-06-19 08:11:13.646: (PrometheusAlertManagerEvent Severity.WARNING) period_type=not-set event_id=5383e5bb-57b3-4244-9372-7b7e82c314cf: alert_name=DiskFull type=end start=2021-06-19T08:07:09.591Z end=2021-06-19T08:13:09.591Z description=10.0.3.56 has less than 1% free disk space. updated=2021-06-19T08:09:09.648Z state=active fingerprint=d1f83e9ba67c51f7 labels={'alertname': 'DiskFull', 'device': '/dev/md0', 'fstype': 'xfs', 'instance': '10.0.3.56', 'job': 'node_exporter', 'monitor': 'scylla-monitor', 'mountpoint': '/var/lib/scylla', 'severity': '4'}
2021-06-19 08:12:16.738: (PrometheusAlertManagerEvent Severity.WARNING) period_type=not-set event_id=0ab516d2-0032-4f67-9274-8a6ea79a6ad6: alert_name=restart type=start start=2021-06-19T08:12:09.591Z end=2021-06-19T08:16:09.591Z description=Node restarted updated=2021-06-19T08:12:09.753Z state=active fingerprint=0002ec29f7b65adf labels={'alertname': 'restart', 'instance': '10.0.3.56', 'job': 'scylla', 'monitor': 'scylla-monitor', 'severity': '1', 'shard': '0'}
2021-06-19 08:13:31.718: (FullScanEvent Severity.NORMAL) period_type=not-set event_id=f24b7900-400e-4b3e-a355-6449e79f8778: type=start select_from=keyspace1.standard1 on db_node=10.0.1.212
2021-06-19 08:13:41.986: (DisruptionEvent Severity.NORMAL) period_type=not-set event_id=c87a0548-8000-4095-8eaa-f5c9873a9cbb: type=Enospc subtype=end target_node=Node longevity-200gb-48h-verify-limited--db-node-eadde21f-2 [13.51.159.161 | 10.0.3.56] (seed: False) duration=535
So this is probably that old issue where we were never able to report the core: there isn't enough disk space to dump it while we are testing ENOSPC (the "Storage: none" line in the coredump info above is consistent with that).
The same, or something very similar, happened here too:
Installation details
Kernel version: 5.4.0-1035-aws
Scylla version (or git commit hash): 4.6.dev-0.20210613.846f0bd16e4 with build-id 77ebbc518e4fd9560d3993067706780031d4ee26
Cluster size: 6 nodes (i3.4xlarge)
Scylla running with shards number (live nodes):
longevity-tls-50gb-3d-master-db-node-b2ffd3dd-1 (13.51.156.177 | 10.0.3.204): 14 shards
longevity-tls-50gb-3d-master-db-node-b2ffd3dd-2 (13.51.241.70 | 10.0.0.131): 14 shards
longevity-tls-50gb-3d-master-db-node-b2ffd3dd-3 (13.51.233.160 | 10.0.0.167): 14 shards
longevity-tls-50gb-3d-master-db-node-b2ffd3dd-4 (13.51.64.95 | 10.0.0.69): 14 shards
longevity-tls-50gb-3d-master-db-node-b2ffd3dd-5 (13.48.71.164 | 10.0.0.207): 14 shards
longevity-tls-50gb-3d-master-db-node-b2ffd3dd-6 (13.53.38.115 | 10.0.1.111): 14 shards
OS (RHEL/CentOS/Ubuntu/AWS AMI): ami-0efd9637b9940c9b5 (aws: eu-north-1)
Test: longevity-50gb-3days
Test name: longevity_test.LongevityTest.test_custom_time
Test config file(s):
Restore Monitor Stack command: $ hydra investigate show-monitor b2ffd3dd-f590-43f8-8656-c2b87081b576
Show all stored logs command: $ hydra investigate show-logs b2ffd3dd-f590-43f8-8656-c2b87081b576
Test id: b2ffd3dd-f590-43f8-8656-c2b87081b576
Logs:
- grafana - https://cloudius-jenkins-test.s3.amazonaws.com/b2ffd3dd-f590-43f8-8656-c2b87081b576/20210618_042314/grafana-screenshot-longevity-50gb-3days-scylla-per-server-metrics-nemesis-20210618_042724-longevity-tls-50gb-3d-master-monitor-node-b2ffd3dd-1.png
- db-cluster - https://cloudius-jenkins-test.s3.amazonaws.com/b2ffd3dd-f590-43f8-8656-c2b87081b576/20210618_043147/db-cluster-b2ffd3dd.zip
- loader-set - https://cloudius-jenkins-test.s3.amazonaws.com/b2ffd3dd-f590-43f8-8656-c2b87081b576/20210618_043147/loader-set-b2ffd3dd.zip
- monitor-set - https://cloudius-jenkins-test.s3.amazonaws.com/b2ffd3dd-f590-43f8-8656-c2b87081b576/20210618_043147/monitor-set-b2ffd3dd.zip
- sct-runner - https://cloudius-jenkins-test.s3.amazonaws.com/b2ffd3dd-f590-43f8-8656-c2b87081b576/20210618_043147/sct-runner-b2ffd3dd.zip
It is expected behavior not to see a core during ENOSPC.
I don't know if I'd call it expected behavior; it's a known issue in SCT that never got fixed.
The only way to get it dumped is to have an external disk used only for this purpose, so that once there is a core dump there is free disk space to write it... but that has other implications and challenges... so I agree with @fruch, it is a known issue more than expected behavior.
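As a rough illustration of that idea (not SCT code): a helper on the node could format and mount a spare volume at systemd-coredump's storage path before the nemesis runs. The device name below is a placeholder assumption.

```python
# Hypothetical sketch, not SCT's implementation: give systemd-coredump its
# own disk so a core can still be written while the data disk is full.
import subprocess

COREDUMP_DIR = "/var/lib/systemd/coredump"
SPARE_DEVICE = "/dev/nvme2n1"  # assumption: an extra, otherwise unused volume

def mount_dedicated_coredump_volume() -> None:
    # Format the spare volume and mount it where systemd-coredump writes,
    # so ENOSPC elsewhere no longer prevents dumping the core.
    subprocess.run(["sudo", "mkfs.xfs", "-f", SPARE_DEVICE], check=True)
    subprocess.run(["sudo", "mkdir", "-p", COREDUMP_DIR], check=True)
    subprocess.run(["sudo", "mount", SPARE_DEVICE, COREDUMP_DIR], check=True)

if __name__ == "__main__":
    mount_dedicated_coredump_volume()
```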
What we can do is stream cores to S3 via curl and pick them up from there.
> What we can do is stream cores to S3 via curl and pick them up from there.

Is the OS able to stream to S3 instead of dumping it to disk?
> What we can do is stream cores to S3 via curl and pick them up from there.

Something like this: https://gist.github.com/hashbrowncipher/57dd3a52103cae02290ac65fae9f3422
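For reference, the shape of what that gist does: a kernel `core_pattern` pipe handler receives the core on stdin and streams it out compressed, never touching the local disk. A minimal sketch, assuming a pre-signed upload URL; the script path and URL are placeholders:

```python
#!/usr/bin/env python3
# Hypothetical core_pattern pipe handler (a sketch of the gist's idea):
# register it with
#   echo '|/usr/local/bin/core_to_s3.py %e %p %t' > /proc/sys/kernel/core_pattern
# The kernel then feeds the core image to this script's stdin.
import subprocess
import sys

def main() -> None:
    exe, pid, timestamp = sys.argv[1:4]
    url = f"https://upload.example.com/core.{exe}.{pid}.{timestamp}.gz"  # placeholder
    # Compress the core stream on the fly...
    gz = subprocess.Popen(["gzip", "-c"], stdin=sys.stdin.buffer,
                          stdout=subprocess.PIPE)
    # ...and PUT it straight to object storage, so no local disk space is needed.
    subprocess.run(["curl", "--fail", "--request", "PUT", "--upload-file", "-", url],
                   stdin=gz.stdout, check=True)
    gz.stdout.close()
    gz.wait()

if __name__ == "__main__":
    main()
```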
Issue description
I also faced this problem. The core dump was uploaded and the download_instructions were shared:
< t:2022-05-26 13:36:43,019 f:base.py l:142 c:RemoteLibSSH2CmdRunner p:DEBUG > Command "sudo curl --request PUT --upload-file '/var/lib/systemd/coredump/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz' 'upload.scylladb.com/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz'" finished with status 0
< t:2022-05-26 13:36:43,019 f:coredump.py l:212 c:sdcm.cluster_aws p:INFO > Node longevity-mv-si-4d-2022-1-db-node-81fb644c-8 [16.171.47.121 | 10.0.3.115] (seed: False): CoredumpExportSystemdThread: You can download it by https://storage.cloud.google.com/upload.scylladb.com/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz (available for ScyllaDB employee)
< t:2022-05-26 13:36:43,022 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > corefile_url=https://storage.cloud.google.com/upload.scylladb.com/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz
< t:2022-05-26 13:36:43,022 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > Storage: /var/lib/systemd/coredump/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000
< t:2022-05-26 13:36:43,022 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > download_instructions=gsutil cp gs://upload.scylladb.com/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz .
< t:2022-05-26 13:36:43,022 f:file_logger.py l:101 c:sdcm.sct_events.file_logger p:INFO > gunzip /var/lib/systemd/coredump/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz
But developers complained that they couldn't find it. Indeed, it can't be found:
juliayakovlev@juliayakovlev-Latitude-5421 ~/Downloads $ gsutil cp gs://upload.scylladb.com/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz .
CommandException: No URLs matched: gs://upload.scylladb.com/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz
And when I search for it manually in GCE, it's not found. With this URL it's not found either: https://storage.cloud.google.com/upload.scylladb.com/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz
No such object: upload.scylladb.com/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz
Installation details
Kernel Version: 5.13.0-1022-aws
Scylla version (or git commit hash): 2022.1~rc5-20220515.6a1e89fbb with build-id 5cecadda59974548befb4305363bf374631fc3e1
Cluster size: 5 nodes (i3.4xlarge)
Scylla Nodes used in this run:
- longevity-mv-si-4d-2022-1-db-node-81fb644c-9 (16.171.58.120 | 10.0.2.39) (shards: 14)
- longevity-mv-si-4d-2022-1-db-node-81fb644c-8 (16.171.47.121 | 10.0.3.115) (shards: 14)
- longevity-mv-si-4d-2022-1-db-node-81fb644c-7 (13.48.104.138 | 10.0.2.234) (shards: 14)
- longevity-mv-si-4d-2022-1-db-node-81fb644c-6 (13.51.6.132 | 10.0.3.179) (shards: 14)
- longevity-mv-si-4d-2022-1-db-node-81fb644c-5 (13.51.69.58 | 10.0.3.204) (shards: 14)
- longevity-mv-si-4d-2022-1-db-node-81fb644c-4 (13.51.170.112 | 10.0.3.196) (shards: 14)
- longevity-mv-si-4d-2022-1-db-node-81fb644c-3 (16.170.227.40 | 10.0.2.134) (shards: 14)
- longevity-mv-si-4d-2022-1-db-node-81fb644c-2 (13.49.224.82 | 10.0.3.90) (shards: 14)
- longevity-mv-si-4d-2022-1-db-node-81fb644c-1 (16.16.65.106 | 10.0.0.13) (shards: 14)
OS / Image: ami-0838dc54c055ad05a (aws: eu-north-1)
Test: longevity-mv-si-4days-test
Test id: 81fb644c-b1ac-42de-bd54-0ae2c4889180
Test name: enterprise-2022.1/longevity/longevity-mv-si-4days-test
Test config file(s):
Restore Monitor Stack command:
$ hydra investigate show-monitor 81fb644c-b1ac-42de-bd54-0ae2c4889180
Restore monitor on AWS instance using Jenkins job
Show all stored logs command:
$ hydra investigate show-logs 81fb644c-b1ac-42de-bd54-0ae2c4889180
Logs:
- db-cluster-81fb644c.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/81fb644c-b1ac-42de-bd54-0ae2c4889180/20220526_143745/db-cluster-81fb644c.tar.gz
- monitor-set-81fb644c.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/81fb644c-b1ac-42de-bd54-0ae2c4889180/20220526_143745/monitor-set-81fb644c.tar.gz
- loader-set-81fb644c.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/81fb644c-b1ac-42de-bd54-0ae2c4889180/20220526_143745/loader-set-81fb644c.tar.gz
- sct-runner-81fb644c.tar.gz - https://cloudius-jenkins-test.s3.amazonaws.com/81fb644c-b1ac-42de-bd54-0ae2c4889180/20220526_143745/sct-runner-81fb644c.tar.gz
@juliayakovlev
Seems like that node ran out of disk space, and the files were cleared before we had a chance to upload them:
9440150:< t:2022-05-26 13:34:28,424 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-05-26T13:34:28+00:00 longevity-mv-si-4d-2022-1-db-node-81fb644c-8 ! INFO | Removed old coredump core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.
9456170:< t:2022-05-26 13:39:15,844 f:db_log_reader.py l:113 c:sdcm.db_log_reader p:DEBUG > 2022-05-26T13:39:15+00:00 longevity-mv-si-4d-2022-1-db-node-81fb644c-8 ! INFO | Removed old coredump core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz.
@fruch but the upload finished. Maybe we need to copy the coredump to the runner immediately and upload it from there. I understand it's not a great solution, but it happened again and I can't give the developers all the data they need.
@juliayakovlev upload failed:
9446413:< t:2022-05-26 13:36:42,519 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > <h2>The requested URL <code>/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000/core.scylla.113.14c3edeca8594ea09fbe11bc49b1d7ae.2709.1653571792000000000000.gz</code> was not found on this server.</h2>
9446414-< t:2022-05-26 13:36:42,519 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > <h2></h2>
9446415-< t:2022-05-26 13:36:42,519 f:base.py l:228 c:RemoteLibSSH2CmdRunner p:DEBUG > </body></html>
That the return status code was 0 doesn't mean the upload succeeded; curl exits 0 on an HTTP error response unless --fail is passed.
@fruch we need to handle it and raise an exception.
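A minimal sketch of such handling (the helper and its signature are illustrative, not SCT's actual coredump code): run curl with --fail so HTTP errors yield a non-zero exit code, and raise instead of reporting "finished with status 0":

```python
# Illustrative only: make the coredump upload fail loudly.
import subprocess

def upload_coredump(local_path: str, upload_url: str) -> None:
    # --fail makes curl exit non-zero on an HTTP >= 400 response, which it
    # otherwise treats as a successful transfer (exit status 0).
    result = subprocess.run(
        ["sudo", "curl", "--fail", "--request", "PUT",
         "--upload-file", local_path, upload_url],
        capture_output=True, text=True,
    )
    if result.returncode != 0:
        raise RuntimeError(
            f"coredump upload failed (rc={result.returncode}): {result.stderr.strip()}")
```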
Julia, I've found the issue; it's not related to what's described in this issue.
Seems like we need to change the URL we are using.
And yes, we should fail on curl failures (we had a retry, a long time ago).
@fruch do you know who changed that and why?
@fruch as I commented on the PR, I don't see any reference for such a change in scylla-docs. Moreover, we had several coredumps in the last few weeks and we didn't hit that issue.
There's also possibility to stream directly to s3 without using curl/any specific binary on db nodes. See solution in (still in PR): https://github.com/scylladb/scylla-cluster-tests/pull/5122/files#diff-9039db3d9ae0506cbdbaa9ac8bac7bd2626fc58658769fe24c110af73732d40dR43
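A rough sketch of that direction (placeholder names throughout; the linked PR may do this differently): the SCT runner pulls the core over SSH and hands the stream straight to boto3, so the DB node needs no extra binary, and the core doesn't have to survive long on the node either, which would also cover the "copy it off immediately" idea above.

```python
# Sketch under assumptions: boto3 credentials on the runner, paramiko SSH
# access to the node; host, bucket, and key are placeholders.
import boto3
import paramiko

def stream_core_to_s3(host: str, remote_path: str, bucket: str, key: str) -> None:
    ssh = paramiko.SSHClient()
    ssh.set_missing_host_key_policy(paramiko.AutoAddPolicy())
    ssh.connect(host, username="scyllaadm")
    try:
        # Read the core on the node and pipe the SSH stream into a multipart
        # S3 upload; nothing is staged on the runner's disk either.
        _, stdout, _ = ssh.exec_command(f"sudo cat '{remote_path}'")
        boto3.client("s3").upload_fileobj(stdout, bucket, key)
    finally:
        ssh.close()
```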
This issue is stale because it has been open 2 years with no activity. Remove stale label or comment or this will be closed in 2 days.