scylladb
Segmentation fault on shard 1
Upload UUID = 94a706fe-0906-4f16-8dd8-f3af4a578a60
Installation details
Scylla version: 5.0.1
Cluster size: 9
OS (RHEL/CentOS/Ubuntu/AWS AMI): Ubuntu
I stopped Scylla, deleted some corrupt files, and restarted it. The node comes up for about one minute and then crashes, generating the core dump file with the UUID mentioned above. Looking into the logs, I can see the following:
Sep 07 10:06:05 hostname scylla[2625116]: Segmentation fault on shard 1.
Sep 07 10:06:05 hostname scylla[2625116]: Backtrace:
Sep 07 10:06:05 hostname scylla[2625116]: 0x42c5a18
Sep 07 10:06:05 hostname scylla[2625116]: 0x42f64f6
Sep 07 10:06:05 hostname scylla[2625116]: 0x4945a51
Sep 07 10:06:05 hostname scylla[2625116]: 0x7ffff70e2a1f
Sep 07 10:06:05 hostname scylla[2625116]: 0x1b12240
Sep 07 10:06:05 hostname scylla[2625116]: 0x42d6974
Sep 07 10:06:05 hostname scylla[2625116]: 0x42d7d47
Sep 07 10:06:05 hostname scylla[2625116]: 0x42f6a65
Sep 07 10:06:05 hostname scylla[2625116]: 0x42aa8aa
Sep 07 10:06:05 hostname scylla[2625116]: /opt/scylladb/libreloc/libpthread.so.0+0x9298
Sep 07 10:06:05 hostname scylla[2625116]: /opt/scylladb/libreloc/libc.so.6+0x100352
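For anyone who wants to decode a backtrace like the one above, the frames first have to be separated from the journal prefixes. The following is a minimal sketch, not an official Scylla tool: `extract_frames` is a hypothetical helper name, and it assumes the frames look exactly like the ones in this log (bare hex addresses, or a library path followed by `+0x...`).

```shell
# Print one backtrace frame per line from journal text read on stdin.
# Matches bare addresses (0x42c5a18) and library+offset frames
# (/opt/scylladb/libreloc/libc.so.6+0x100352). Works whether the log
# is one entry per line or several entries run together on one line.
extract_frames() {
    grep -oE '(/[^ ]+\+)?0x[0-9a-f]+'
}

# Example (assumes the log was saved to a file named scylla.log):
#   extract_frames < scylla.log
```

The resulting addresses are only meaningful against the exact binary that crashed, which is why the build id matters when decoding them.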
The issue was due to a corrupt sstable file. Deleting the files of that sstable seems to have resolved the issue. But would it be worth handling those situations better?
Corrupted sstables are something we would like to look into. Can you please upload the corrupted sstables as well?
The table was "businessobjectchanges2", but we don't have the deleted files anymore.
The core looks wrong. Only 11M in size (decompressed).
@MokhlesHm could you please resend?
@MokhlesHm how did you figure out the files were corrupted? A message in the log? Please share logs too.
@raphaelsc
Attached are the logs that led us to the corrupt file.
About the core size: that was the actual size of the core dump. Our assumption is that, because Scylla had just started and crashed quickly, it had not loaded much data.
Perhaps ulimit -c is configured with a small value? I don't think the generated core is valid. Even if memory is uninitialized, it should still be contained in the core. Then compression would significantly reduce its size. But 11M is the uncompressed size.
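One thing worth keeping in mind here: `ulimit -c` typed into a login shell reports that shell's limit, which is not necessarily the limit of a service started by the init system. A rough sketch of how to read the limit that applies to a running process, and where the kernel sends cores, is shown below. The helper names `core_limit_of` and `core_pattern` are made up for this example; the `/proc` paths are standard Linux.

```shell
# Print the soft "Max core file size" limit of the process with the given PID.
# The value is either a number or the word "unlimited".
core_limit_of() {
    awk '/^Max core file size/ { print $5 }' "/proc/$1/limits"
}

# Print the kernel's core pattern. A leading "|" means cores are piped
# to a helper program instead of being written directly to a file.
core_pattern() {
    cat /proc/sys/kernel/core_pattern
}

# Example (assumes the server process is named "scylla", as in the log):
#   core_limit_of "$(pidof scylla)"
#   core_pattern
```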
Full logs (as a text file) would help more. I want to extract the build id for decoding that backtrace. Please upload the logs to the UUID you shared.
You're running on Ubuntu. Which file system? Which kernel version?
Next time you bump into corrupted files, please move them aside rather than copying them: we want to preserve the file system inode (for example, we may want to check the inode extents), and the files are helpful for understanding the root cause.
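As an illustration of the "move, don't copy" advice, here is a small sketch that moves a file into a quarantine directory and confirms the inode number did not change. The function name `quarantine` and the destination directory are hypothetical, and it assumes GNU `stat`. The destination must be on the same file system as the source, since a move across file systems falls back to copy and delete and produces a new inode.

```shell
# Move a file into a quarantine directory and verify the inode is unchanged.
# Usage: quarantine <file> <quarantine-dir>
quarantine() {
    src=$1
    dest_dir=$2
    before=$(stat -c %i "$src") || return 1
    mkdir -p "$dest_dir" || return 1
    mv -- "$src" "$dest_dir/" || return 1
    after=$(stat -c %i "$dest_dir/$(basename "$src")") || return 1
    if [ "$before" = "$after" ]; then
        echo "kept inode $after"
    else
        # Happens when the destination is on a different file system.
        echo "inode changed: $before -> $after" >&2
        return 1
    fi
}
```

An sstable consists of several component files, so the same move would be repeated for each file belonging to the affected sstable.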
$ eu-unstrip -n --core=./core.dsc_host.996.bce0fd5b34ea4bc5b47da497430692aa.2643348.1662554733000000000000
0x400000+0x2c4000 33825d2dbce1af4d7012c8bba2f4c752c6e1dc56@0x400284 - - /opt/dsc/bin/dsc_host
0x7ffff74de000+0x1f2000 1878e6b475720c7c51969e69ab2d276fae6d1dee@0x7ffff74de380 - - /usr/lib/x86_64-linux-gnu/libc-2.31.so
0x7ffff79c0000+0x23000 7b4536f41cdaa5888408e82d0836e33dcf436466@0x7ffff79c0348 - - /usr/lib/x86_64-linux-gnu/libpthread-2.31.so
0x7ffff7cf2000+0x2d5000 bdcdac951c8ebd463c6b784372eca499e6506283@0x7ffff7cf21a0 - - /opt/omi/lib/libmi.so
0x7ffff7fcd000+0x1000 c65b0159458c881cf1b36dec99cd95049f31a722@0x7ffff7fcd540 . - linux-vdso.so.1
0x7ffff7fcf000+0x30000 4587364908de169dec62ffa538170118c1c3a078@0x7ffff7fcf2d8 - - /usr/lib/x86_64-linux-gnu/ld-2.31.so
Looks like that core belongs to dsc_host. Do you have the correct one for Scylla crash? Perhaps ulimit -c configuration prevented scylla's core from being generated. Please confirm.
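The same check can be scripted before uploading a core. The sketch below only parses text in the format of the `eu-unstrip -n` output shown above (address range, build id, file, debug file, module name); `list_modules` and `core_has_module` are hypothetical helper names, not part of elfutils.

```shell
# Read `eu-unstrip -n` output on stdin; print "<build-id> <module>" per line.
list_modules() {
    awk '{ split($2, id, "@"); print id[1], $5 }'
}

# Succeed only if some module's base name equals the given binary name.
# Usage: eu-unstrip -n --core=<core file> | core_has_module scylla
core_has_module() {
    list_modules | awk -v want="$1" '
        { n = split($2, p, "/"); if (p[n] == want) found = 1 }
        END { exit !found }'
}
```

For the core in this thread, `core_has_module dsc_host` would succeed and `core_has_module scylla` would fail, which matches the conclusion that the file came from a different program.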
@raphaelsc ulimit -c is set to unlimited and should not prevent the core from being generated.
Then please check if the coredump for Scylla crash is available.
The only files available in the coredump directory are similar to the one I sent, with the same size, and they are generated during the crash.
@raphaelsc
You're running on Ubuntu. Which file system? Which kernel version?
Linux 5.11.0-1022-azure #23~20.04.1-Ubuntu x86_64 x86_64 x86_64 GNU/Linux
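The kernel version is covered by the line above; the file system type can be collected in a similar way. This is a minimal sketch assuming GNU `df`; `fs_type_of` is a made-up helper name, and the path in the usage comment is only an assumed data directory, so substitute the directory that actually holds the sstables.

```shell
# Print the file system type (for example xfs or ext4) backing a path.
fs_type_of() {
    df --output=fstype "$1" | tail -n 1
}

# Example (the data directory path is an assumption):
#   uname -r
#   fs_type_of /var/lib/scylla
```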
@raphaelsc ping
@MokhlesHm - Unfortunately I am unable to make any progress without a coredump that actually belongs to Scylla core. Using eu-unstrip on the core provided showed it belongs to a different program that segfaulted.
Closing for the time being. @MokhlesHm - if you can get a core dump, that'd be great!