Bug(tcmalloc): ALL nodes coredump when doing Bulkload
Bug Report

- What did you do? While doing a bulk load (in the download-sst-file stage), any action that needs to restart ONE node may cause ALL nodes to coredump.
- What did you see? There are three kinds of coredumps on different nodes.

Type one:

```
(gdb) #0 0x00007fc9093669ef in signalHandler(int, siginfo*, void*) ()
from /opt/soft/openjdk1.8.0/jre/lib/amd64/server/libjvm.so
#1 <signal handler called>
#2 0x00007fc9072ea793 in dsn::utils::filesystem::get_normalized_path (
path=..., npath=...)
at /home/work/temp/pegasus/src/rdsn/src/utils/filesystem.cpp:116
#3 0x00007fc9072ebb4d in dsn::utils::filesystem::path_combine (path1=...,
path2=...)
at /home/work/temp/pegasus/src/rdsn/src/utils/filesystem.cpp:618
#4 0x00007fc9085696e7 in dsn::replication::replica_bulk_loader::download_sst_file (this=0x5a947a40, remote_dir=..., local_dir=...,
file_index=<optimized out>, fs=0x275f74d0)
at /home/work/temp/pegasus/src/rdsn/src/replica/bulk_load/replica_bulk_loader.cpp:460
#5 0x00007fc9086b7811 in dsn::task::exec_internal (
this=this@entry=0xa95362d0)
at /home/work/temp/pegasus/src/rdsn/src/runtime/task/task.cpp:176
#6 0x00007fc9086ccec2 in dsn::task_worker::loop (this=0x24e31e0)
at /home/work/temp/pegasus/src/rdsn/src/runtime/task/task_worker.cpp:224
#7 0x00007fc9086cd040 in dsn::task_worker::run_internal (this=0x24e31e0)
at /home/work/temp/pegasus/src/rdsn/src/runtime/task/task_worker.cpp:204
#8 0x00007fc90734ba1f in execute_native_thread_routine ()
from /home/work/app/pegasus/alsgsrv-monetization-master/replica/package/bin/libdsn_utils.so
#9 0x00007fc905156dc5 in start_thread () from /lib64/libpthread.so.0
#10 0x00007fc90365573d in clone () from /lib64/libc.so.6
(gdb) quit
```
Type two:

```
(gdb) #0 0x00007f8d106371d7 in raise () from /lib64/libc.so.6
#1 0x00007f8d106388c8 in abort () from /lib64/libc.so.6
#2 0x00007f8d12435dbb in tcmalloc::Log (mode=mode@entry=tcmalloc::kCrash,
filename=filename@entry=0x7f8d1244c14e "src/tcmalloc.cc",
line=line@entry=332, a=..., b=..., c=..., d=...)
at src/internal_logging.cc:118
#3 0x00007f8d124281d9 in (anonymous namespace)::InvalidFree (
ptr=<optimized out>) at src/tcmalloc.cc:332
#4 0x00007f8d16adaef3 in _dl_update_slotinfo ()
from /lib64/ld-linux-x86-64.so.2
#5 0x00007f8d16aca136 in update_get_addr () from /lib64/ld-linux-x86-64.so.2
#6 0x00007f8d1438e71c in dsn::utils::filesystem::get_normalized_path (
path=..., npath=...)
at /home/work/temp/pegasus/src/rdsn/src/utils/filesystem.cpp:116
#7 0x00007f8d1438fb4d in dsn::utils::filesystem::path_combine (path1=...,
path2=...)
at /home/work/temp/pegasus/src/rdsn/src/utils/filesystem.cpp:618
#8 0x00007f8d1560d6e7 in dsn::replication::replica_bulk_loader::download_sst_file (this=0x35ba0c0, remote_dir=..., local_dir=...,
file_index=<optimized out>, fs=0x8090b1500)
at /home/work/temp/pegasus/src/rdsn/src/replica/bulk_load/replica_bulk_loader.cpp:460
#9 0x00007f8d1575b811 in dsn::task::exec_internal (
this=this@entry=0x8d3964f00)
at /home/work/temp/pegasus/src/rdsn/src/runtime/task/task.cpp:176
#10 0x00007f8d15770ec2 in dsn::task_worker::loop (this=0x1f32fd0)
at /home/work/temp/pegasus/src/rdsn/src/runtime/task/task_worker.cpp:224
#11 0x00007f8d15771040 in dsn::task_worker::run_internal (this=0x1f32fd0)
at /home/work/temp/pegasus/src/rdsn/src/runtime/task/task_worker.cpp:204
#12 0x00007f8d143efa1f in execute_native_thread_routine ()
from /home/work/app/pegasus/alsgsrv-monetization-master/replica/package/bin/libdsn_utils.so
#13 0x00007f8d121fadc5 in start_thread () from /lib64/libpthread.so.0
#14 0x00007f8d106f973d in clone () from /lib64/libc.so.6
(gdb) quit
```
Type three:

```
Program terminated with signal 6, Aborted.
(gdb) #0 0x00007fad388fa1d7 in raise () from /lib64/libc.so.6
#1 0x00007fad388fb8c8 in abort () from /lib64/libc.so.6
#2 0x00007fad3925ca8d in __gnu_cxx::__verbose_terminate_handler() ()
from /home/work/app/pegasus/alsgsrv-monetization-master/replica/package/bin/libstdc++.so.6
#3 0x00007fad3925abe6 in ?? ()
from /home/work/app/pegasus/alsgsrv-monetization-master/replica/package/bin/libstdc++.so.6
#4 0x00007fad3925ac13 in std::terminate() ()
from /home/work/app/pegasus/alsgsrv-monetization-master/replica/package/bin/libstdc++.so.6
#5 0x00007fad3c6b2a4e in execute_native_thread_routine ()
from /home/work/app/pegasus/alsgsrv-monetization-master/replica/package/bin/libdsn_utils.so
#6 0x00007fad3a4bddc5 in start_thread () from /lib64/libpthread.so.0
#7 0x00007fad389bc73d in clone () from /lib64/libc.so.6
(gdb) quit
```
- What version of Pegasus are you using? v2.4
The aforementioned phenomenon is one of the issues triggered by the bulk load download stage.
Phenomenon

- Doing a bulk load (in the download-sst-file stage), any action that needs to restart ONE node may cause ALL nodes to coredump.
- Missing bulk load files also cause many nodes to coredump.
- Phenomena 1 and 2 both make tcmalloc report "large alloc 2560917504".
Reason

After clear_bulk_load_states executes, download_sst_file tasks still remain and keep accessing the cleared state, which causes the phenomena above.
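In other words, the download tasks enqueued before the clear are never cancelled, and when a worker thread later executes one of them, it dereferences _metadata.files after it has been cleared. A minimal self-contained sketch of that lifecycle (the queue and names below are simplified stand-ins for rDSN's tasking model, not the real code):

```cpp
#include <functional>
#include <queue>
#include <string>
#include <vector>

// Simplified stand-ins for the real rDSN types (illustration only).
struct file_meta { std::string name; };
std::vector<file_meta> metadata_files = {{"88.sst"}, {"89.sst"}};
std::queue<std::function<void()>> task_queue; // stands in for the tasking worker queue

void download_sst_file(size_t file_index) {
    // Runs whenever a worker picks the task up -- possibly after the clear.
    const file_meta &f_meta = metadata_files[file_index]; // dangles if cleared
    (void)f_meta;
}

int main() {
    // 1. Bulk load enqueues one download task per sst file.
    for (size_t i = 0; i < metadata_files.size(); ++i)
        task_queue.push([i] { download_sst_file(i); });

    // 2. A node restart bumps the ballot, clear_bulk_load_states() runs and
    //    clears the metadata -- but the already-enqueued tasks are NOT cancelled.
    metadata_files.clear();

    // 3. The worker drains the queue: every remaining task now reads freed
    //    memory, which is undefined behavior.
    while (!task_queue.empty()) {
        task_queue.front()();
        task_queue.pop();
    }
}
```

Depending on what the freed memory happens to contain when a stale task runs, this shows up as the len <= 4086 assertion, an InvalidFree abort inside tcmalloc, or std::terminate, which matches the three trace types above.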
Case 1 With Phenomenon 1

Operation: we restart one node.

The ballot increases, and clear_bulk_load_states_if_needed() clears replica 88.5's _metadata.files at 15:17:56.753:

```
D2024-05-20 15:17:56.753 (1716189476753079718 146668) replica.replica13.0404000d0000005d: replica_config.cpp:819:update_local_configuration(): [email protected]:27101: update ballot to init file from 3 to 4 OK
D2024-05-20 15:17:56.753 (1716189476753147052 146668) replica.replica13.0404000d0000005d:clear_bulk_load_states_if_needed(): [[email protected]:27101] prepare to clear bulk load states, current status = replication::bulk_load_status::BLS_DOWNLOADING
D2024-05-20 15:17:56.753 (1716189476753464144 146668) replica.replica13.0404000d0000005d: replica_config.cpp:1045:update_local_configuration(): [email protected]:27101: status change replication::partition_status::PS_INACTIVE @ 3 => replication::partition_status::PS_PRIMARY @ 4, pre(1, 0), app(0, 0), duration = 3 ms, replica_configuration(pid=88.5, ballot=4, primary=10.142.98.52:27101, status=3, learner_signature=0, pop_all=0, split_sync_to_child=0)
```

**But at 15:17:56.873, replica 88.5 is still downloading sst files, which causes the core dump.**

```
D2024-05-20 15:17:56.873 (1716189476873362400 146626) replica.default7.04010007000000ca: block_service_manager.cpp:181:download_file(): download file(/home/work/ssd2/pegasus/c3tst-performance2/replica/reps/88.5.pegasus/bulk_load/33.sst) succeed, file_size = 65930882, md5 = 7a4d3da9250f52b4e31095c1d7042c2f
D2024-05-20 15:17:58.348 (1716189478348326864 146626) replica.default7.04010007000000ca: replica_bulk_loader.cpp:479:download_sst_file(): [[email protected]:27101] download_sst_file remote_dir /user/s_pegasus/lpfsplit/c3tst-performance2/ingest_p32_10G/5 ,local_dir /home/work/ssd2/pegasus/c3tst-performance2/replica/reps/88.5.pegasus/bulk_load,f_meta.name 33.sst
```
Case 2 With Phenomenon 2

Operation: for app ingest_p4_10G, the bulk load files 88.sst, 89.sst, 90.sst and 93.sst of partition 1 are missing.

```
[general]
app_name : ingest_p4_10G
app_id : 95
partition_count : 4
max_replica_count : 3
[replicas]
pidx ballot replica_count primary secondaries
0 8 3/3 c3-hadoop-pegasus-tst-st01.bj:27101 [c3-hadoop-pegasus-tst-st03.bj:27101,c3-hadoop-pegasus-tst-st05.bj:27101]
1 7 3/3 c3-hadoop-pegasus-tst-st03.bj:27101 [c3-hadoop-pegasus-tst-st01.bj:27101,c3-hadoop-pegasus-tst-st02.bj:27101]
2 8 3/3 c3-hadoop-pegasus-tst-st04.bj:27101 [c3-hadoop-pegasus-tst-st03.bj:27101,c3-hadoop-pegasus-tst-st02.bj:27101]
3 3 3/3 c3-hadoop-pegasus-tst-st01.bj:27101 [c3-hadoop-pegasus-tst-st03.bj:27101,c3-hadoop-pegasus-tst-st04.bj:27101]
```
Primary replica

The primary replica failed to download file(88.sst) and stopped downloading all sst files:

```
log.1.txt:E2024-05-22 14:28:11.231 (1716359291231595252 102084) replica.default1.040100090000072b: replica_bulk_loader.cpp:520:download_sst_file(): [[email protected]:27101] failed to download file(88.sst), error = ERR_CORRUPTION
```

But meta tells it to continue downloading:

```
D2024-05-22 14:28:18.983 (1716359298983653491 102121) replica.replica2.04008ebc00010f3f: replica_bulk_loader.cpp:71:on_bulk_load(): [[email protected]:27101] receive bulk load request, remote provider = hdfs_zjy, remote_root_path = /user/s_pegasus/lpfsplit, cluster_name = c3tst-performance2, app_name = ingest_p4_10G, meta_bulk_load_status = replication::bulk_load_status::BLS_DOWNLOADING, local bulk_load_status = replication::bulk_load_status::BLS_DOWNLOADING
```

The primary replica reports the download progress to meta:

```
D2024-05-22 14:28:18.983 (1716359298983689828 102121) replica.replica2.04008ebc00010f3f: replica_bulk_loader.cpp:879:report_group_download_progress(): [[email protected]:27101] primary = 10.142.102.47:27101, download progress = 89%, status = ERR_CORRUPTION
D2024-05-22 14:28:18.983 (1716359298983703147 102121) replica.replica2.04008ebc00010f3f: replica_bulk_loader.cpp:892:report_group_download_progress(): [[email protected]:27101] secondary = 10.142.98.52:27101, download progress = 88%, status=ERR_OK
D2024-05-22 14:28:18.983 (1716359298983714700 102121) replica.replica2.04008ebc00010f3f: replica_bulk_loader.cpp:892:report_group_download_progress(): [[email protected]:27101] secondary = 10.142.97.9:27101, download progress = 88%, status=ERR_OK
```

Meta then says to stop downloading and to clear _metadata.files. However, not all download tasks were terminated successfully:

```
D2024-05-22 14:28:28.988 (1716359308988487559 102121) replica.replica2.04008ebc00010f46: replica_bulk_loader.cpp:71:on_bulk_load(): [[email protected]:27101] receive bulk load request, remote provider = hdfs_zjy, remote_root_path = /user/s_pegasus/lpfsplit, cluster_name = c3tst-performance2, app_name = ingest_p4_10G, meta_bulk_load_status = replication::bulk_load_status::BLS_FAILED, local bulk_load_status = replication::bulk_load_status::BLS_DOWNLOADING
```

At 14:28:29, a download_sst_file task still exists and accesses _metadata.files, causing the core dump:

```
D2024-05-22 14:28:29.529 (1716359309529341231 102089) replica.default6.04010000000007b5: replica_bulk_loader.cpp:479:download_sst_file(): [[email protected]:27101] download_sst_file remote_dir /user/s_pegasus/lpfsplit/c3tst-performance2/ingest_p4_10G/0 ,local_dir /home/work/ssd1/pegasus/c3tst-performance2/replica/reps/95.0.pegasus/bulk_load,f_meta.name 92.sst
F2024-05-22 14:28:29.536 (1716359309536349665 102089) replica.default6.04010000000007b5: filesystem.cpp:111:get_normalized_path(): assertion expression: len <= 4086
```

Secondary replica

The secondary replica receives the primary's message to cancel the bulk load task and clear _metadata.files, but it does not terminate all of its download tasks either, causing a core dump. Other replicas core dump for the same reason, which is why so many replica servers core dump:

```
D2024-05-22 14:28:28.992 (1716359308992917139 159129) replica.replica2.04006d6800010139: replica_bulk_loader.cpp:183:on_group_bulk_load(): [[email protected]:27101] receive group_bulk_load request, primary address = 10.142.102.47:27101, ballot = 7, meta bulk_load_status = replication::bulk_load_status::BLS_FAILED, local bulk_load_status = replication::bulk_load_status::BLS_DOWNLOADING
D2024-05-22 14:28:30.384 (1716359310384983585 159094) replica.default5.040100080000056a: replica_bulk_loader.cpp:479:download_sst_file(): [[email protected]:27101] download_sst_file remote_dir /user/s_pegasus/lpfsplit/c3tst-performance2/ingest_p4_10G/0 ,local_dir /home/work/ssd2/pegasus/c3tst-performance2/replica/reps/95.0.pegasus/bulk_load,f_meta.name 20
F2024-05-22 14:28:30.452 (1716359310452229248 159094) replica.default5.040100080000056a: filesystem.cpp:111:get_normalized_path(): assertion expression: len <= 4086
```
tcmalloc reports large alloc

Because _metadata.files has already been cleared, f_meta in download_sst_file is a dangling reference, so the length of f_meta.name read through it is garbage (enormous):

```cpp
const file_meta &f_meta = _metadata.files[file_index];
const std::string &file_name = utils::filesystem::path_combine(local_dir, f_meta.name);
```
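Why the length explodes instead of simply being zero: vector::clear() destroys the std::string elements, but the freed memory keeps whatever bytes the allocator later writes into it, so the size field read through the dangling reference is effectively a random integer. A minimal sketch of the mechanism (illustrative only; the behavior is undefined, so the printed value varies):

```cpp
#include <iostream>
#include <string>
#include <vector>

int main() {
    std::vector<std::string> files = {"33.sst"};
    const std::string &name = files[0]; // like the f_meta.name held by the task
    files.clear();                      // like clear_bulk_load_states(): 'name' now dangles

    // Undefined behavior: size() reads a length field from freed memory. In the
    // crashes above it came back as e.g. 410828079, so path_combine tried to
    // build a ~400 MB path, tcmalloc logged "large alloc", and the
    // len <= 4086 assertion in get_normalized_path fired.
    std::cout << name.size() << std::endl;
}
```

This matches the LEN values in the hundreds of millions seen in the logs: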
```
log.1.txt:F2024-05-20 17:22:49.621 (1716196969621641630 170503) replica.default11.0401000b000000de: filesystem.cpp:111:get_normalized_path(): lpf path chao chu 4096, get_normalized_path LEN 410828079
log.2.txt:F2024-05-20 17:23:38.595 (1716197018595730772 192879) replica.default10.040100040000002e: filesystem.cpp:111:get_normalized_path(): lpf path chao chu 4096, get_normalized_path LEN 532715376
log.1.txt:F2024-05-20 17:22:50.77 (1716196970077285996 164438) replica.default11.0401000b000000c6: filesystem.cpp:111:get_normalized_path(): lpf path chao chu 4096, get_normalized_path LEN 383022703
```
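The unsafe pattern could be avoided by making each download task re-validate the bulk load state before touching _metadata.files, and by copying the file metadata instead of holding a reference into it. Below is only a hypothetical, self-contained sketch of that guard pattern (all types and names are stand-ins, not the actual Pegasus/rDSN fix; an alternative would be to wait for in-flight tasks, e.g. via a task tracker, before clearing):

```cpp
#include <mutex>
#include <string>
#include <vector>

struct file_meta { std::string name; };

struct bulk_loader {
    std::mutex lock;
    bool downloading = true;      // stands in for BLS_DOWNLOADING
    std::vector<file_meta> files; // stands in for _metadata.files

    void clear_states() {
        std::lock_guard<std::mutex> g(lock);
        downloading = false;
        files.clear(); // safe now: no task can read 'files' without the lock
    }

    void download_sst_file(size_t file_index) {
        file_meta f_meta;
        {
            std::lock_guard<std::mutex> g(lock);
            if (!downloading || file_index >= files.size())
                return;                 // stale task: states were cleared, bail out
            f_meta = files[file_index]; // copy under the lock; never hold a reference
        }
        // ... download using the copied f_meta.name ...
    }
};

int main() {
    bulk_loader loader;
    loader.files = {{"88.sst"}};
    loader.clear_states();       // e.g. the ballot changed on a node restart
    loader.download_sst_file(0); // the stale task now returns safely instead of crashing
}
```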