ASSERT(cm_->open_thread_local_device(idx) != nullptr) in src/core/rworker.cc
Hi,
Have you seen this assertion before? Before it fires, there are several warnings saying that port_id 1 on device 1 is not active.
To build the project, I used the suggested flags (cmake -DUSE_RDMA=1 -DONE_SIDED_READ=1 -DROCC_RBUF_SIZE_M=13240 -DRDMA_STORE_SIZE=5000 -DRDMA_CACHE=0 -DTX_LOG_STYLE=2).
I get the output below when I run: ./run2.py config.xml noccocc "-t 24 -c 10 -r 100" bank 2 (I use the default config.xml and have added two hostnames to the hosts.xml file).
I have also set use_port_ to 0 in RWorker::choose_rnic_port() as suggested in #2, since I have one NIC per machine. In addition, I have applied the change described in #4.
I would appreciate any feedback.
Thank you.
Output:
NOCC started with program [noccocc]. at 08-06-2021 11:04:12 [bench_runner.cc:303] Use TCP port 33333 [bench_runner.cc:325] use scale factor: 24; with total 24 threads. [view.h:48] Start with 0 backups. [view.cc:10] total 2 backups to assign [Bank]: check workload 25, 15, 15, 15, 15, 15 [util.cc:167] huge page real size 12.9316G [rnic.hpp:60] query port_id 1 on device 1 not active. [bench_runner.cc:135] Total logger area 0.00390625G. [bench_runner.cc:146] add RDMA store size 4.88281G. [bench_runner.cc:156] First 4.88867G are left over. [bench_runner.cc:159] RDMA heap size 8.041G. [util.cc:167] huge page real size 0.294922G [util.cc:167] huge page real size 0.294922G [Bank], total 4800000 accounts loaded [bank_main.cc:262] check cv balance 46280 [Runner] local db size: 220.746 MB [Runner] Cache size: 0 MB [bench_runner.cc:210] backed list num: 0 [bench_listener2.cc:70] try log results to ./results/noccocc_bank_2_24_10_100.log [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. 
[rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rdma_ctrl_impl.hpp:82] wrong dev_id: -1; total 2 found [rworker.cc:106] Assertion! [rnic.hpp:60] query port_id 1 on device 1 not active. [NOCC] Meet an assertion failure! stack trace: [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. [rnic.hpp:60] query port_id 1 on device 1 not active. ./noccocc() [0x4c0225] /lib/x86_64-linux-gnu/libc.so.6 : ()+0x354c0 /lib/x86_64-linux-gnu/libc.so.6 : gsignal()+0x38 /lib/x86_64-linux-gnu/libc.so.6 : abort()+0x16a ./noccocc : nocc::MessageLogger::~MessageLogger()+0x2ee ./noccocc : nocc::oltp::RWorker::init_rdma(char*, unsigned long)+0x452 ./noccocc : nocc::oltp::BenchWorker::run()+0x2d1 ./noccocc : ndb_thread::pthread_bootstrap(void*)+0xf /lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x76ba /lib/x86_64-linux-gnu/libc.so.6 : clone()+0x6d [ENDING] End benchmarks [ENDING] send ending messages in SIGINT handler [ENDING] kill processes node0 password: node1 password: kill try 0 node0 password: node1 password: Kill done [ENDING] kill processes done
Hi,
According to the log line
[rnic.hpp:60] query port_id 1 on device 1 not active.
it seems that a device on your machine is not active. Could you please check the output of ibstatus?
(In reply to Antonis Psistakis, June 9, 2021, 1:07 AM.)
Hi,
Thanks for the reply.
Based on the output, it seems the cluster has two devices per machine, and port 1 of the second device is inactive (at least that is my understanding). I am afraid I do not have physical access to the cluster to confirm this, but I can double-check with someone who does. Is there a way to bypass this issue, i.e., to use only one device and one port per machine?
Thank you.
The ibstatus on each machine returns the following:
Infiniband device 'mlx5_0' port 1 status:
    default gid: XXX
    base lid: 0x6
    sm lid: 0x4
    state: 4: ACTIVE
    phys state: 5: LinkUp
    rate: 100 Gb/sec (4X EDR)
    link_layer: InfiniBand

Infiniband device 'mlx5_1' port 1 status:
    default gid: XXX
    base lid: 0xffff
    sm lid: 0x0
    state: 1: DOWN
    phys state: 3: Disabled
    rate: 10 Gb/sec (4X SDR)
    link_layer: InfiniBand
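For reference, the active device/port pairs can be read off the ibstatus output above programmatically. The following is only a minimal sketch: it assumes the text has one field per line, as ibstatus normally prints it, and the names PortStatus and parse_ibstatus are hypothetical and not part of DrTM+H.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical record for one port as reported by ibstatus.
struct PortStatus {
  std::string device;   // e.g. "mlx5_0"
  int port = 0;         // port number as printed by ibstatus (1-based)
  bool active = false;  // true when the "state:" field contains ACTIVE
};

// Parse ibstatus-style text (one field per line) into port records.
// Assumption: each port block starts with "Infiniband device '<name>' port <n> status:".
std::vector<PortStatus> parse_ibstatus(const std::string& text) {
  std::vector<PortStatus> ports;
  std::istringstream in(text);
  std::string line;
  const std::string header = "Infiniband device '";
  while (std::getline(in, line)) {
    const auto h = line.find(header);
    if (h != std::string::npos) {
      const auto name_begin = h + header.size();
      const auto name_end = line.find('\'', name_begin);
      if (name_end == std::string::npos) continue;
      const auto port_pos = line.find("port ", name_end);
      if (port_pos == std::string::npos) continue;
      PortStatus p;
      p.device = line.substr(name_begin, name_end - name_begin);
      p.port = std::stoi(line.substr(port_pos + 5));  // throws if no number follows
      ports.push_back(p);
    } else if (!ports.empty() && line.find("phys state:") == std::string::npos) {
      // Only the logical "state:" line decides whether the port is usable.
      const auto s = line.find("state:");
      if (s != std::string::npos) {
        ports.back().active = line.find("ACTIVE", s) != std::string::npos;
      }
    }
  }
  return ports;
}
```

Applied to the output above, this would report mlx5_0 port 1 as active and mlx5_1 port 1 as down, which suggests that the usable pair is the first device (index 0) with port number 1.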
Hi,
Thanks for sending me more information.
According to the ibstatus results, port 1 on the second device (mlx5_1) is down, so it is not available.
To specify which port each thread uses, you can customize DrTM+H by modifying choose_rnic_port() in src/core/rworker.cc so that it picks an active port.
Hopefully this fixes your problem.
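As an illustration of the idea of assigning each thread an active port, here is a minimal sketch. It is not the actual choose_rnic_port() code: the DevIdx struct below is a stand-in that only mirrors the dev_id/port_id pair mentioned in the log messages, and port_for_worker is a hypothetical helper. It requires C++17 for std::optional.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Stand-in for a (device, port) pair; an assumption, not the DrTM+H type.
struct DevIdx {
  int dev_id;   // device index, e.g. 0 for the first device
  int port_id;  // port number on that device
};

// Hypothetical helper: spread workers round-robin over the ports that are
// known to be active. Returns nothing if no port is active at all.
std::optional<DevIdx> port_for_worker(std::size_t worker_id,
                                      const std::vector<DevIdx>& active_ports) {
  if (active_ports.empty()) return std::nullopt;
  return active_ports[worker_id % active_ports.size()];
}
```

With a single active port in the list, every worker gets that same pair, so a down port is never selected.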
Thanks!
(In reply to Antonis Psistakis, June 9, 2021, 5:15 PM.)
Hi,
Thank you for your feedback.
Just to make sure I understand: if the name mlx5_X from the output I sent earlier shows the port number, then port 0 (mlx5_0) is the one that is active, correct?
If that is the case, then as I mentioned in my first comment, I have already set use_port_ to 0 in RWorker::choose_rnic_port() as suggested in #2. Is this what you are suggesting? I made this change and rebuilt the project before, but I get the same output.
Please let me know if I have misunderstood something.
Thank you.
Hi,
I think you understand correctly. It is strange that using the first device does not fix the issue; I have not run into this before. I'm sorry, but I cannot help further if the error still occurs while you are using the active device (i.e., dev_id = 0 & port_idx = 1).
Thanks.
(In reply to Antonis Psistakis, June 9, 2021, 8:05 PM.)
Hi,
Thanks for the feedback.
I tried the following, and it seems to have worked.
In init_rdma() in src/core/rworker.cc, I set idx to a fixed value (dev_id = 0, port_id = 1) instead of using cm_->convert_port_idx(). More specifically:
RdmaCtrl::DevIdx idx = RdmaCtrl::DevIdx{.dev_id = 0, .port_id = 1};
Thank you for your help! :)