ASSERT(cm_->open_thread_local_device(idx) != nullptr) in src/core/rworker.cc

Open psistakis opened this issue 4 years ago • 6 comments

Hi,

Have you encountered this assertion before? Before it fires, there are several warning messages saying that port_id 1 on device 1 is not active.

In order to build the project, I used the suggested flags (cmake -DUSE_RDMA=1 -DONE_SIDED_READ=1 -DROCC_RBUF_SIZE_M=13240 -DRDMA_STORE_SIZE=5000 -DRDMA_CACHE=0 -DTX_LOG_STYLE=2).

When I run: ./run2.py config.xml noccocc "-t 24 -c 10 -r 100" bank 2 (I use the default config.xml and I have added two (2) hostnames in the hosts.xml file), I get the output below.

I have also set use_port_ to 0 in RWorker::choose_rnic_port() as suggested in #2 (https://github.com/SJTU-IPADS/drtmh/issues/2), since I have 1 NIC per machine. Furthermore, I have made the change described in #4 (https://github.com/SJTU-IPADS/drtmh/issues/4).

I would appreciate any feedback.

Thank you.

Output:

NOCC started with program [noccocc]. at 08-06-2021 11:04:12
[bench_runner.cc:303] Use TCP port 33333
[bench_runner.cc:325] use scale factor: 24; with total 24 threads.
[view.h:48] Start with 0 backups.
[view.cc:10] total 2 backups to assign
[Bank]: check workload 25, 15, 15, 15, 15, 15
[util.cc:167] huge page real size 12.9316G
[rnic.hpp:60] query port_id 1 on device 1 not active.
[bench_runner.cc:135] Total logger area 0.00390625G.
[bench_runner.cc:146] add RDMA store size 4.88281G.
[bench_runner.cc:156] First 4.88867G are left over.
[bench_runner.cc:159] RDMA heap size 8.041G.
[util.cc:167] huge page real size 0.294922G
[util.cc:167] huge page real size 0.294922G
[Bank], total 4800000 accounts loaded
[bank_main.cc:262] check cv balance 46280
[Runner] local db size: 220.746 MB
[Runner] Cache size: 0 MB
[bench_runner.cc:210] backed list num: 0
[bench_listener2.cc:70] try log results to ./results/noccocc_bank_2_24_10_100.log
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rdma_ctrl_impl.hpp:82] wrong dev_id: -1; total 2 found
[rworker.cc:106] Assertion!
[rnic.hpp:60] query port_id 1 on device 1 not active.
[NOCC] Meet an assertion failure!
stack trace:
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
[rnic.hpp:60] query port_id 1 on device 1 not active.
./noccocc() [0x4c0225]
/lib/x86_64-linux-gnu/libc.so.6 : ()+0x354c0
/lib/x86_64-linux-gnu/libc.so.6 : gsignal()+0x38
/lib/x86_64-linux-gnu/libc.so.6 : abort()+0x16a
./noccocc : nocc::MessageLogger::~MessageLogger()+0x2ee
./noccocc : nocc::oltp::RWorker::init_rdma(char*, unsigned long)+0x452
./noccocc : nocc::oltp::BenchWorker::run()+0x2d1
./noccocc : ndb_thread::pthread_bootstrap(void*)+0xf
/lib/x86_64-linux-gnu/libpthread.so.0 : ()+0x76ba
/lib/x86_64-linux-gnu/libc.so.6 : clone()+0x6d
[ENDING] End benchmarks
[ENDING] send ending messages in SIGINT handler
[ENDING] kill processes
node0 password:
node1 password:
kill try 0
node0 password:
node1 password:
Kill done
[ENDING] kill processes done

psistakis avatar Jun 08 '21 17:06 psistakis

Hi,

According to this log line:

[rnic.hpp:60] query port_id 1 on device 1 not active.

it seems that a device on your machine is not active. Could you please check the output of ibstatus?

wxdwfc avatar Jun 09 '21 01:06 wxdwfc

Hi,

Thanks for the reply.

Based on the output, it seems the cluster has two devices per machine, and port 1 of the second device is inactive (at least that is my understanding). I am afraid I do not have physical access to the cluster to confirm this, but I can double-check with someone who does. Is there a way to bypass this issue, i.e., use only one device and one port per machine?

Thank you.

The ibstatus on each machine returns the following:

Infiniband device 'mlx5_0' port 1 status:
    default gid: XXX
    base lid: 0x6
    sm lid: 0x4
    state: 4: ACTIVE
    phys state: 5: LinkUp
    rate: 100 Gb/sec (4X EDR)
    link_layer: InfiniBand

Infiniband device 'mlx5_1' port 1 status:
    default gid: XXX
    base lid: 0xffff
    sm lid: 0x0
    state: 1: DOWN
    phys state: 3: Disabled
    rate: 10 Gb/sec (4X SDR)
    link_layer: InfiniBand
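For readers who want to check this programmatically, here is a minimal C++ sketch that scans ibstatus text and reports which (device, port) pairs are ACTIVE. It assumes the usual layout in which ibstatus prints one field per line. The names `PortStatus` and `parse_ibstatus` are made up for illustration and are not part of DrTM+H or of any InfiniBand tool.

```cpp
#include <sstream>
#include <string>
#include <vector>

// One (device, port) entry parsed from ibstatus text (hypothetical helper type).
struct PortStatus {
    std::string device;   // e.g. "mlx5_0"
    int port = 0;         // port number as printed by ibstatus
    bool active = false;  // true if the logical "state:" line says ACTIVE
};

// Parse ibstatus output, assuming one field per line.
std::vector<PortStatus> parse_ibstatus(const std::string& text) {
    std::vector<PortStatus> out;
    std::istringstream in(text);
    std::string line;
    const std::string header = "Infiniband device '";
    while (std::getline(in, line)) {
        const auto h = line.find(header);
        if (h != std::string::npos) {
            // Header line: extract the quoted device name and the port number.
            const auto name_begin = h + header.size();
            const auto name_end = line.find('\'', name_begin);
            if (name_end == std::string::npos) continue;
            const auto port_pos = line.find("port ", name_end);
            if (port_pos == std::string::npos) continue;
            PortStatus ps;
            ps.device = line.substr(name_begin, name_end - name_begin);
            ps.port = std::stoi(line.substr(port_pos + 5));
            out.push_back(ps);
        } else if (!out.empty() &&
                   line.find("phys state:") == std::string::npos) {
            // Logical link state only; the "phys state:" line is skipped above.
            const auto s = line.find("state:");
            if (s != std::string::npos &&
                line.find("ACTIVE", s) != std::string::npos) {
                out.back().active = true;
            }
        }
    }
    return out;
}
```

Fed the two entries above, one field per line, this should report mlx5_0 port 1 as active and mlx5_1 port 1 as not active, which matches the "query port_id 1 on device 1 not active" warning in the log.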

psistakis avatar Jun 09 '21 09:06 psistakis

Hi,

Thanks for sending me more information.

According to the ibstatus results, port 1 on the second NIC (mlx5_1) is not available.

To specify which port each thread uses, you can customize DrTM+H by modifying choose_rnic_port() in src/core/rworker.cc so that it uses an active port. Hopefully this fixes your problem.
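As an illustration of the idea only, the sketch below assigns worker threads to ports by cycling over the active ones and skipping any that are down. It is not the actual body of choose_rnic_port(); the `DevPort` type and the `choose_active_port` function are hypothetical, and how the active flags get filled in (for example from ibstatus) is left to the reader.

```cpp
#include <cstddef>
#include <optional>
#include <vector>

// Hypothetical stand-in for a (device, port) pair; not DrTM+H's real type.
struct DevPort {
    int dev_id;   // 0-based device index
    int port_id;  // port number as printed by ibstatus
    bool active;  // whether the port's link state is ACTIVE
};

// Pick a port for a worker thread by cycling over active ports only.
// Returns std::nullopt when no port is active.
std::optional<DevPort> choose_active_port(const std::vector<DevPort>& ports,
                                          std::size_t worker_id) {
    std::vector<DevPort> usable;
    for (const auto& p : ports) {
        if (p.active) usable.push_back(p);
    }
    if (usable.empty()) return std::nullopt;
    return usable[worker_id % usable.size()];
}
```

With one active and one inactive port, as on the machines in this thread, every worker ends up on the single active port.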

Thanks!

wxdwfc avatar Jun 09 '21 11:06 wxdwfc

Hi,

Thank you for your feedback.

Just to make sure I understand: if the name mlx5_X from the output I sent earlier shows the port number, then port 0 (mlx5_0) is the one that is active, correct?

If that is the case, as I mentioned earlier (first comment), I have set the use_port_ to be 0 in RWorker::choose_rnic_port() as suggested in #2. Is this your suggestion? I have tried this change before + re-building the project, but I get the same output.

Please let me know if I have misunderstood something.

Thank you.

psistakis avatar Jun 09 '21 12:06 psistakis

Hi,

I think you understand correctly. It is strange that using the first device does not resolve the issue; I have not encountered this problem before. I'm sorry, but I cannot help further if you are already using the active device (i.e., dev_id = 0 & port_idx = 1) and the error still occurs.

Thanks.

wxdwfc avatar Jun 09 '21 12:06 wxdwfc

Hi,

Thanks for the feedback.

I tried the following, and it seems to have worked.

In init_rdma() in src/core/rworker.cc, I set idx to a fixed value (dev_id = 0, port_id = 1) instead of using cm_->convert_port_idx(). More specifically:

RdmaCtrl::DevIdx idx = RdmaCtrl::DevIdx{.dev_id = 0, .port_id = 1};
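A slightly less hard-coded variant could keep the computed index when it is valid and fall back to the known-good one only when it is not, as with the dev_id of -1 in the log above. The sketch below is a self-contained illustration under that assumption. `DevIdx` here is a hypothetical two-field mirror of the type named in this thread, and `pick_dev_idx` is an invented helper, not a DrTM+H function.

```cpp
// Hypothetical mirror of the two fields mentioned in this thread.
struct DevIdx {
    int dev_id;
    int port_id;
};

// Return `computed` if its dev_id lies in [0, num_devices); otherwise return
// `fallback`, e.g. the (dev_id = 0, port_id = 1) pair that ibstatus shows as ACTIVE.
DevIdx pick_dev_idx(DevIdx computed, int num_devices, DevIdx fallback) {
    const bool valid = computed.dev_id >= 0 && computed.dev_id < num_devices;
    return valid ? computed : fallback;
}
```

For example, `pick_dev_idx(DevIdx{-1, 0}, 2, DevIdx{0, 1})` yields the fallback, while a computed index with dev_id 1 on a two-device machine is returned unchanged. The sketch uses positional aggregate initialization instead of designated initializers so that it also compiles before C++20.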

Thank you for your help! :)

psistakis avatar Jun 09 '21 17:06 psistakis