perftest
One-direction bandwidth test fails with GPUDirect
Hello,
I am testing two P100 GPUs in two nodes with two cx555 NICs. The test succeeds in one direction but fails in the other.

Success:
    ./ib_write_bw --use_cuda=0 -a 10.10.10.11
    ./ib_write_bw -d mlx5_0 --use_cuda=0 -a

Fail:
    ./ib_write_bw --use_cuda=0 -a
    ethernet_read_keys: Couldn't read remote address
    Unable to read to socket/rdma_cm
    Failed to exchange data between server and clients

    ./ib_write_bw -d mlx5_0 --use_cuda=0 -a 10.10.10.10
    Completion with error at client
    Failed status 4: wr_id 0 syndrom 0x51
    scnt=128, ccnt=0
    Failed to complete run_iter_bw function successfully
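For reference, the numeric status in the client error can be translated to its libibverbs name. The sketch below assumes the standard `ibv_wc_status` enum order from `infiniband/verbs.h` and covers only the first few values; `wc_status_name` is a hypothetical helper name, not something perftest provides.

```shell
#!/bin/sh
# Hypothetical helper: map a numeric work-completion status to its
# ibv_wc_status name (assumes the enum order in infiniband/verbs.h).
# Only the first few values are listed; anything else is reported as unknown.
wc_status_name() {
    case "$1" in
        0) echo "IBV_WC_SUCCESS" ;;
        1) echo "IBV_WC_LOC_LEN_ERR" ;;
        2) echo "IBV_WC_LOC_QP_OP_ERR" ;;
        3) echo "IBV_WC_LOC_EEC_OP_ERR" ;;
        4) echo "IBV_WC_LOC_PROT_ERR" ;;
        5) echo "IBV_WC_WR_FLUSH_ERR" ;;
        *) echo "unknown (see enum ibv_wc_status in infiniband/verbs.h)" ;;
    esac
}
```

Calling `wc_status_name 4` gives the name for the "Failed status 4" reported at the client, which is a local protection error.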
Bandwidth tests between the two cx555 NICs themselves work well.
Driver and kernel: both cx555 NICs have the same driver and firmware. Both P100s have the same driver but different VBIOS versions. I am not using the NVIDIA open-source kernel module, since the P100 is not supported, but I do not think the kernel module is the problem; otherwise, why would one direction still work?
IOMMU on 10.10.10.11:
    sudo dmesg | grep -i dmar
    [ 0.173076] DMAR: IOMMU disabled

    sudo dmesg | grep -i iommu
    [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=44a5d7a3-4f19-4106-8a8c-66301c2c9d14 ro intel_iommu=off quiet splash vt.handoff=7
    [ 0.173010] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=44a5d7a3-4f19-4106-8a8c-66301c2c9d14 ro intel_iommu=off quiet splash vt.handoff=7
    [ 0.173076] DMAR: IOMMU disabled
    [ 2.245922] iommu: Default domain type: Translated
    [ 2.245922] iommu: DMA domain TLB invalidation policy: lazy mode
IOMMU on 10.10.10.10:
    sudo dmesg | grep -i dmar
    (no output)

    sudo dmesg | grep -i iommu
    [ 0.000000] Command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=6e849e25-4931-4c06-8684-bb553962f200 ro amd_iommu=off quiet splash vt.handoff=7
    [ 0.030879] Kernel command line: BOOT_IMAGE=/boot/vmlinuz-6.8.0-41-generic root=UUID=6e849e25-4931-4c06-8684-bb553962f200 ro amd_iommu=off quiet splash vt.handoff=7
    [ 1.861879] iommu: Default domain type: Translated
    [ 1.861879] iommu: DMA domain TLB invalidation policy: lazy mode

I have turned the IOMMU off on the kernel command line on both nodes (intel_iommu=off on one, amd_iommu=off on the other), but the outputs are different.
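To compare the IOMMU-related boot parameters of the two nodes directly, something like the following could be run on each. This is only a minimal sketch: `iommu_params` is a hypothetical helper name, and it just filters the kernel command line, so it shows what was requested at boot rather than what the kernel actually did.

```shell
#!/bin/sh
# Hypothetical helper: print every IOMMU-related parameter in a kernel
# command line, one per line. With no argument it reads /proc/cmdline
# of the running kernel; with an argument it parses that string instead.
iommu_params() {
    if [ "$#" -gt 0 ]; then
        cmdline=$1
    else
        cmdline=$(cat /proc/cmdline)
    fi
    # Split on spaces and keep tokens mentioning "iommu"
    # (for example intel_iommu=off or amd_iommu=off).
    printf '%s\n' "$cmdline" | tr -s ' ' '\n' | grep -i 'iommu' \
        || echo "(no iommu parameter set)"
}
```

Running `iommu_params` with no argument on each node and comparing the results makes it easy to confirm that both machines booted with the intended setting.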
What are the possible causes of this issue, and how can I dig deeper to find the cause and a solution?
Thanks