perftest icon indicating copy to clipboard operation
perftest copied to clipboard

SRD CQ notifications

Open gal-pressman opened this issue 3 years ago • 9 comments

Add support for SRD CQ notifications. The first patch fixes a bug where the same completion channel is used for both TX and RX and messes up the test.

gal-pressman avatar Oct 11 '21 07:10 gal-pressman

Hi,

Can you please add more details what issue the first patch actually resolves?

Thanks

amitgeron avatar Feb 03 '22 10:02 amitgeron

IIRC, the test uses the same completion channel and does:

  • Send packet
  • Remote side receives the packet and send another packet
  • Receive packet from remote

And assumes that the next event is necessarily from the send completion, but it could theoretically be from the receive completion.

@firasj might remember the exact details.

gal-pressman avatar Feb 06 '22 09:02 gal-pressman

The unidirectional BW test for the client (run_iter_bw) is not supposed to receive data at all, so an event should never be delivered for the recv_cq, and the server side (run_iter_bw_server) isn't supposed to send any data, so an event should never be delivered for the send_cq.

Did you encounter any bug or issues when running the tests?

amitgeron avatar Feb 06 '22 10:02 amitgeron

It was a real issue, some tests (send_lat for example) do both send and receive, and they might fail without this fix.

gal-pressman avatar Feb 10 '22 06:02 gal-pressman

As @galpress has already mentioned, this is a real issue in latency tests that end up crashing the test.

We do not check if the completion is a send or receive. This may lead to assuming that we received a Tx completion when in fact it was an Rx completion, and in turn posting a Tx WQE without polling the TX CQ. Because the queue depth is 1 in latency tests, this results in an error (at least in EFA provider).

firasj avatar Feb 15 '22 18:02 firasj

@HassanKhadour Can you please merge this PR or suggest some extra improvements? Thanks!

mrgolin avatar Jul 24 '22 09:07 mrgolin

@firasj could you please provide the case that will reproduce this issue with the command you used?

sshaulnv avatar Jul 31 '22 10:07 sshaulnv

@firasj can you please provide the command lines we were running? @sshaulnv what is needed to get this merged? Do you want me to test this on mlx5 hardware?

gal-pressman avatar Sep 01 '22 16:09 gal-pressman

Hi Gal, yes we need the commands in order to test it on mlx5 hardware

HassanKhadour avatar Sep 05 '22 14:09 HassanKhadour

@gal-pressman Can you please close this PR for now? It's been on a low priority and we'll reopen once we have a reproduction and an answer.

firasj avatar Nov 27 '22 09:11 firasj

Hi, running:

for i in {1..20}; do ib_send_lat -c SRD -x 0 -F --report_gbits -a -n 10000 -r 1 -e & ib_send_lat -c SRD -x 0 -F --report_gbits -a -n 10000 -r 1 -e localhost; done

is reproducing the issue on EFA HW. @sshaulnv @gal-pressman can you please test this with mlx5 HW?

mrgolin avatar Mar 09 '23 09:03 mrgolin

I ran the same command with mlx5 (UD instead of SRD), it works both with and without this patch, should be safe to merge.

gal-pressman avatar Mar 09 '23 10:03 gal-pressman

@HassanKhadour Can you please merge this?

mrgolin avatar May 03 '23 16:05 mrgolin

Hi @mrgolin , conflicts should be resolved to merge.

HassanKhadour avatar May 09 '23 06:05 HassanKhadour

@HassanKhadour I've rebased and pushed as new PR https://github.com/linux-rdma/perftest/pull/211.

mrgolin avatar May 21 '23 12:05 mrgolin