onload icon indicating copy to clipboard operation
onload copied to clipboard

The latency of ef_eventq_poll() and ef_vi_transmit() is not stable.

Open Roser-ZY opened this issue 3 years ago • 14 comments

The latency of above two funcitons sometimes reaches to 2000ns~10000ns, and the frequency is uncertain. Is that normal or it should not appear?

Thanks for help.

Roser-ZY avatar Oct 13 '21 08:10 Roser-ZY

Can you tell us more about your hardware and software configuration?

On a well-tuned system the runtime of both of those functions should be tiny (probably <100ns, I haven't measured it recently), but here are a few ways in which it can go bad:

  • Running in a VM
  • A bad power management config so that your CPU keeps throttling
  • Context switches
  • Using AF_XDP rather than a Xilinx NIC, the performance depends on both the kernel and the NIC's driver
  • Running a debug build
  • Bugs in your measurement technique, e.g. counting more than you thought

rhughes-xilinx avatar Oct 13 '21 09:10 rhughes-xilinx

Thank you so much for such a quick and detailed reply! I call the function ef_eventq_poll() to handle events and the usage of the function is similar to the following:

void feedback() {
   ...
    while(1) {

        // Record the time.

        n_event = ef_eventq_poll(&vi, evs, sizeof(evs) / sizeof(evs[0])); // Just as the usage in send.c.

        // Record the time,
        //  and then I calculate the difference.

        ...
    }
    ...
}

and the function feedback() runs as a thread.

The challenge is when the variable n_event is not 0 (zero) the latency reaches to above 2000ns, otherwise the latency is acceptable. What's more, when I use the function ef_eventq_poll() after the function ef_vi_transmit() synchronously the latency is always tiny, no matter the variable n_event is 0 (zero) or not. Just like the following:

// Acquire a free packet_buffer . (I accomplish it by myself, just use some indexes to control the packet_buffers array.)
while(! acquire_pkt_buf() );

ef_vi_transmit(vi, dma_buf_addr, frame_len, id);
...
n_event = ef_eventq_poll(&vi, evs, sizeof(evs) / sizeof(evs[0]));
...
// If the packet_buffer can be reused, release it. (I accomplish it too.)

But in this way the DMA will run out because I use a packet_buffer (which in a packet_buffers array just as the usage in the document) to transmit and can't release it immediately (because the variable n_event is 0, the code can't reach the release function) and finally lead to infinite loop while( ! acquire_pkt_buf() );.

Finally, I have summarized the following questions:

  • When transmitting, should I run a thread to call the function ef_eventq_poll() cyclically to listen the arrival of events, or I just call it synchronously after the function ef_vi_transmit()
  • If I use the function ef_eventq_poll() synchoronously after the function ef_vi_transmit(), is it normal that I get 0 (zero) from the return of the function ef_eventq_poll()? Is there is a delay between the function ef_vi_transmit_pull() and the function ef_eventq_poll()?
  • I notice that the return of the function ef_vi_transmit_unbundle() is always 1 (one) when the variable n_event is not 0 (zero), and I wonder if there are some ways to improve it? (I call the function ef_vi_transmit() to transmit one packet_buffer every time)

Please excuse me for my poor English. And maybe I use ef_vi in a wrong way, tomorrow I will continue checking my codes.

Good night, and thank you very much for your patient!

Roser-ZY avatar Oct 14 '21 15:10 Roser-ZY

2000ns for ef_eventq_poll() does seem extremely high. You didn't answer my question about your hardware and software setup. I think my best guess at this stage is that you're running on ef_vi's AF_XDP backend and with some kind of network card that has a very slow driver (possibly some kind of paravirtualized card). Are you able to use perf to investigate further where the CPU time is going. Although I should warn you that if you're not using a Xilinx NIC then there's very little I'll be able to do to help you improve latency.

Answering your specific questions:

  • ef_vi does not do any mutexing itself. If you are using the same ef_vi* from multiple threads then it is your responsibility to stop them racing with each other. The typical guidance is that for most apps it's most efficient to run everything in one thread and do away with the mutexes.
  • Yes, completion of a transmit will take some time. At a minimum this is the time taken to stream out the bits on the wire, which you can compute by knowing the packet size and Ethernet speed. There's also the round-trip time across the PCI bus and any event coalescing configuration to consider
  • Event coalescing is NIC-specific, with the current configuration viewable using ethtool -c. The thresholds are usually quite small, so if you're getting 2us delays regularly then it doesn't surprise me that not much coalescing is happening. ef_vi_transmit() actually does two things: ef_vi_transmit_init() which tells the NIC where the data is, and ef_vi_transmit_push() which prods the hardware to tell it that there's some data to send. The second is much slower than the first, so if you can _init() several packets and then _push() once at the end then that'll be a lot quicker.

rhughes-xilinx avatar Oct 14 '21 15:10 rhughes-xilinx

Well, I'm sorry I forget to check the configuration. The following is the NICs:

87 49网卡型号

Although I only have two types of network cards and I do use the second NIC, I still don't know if I'm using Xilinx NIC or AF_XDP and what should I do if I use the wrong hardware. I check the detailed of the NIC I'm using by ethtool -i:

87 49网卡详细信息

Then I checked the configuration using ethtool -c and I got the following:

87 49网卡详细配置

And for the question I have asked, I have some additional descriptions:

  • I only start a thread to poll for events for each Virtual Interface, and for each sending thread I use different Virtual Interfaces.
  • Whether the TX can be reused automatically when transmitting is complete or I should call some functions? And when I use ef_vi_transmit_space() in this way:
      while (!(n_tx_free = ef_vi_transmit_space(&vi)) > 0) {
		cout << "TX Loop" << endl;	
	}

Is it possible that ef_vi_transmit_space() gets a non-zero value and then jumps out of the loop?

Thank you again for your patience. Have a nice day~

Roser-ZY avatar Oct 15 '21 02:10 Roser-ZY

I'm kinda busy today, so I haven't thought as much about possible causes this time:

2000ns when using a Xilinx NIC is entirely unexpected. Is it possible that you have a very strange NUMA configuration or the world's largest multi-socket machine? We only expect cross-NUMA traffic to lose about 200ns, so it'd have to be very odd. It's worth you playing with perf a bit to find out where the cycles are disappearing.

You must call ef_eventq_poll() in order to receive tx-completion events. The while loop you've shown will not work. Please read the relevant sections of the ef_vi user guide (e.g. 040_using.dox:193) and the example efsend.

rhughes-xilinx avatar Oct 15 '21 15:10 rhughes-xilinx

If you are unsure whether the OS and BIOS are suitably set up to avoid scheduling interruptions (which would easily cause a delay of the length of time you're highlighting) then there is a tool called sysjitter you can run which will just spin on a CPU and measure the distribution in times between each loop. This would give an indication about whether the problem is in OS configuration or in the time it takes to execute those ef_vi functions.

https://github.com/Xilinx-CNS/cns-sysjitter

kieranm-xilinx avatar Oct 18 '21 08:10 kieranm-xilinx

OK, thank you very much! I'll check the configuration and follow your advice. Well, I'm a little busy this week too (XD), so maybe I'll figure out the problem next week. Any way, thanks for your patient!

Roser-ZY avatar Oct 19 '21 01:10 Roser-ZY

Well, I change the strategy to ef_pio and I found the TX always being trapped into the dead loop. I have used the ef_eventq_poll(), but it seems that the TX sometimes isn't released correctly.

The udp_send() function is following:

void SFUtil::udp_send(const char* msg, const size_t msg_size) {
	if (!is_init) {
		cout << "Send packet failed!\nPlease initialize SFUtil first.\nCall the function: init()." << endl;
		return;
	}
	// Error code when failed.
	int err = 0;
	std::error_code err_code;

	// Initialize the amount of sending and possible spaces.
	n_tx_free = 0;

	// Initialize the FINAL length of the packet.
	pkt_len = 0;

	// Available Packet Buffer.
	packet_buffer* pkt_buf = accquire_buffer();
	// Debug
	dbgprintf("Get an available DMA, whose id is %d.\n", pkt_buf->id);

	// Assemble UDP packet.
	pkt_len = assemble_udp_packet(msg, msg_size, pkt_buf);
	dbgprintf("The packet buffer length is %d", pkt_len);
	/*cout << pkt_buf->buf << endl;*/
	cout << (uint8_t*) pkt_buf->udp << endl;
	dbgprintf("The packet buffer length is 0x%08x", *(uint32_t*)pkt_buf->buf);
	// Debug
	dbgprintf("Assembling to UDP packet completed!\n");
	// Loop until TX has spaces.
        /*******************Dead loop here*******************/
	while (!(n_tx_free = ef_vi_transmit_space(&vi)) > 0);
	// Debug
	/*dbgprintf("Get an available TX descriptor.\n");*/
	if (err = ef_vi_transmit_copy_pio(&vi, 0, (char*)pkt_buf->buf, pkt_len, pkt_buf->id)) {
		err_code.assign(~err | 0x01, err_code.category());
		printf("Message transmit failed. Error code -%d: %s.\n", err_code.value(), err_code.message().c_str());
		abort();
	}
	dbgprintf("Message is now in transmit queue.\n");
}

And the event handle function is following:

void SFUtil::feedback(void) {
	ef_request_id ids[EF_VI_TRANSMIT_BATCH];
	ef_event events[EVENT_BATCH_SIZE];
	int n_event = 0;
	int n_completed = 0;
	int n_error = 0;
	int i;

// Debug
#ifdef DEBUG
	uint32_t msg_count = 0;
	dbgprintf("\n\n\nCall feedback!\n\n\n");
#endif

	while (true) {
		n_event = ef_eventq_poll(&vi, events, sizeof(events) / sizeof(events[0]));
		if (n_event > 0) {
			for (i = 0; i < n_event; ++i) {
				switch (EF_EVENT_TYPE(events[i])) {
				case EF_EVENT_TYPE_TX:
					n_completed = ef_vi_transmit_unbundle(&vi, &events[i], ids);
					if (n_completed > 0) {
						release_buffer(n_completed);
					}

					// Debug
					#ifdef DEBUG
					++msg_count;
					printf("\tSending %d message(s).\n\tThere are totally %d message(s) has been sent.\n\n", n_completed, msg_count);
					#endif
					
					break;
				case EF_EVENT_TYPE_TX_ERROR:
					n_error = ef_vi_transmit_unbundle(&vi, &events[i], ids);
					if (n_error > 0) {
						printf("Warning: There is(are) %d transmition(s) failed.\n");
					}
					break;
				default:break;
				}
			}
		}
	}
}

What's more, the reason of high latency haven't been found out, I've tried bind the thread to a cpu core but it didn't work. High latency is intermittent. If necessary, I can give you the complete codes.

Thank you very much! Han.

Roser-ZY avatar Nov 17 '21 08:11 Roser-ZY

You don't say it explicitly, but the only way I can imagine that code partially working is if you're running udp_send() and feedback() on two different threads concurrently. That's incorrect: ef_vi has no mutexes internally, so it's a race.

You are expected to supply the locking yourself if using the same ef_vi instance from multiple threads. However that's not how we recommend the API be used; you should try to have one instance per application thread and hence the cost of the atomic instructions isn't needed. An outline of the modification to your code above would be to remove the while(true) loop from feedback() and adjust the "dead loop" in udp_send() to be something like:

while (!(n_tx_free = ef_vi_transmit_space(&vi)) > 0)
  feedback();

rhughes-xilinx avatar Nov 17 '21 10:11 rhughes-xilinx

OK, thanks for you advice and I'll change my codes. By the way, if I want to sent packets with Stream Transfer, which transmit should I use? PIO? CTPIO? Will it affect the latency if I process events synchronously just like the way you showed? And what's more, I find that the high latency ( above 1000ns ) will appear even if I do nothing:

start = get_realTime();
// Nothing
stop = get_realTime();
cout << stop - start << endl;

and I suppose the reason is the CPU switch in multi-thread. But the latency is still high ( above 1000ns ) even if I bind CPU core with the thread. Your suggestion inspired me, I'll try it.

Thank you very much! Han.

Roser-ZY avatar Nov 18 '21 02:11 Roser-ZY

OK, thanks for you advice and I'll change my codes. By the way, if I want to sent packets with Stream Transfer, which transmit should I use? PIO? CTPIO? Will it affect the latency if I process events synchronously just like the way you showed? And what's more, I find that the high latency ( above 1000ns ) will appear even if I do nothing:

start = get_realTime();
// Nothing
stop = get_realTime();
cout << stop - start << endl;

and I suppose the reason is the CPU switch in multi-thread. But the latency is still high ( above 1000ns ) even if I bind CPU core with the thread. Your suggestion inspired me, I'll try it.

Thank you very much! Han.

I have tried, and there is an interesting result.

        // Loop until TX has spaces.
	while (!(n_tx_free = ef_vi_transmit_space(&vi) > 0)) {
		++tx_count;
		feedback();
		cout << "TX Loop: " << tx_count << endl;
		if (tx_count == 300) {
			cout << "TX Free Count: " << tx_free_count << endl;
			abort();
		}
	}
	tx_free_count += n_tx_free;

	if (tx_count > 0) {
		cout << "TX Free: " << n_tx_free << endl;
	}

Every time the feedback() is called, there is some events received. However, the return of ef_vi_transmit_space(&vi) is always 1. After running for a while the return of ef_vi_transmit_space(&vi) is 0 and the event number is also 0, and the program trapped into DEAD LOOP. What's more, the latency still sometimes reaches to above 1000ns, but now I'm sure the latency is not about the ef_vi. I just delete all ef_vi codes and I find the latency also sometimes reaches to above 1000ns( even 2000ns ), and I don't know why. o(╥﹏╥)o

Sincerely, Han.

Roser-ZY avatar Nov 18 '21 06:11 Roser-ZY

By the way, if I want to sent packets with Stream Transfer, which transmit should I use? PIO? CTPIO? Will it affect the latency if I process events synchronously just like the way you showed?

DMA (ef_vi_transmit(), etc) prioritizes CPU usage over latency, so it's what you want if you're streaming a lot of packets. The other tools achieve their lower latency by making the CPU do more work.

See the efsend example app for how to fire out packets as fast as possible, interleaving the event polling with the sends.

I find that the high latency ( above 1000ns ) will appear even if I do nothing

We have a handy tool https://github.com/Xilinx-CNS/cns-sysjitter which will do more-or-less the same thing but records useful stats and stuff. I suggest trying it, and using instructions like https://access.redhat.com/sites/default/files/attachments/201501-perf-brief-low-latency-tuning-rhel7-v2.1.pdf to fine-tune your setup (you may not be running Red Hat, but the majority of the tips in there are generic).

and the program trapped into DEAD LOOP

I can't immediately see the bug here - you should add logging in feedback's default case for event types you don't expect - the NIC may be reporting an error from the send.

rhughes-xilinx avatar Nov 19 '21 10:11 rhughes-xilinx

Well, I don't understand why ef_vi_transmit_space() always gets 0 after I run the app for a while. I have handle the error event:

switch (EF_EVENT_TYPE(events[i])) {
				case EF_EVENT_TYPE_TX:
					n_completed = ef_vi_transmit_unbundle(&vi, &events[i], ids);
					if (n_completed > 0) {
						release_buffer(n_completed);
					}

					// Debug
					#ifdef DEBUG
					++msg_count;
					printf("\tSending %d message(s).\n\tThere are totally %d message(s) has been sent.\n\n", n_completed, msg_count);
					#endif
					
					break;
				case EF_EVENT_TYPE_TX_ERROR:
					n_error = ef_vi_transmit_unbundle(&vi, &events[i], ids);
					if (n_error > 0) {
						printf("Warning: There is(are) %d transmition(s) failed.\n");
					}
					break;
				default:break;

Is there are other events indicating something wrong when sending packets? I really needs help. :-( Han.

Roser-ZY avatar Nov 26 '21 09:11 Roser-ZY

I don't understand why ef_vi_transmit_space() always gets 0 after I run the app for a while

I don't have any good ideas, but are you determining that it's 'stuck' solely by if (tx_count == 300)? I think it's conceivable that the CPU can go round that loop 300 times between events. The NIC will batch tx completion events using a 15us timer, so 300 loop iterations is probably achievable (the unknown being the cost of that write to cout).

rhughes-xilinx avatar Nov 26 '21 10:11 rhughes-xilinx