Performance issues in XEmacPs_BdRingFromHwTx
In our application on an Ultrascale+ RPU we have a task that should run every 100µs, but it was not triggered reliably. We traced this back to XEmacPs_BdRingFromHwTx taking well over 200µs.
We observed the problem when using large TX ringbuffers (size 1024 in our case). Lowering the BdLimit parameter did not change the timing, which led us to examine the function more closely.
The Issue
When fewer than BdLimit BDs have been consumed by the hardware, the loop may scan the whole ringbuffer, wasting many cycles.
Worst-case timings with ringbuffer size 1024 and BdLimit=64 on an Ultrascale+ RPU:
- without fix: ~250µs
- with proposed fix: ~20µs
The XEmacPs_BdRingFromHwTx function is called within a SYS_ARCH_PROTECT block, which disables all interrupts including the one interrupt triggering our task.
Detailed description
Looking at the code of the function (see below), the while loop has two exit conditions: 1) at most BdLimitLoc used BDs have been handled, and 2) RingPtr->HwTail has been reached.
For example, when no BD has been used, nothing stops the loop before it reaches RingPtr->HwTail. As far as I understand, used BDs are contiguous in the ringbuffer, so iterating over further BDs after the first unused one is pointless.
u32 XEmacPs_BdRingFromHwTx(XEmacPs_BdRing * RingPtr, u32 BdLimit,
                           XEmacPs_Bd ** BdSetPtr)
{
    XEmacPs_Bd *CurBdPtr;
    u32 BdStr = 0U;
    u32 BdCount;
    u32 BdPartialCount;
    u32 Sop = 0U;
    u32 Status;
    u32 BdLimitLoc = BdLimit;
    CurBdPtr = RingPtr->HwHead;
    BdCount = 0U;
    BdPartialCount = 0U;
    /* If no BDs in work group, then there's nothing to search */
    if (RingPtr->HwCnt == 0x00000000U) {
        *BdSetPtr = NULL;
        Status = 0U;
    } else {
        if (BdLimitLoc > RingPtr->HwCnt){
            BdLimitLoc = RingPtr->HwCnt;
        }
        /* Starting at HwHead, keep moving forward in the list until:
         *  - A BD is encountered with its new/used bit set which means
         *    hardware has not completed processing of that BD.
         *  - RingPtr->HwTail is reached and RingPtr->HwCnt is reached.
         *  - The number of requested BDs has been processed
         */
        while (BdCount < BdLimitLoc) {
            /* Read the status */
            if(CurBdPtr != NULL){
                BdStr = XEmacPs_BdRead(CurBdPtr,
                                       XEMACPS_BD_STAT_OFFSET);
            }
            if ((Sop == 0x00000000U) &&
                ((BdStr & XEMACPS_TXBUF_USED_MASK)!=0x00000000U)){
                Sop = 1U;
            }
            if (Sop == 0x00000001U) {
                BdCount++;
                BdPartialCount++;
            }
            /* hardware has processed this BD so check the "last"
             * bit. If it is clear, then there are more BDs for
             * the current packet. Keep a count of these partial
             * packet BDs.
             */
            if ((Sop == 0x00000001U) &&
                ((BdStr & XEMACPS_TXBUF_LAST_MASK)!=0x00000000U)) {
                Sop = 0U;
                BdPartialCount = 0U;
            }
            /* Move on to next BD in work group */
            CurBdPtr = XEmacPs_BdRingNext(RingPtr, CurBdPtr);
            /* Reached the end of the work group */
            if (CurBdPtr == RingPtr->HwTail) {
                break;
            }
        }
        /* Subtract off any partial packet BDs found */
        BdCount -= BdPartialCount;
        /* If BdCount is non-zero then BDs were found to return.
         * Set return parameters, update pointers and counters,
         * return success
         */
        if (BdCount > 0x00000000U) {
            *BdSetPtr = RingPtr->HwHead;
            RingPtr->HwCnt -= BdCount;
            RingPtr->PostCnt += BdCount;
            XEMACPS_RING_SEEKAHEAD(RingPtr, RingPtr->HwHead, BdCount);
            Status = (BdCount);
        } else {
            *BdSetPtr = NULL;
            Status = 0U;
        }
    }
    return Status;
}
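To make the cost of the loop above concrete, here is a minimal host-side sketch that models it on a plain array and counts status reads. It is not driver code: scan_reads, RING_SIZE and the two mask constants are hypothetical stand-ins, and HwTail is assumed to sit on the last BD handed to hardware.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 1024u        /* ring size from the report */
#define USED_MASK 0x80000000u  /* stand-in for XEMACPS_TXBUF_USED_MASK */
#define LAST_MASK 0x00008000u  /* stand-in for XEMACPS_TXBUF_LAST_MASK */

/* Hypothetical model of the unpatched scan. status[] holds one status word
 * per BD, head is the index of HwHead, hw_cnt the number of BDs owned by
 * hardware. Returns the number of status reads; *found gets the BD count. */
static uint32_t scan_reads(const uint32_t *status, uint32_t head,
                           uint32_t hw_cnt, uint32_t bd_limit,
                           uint32_t *found)
{
    assert(head < RING_SIZE && hw_cnt > 0u && hw_cnt <= RING_SIZE);

    uint32_t limit = (bd_limit > hw_cnt) ? hw_cnt : bd_limit;
    /* Assumption: HwTail is the last BD of the work group. */
    uint32_t tail = (head + hw_cnt - 1u) % RING_SIZE;
    uint32_t cur = head, count = 0u, partial = 0u, sop = 0u, reads = 0u;

    while (count < limit) {
        uint32_t st = status[cur];
        reads++;
        if (sop == 0u && (st & USED_MASK) != 0u) {
            sop = 1u;               /* start of a completed frame */
        }
        if (sop == 1u) {
            count++;
            partial++;
        }
        if (sop == 1u && (st & LAST_MASK) != 0u) {
            sop = 0u;               /* frame complete */
            partial = 0u;
        }
        cur = (cur + 1u) % RING_SIZE;   /* models XEmacPs_BdRingNext */
        if (cur == tail) {
            break;                      /* only exit besides the limit */
        }
    }
    *found = count - partial;
    return reads;
}
```

In this model, a ring with 512 outstanding BDs of which none has completed costs 511 status reads with BdLimit=64, and the same 511 reads are spent when only the first three BDs have completed. With a single outstanding, uncompleted BD the loop walks all 1024 entries before it meets the tail again.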
Proposed fix
Add an additional break condition to the loop that triggers when the first unused BD is encountered. As the used BDs should be contiguous in the ringbuffer, I see no problem with aborting the loop early.
// code cut for clarity
        while (BdCount < BdLimitLoc) {
            /* Read the status */
            if(CurBdPtr != NULL){
                BdStr = XEmacPs_BdRead(CurBdPtr,
                                       XEMACPS_BD_STAT_OFFSET);
            }
            if ((BdStr & XEMACPS_TXBUF_USED_MASK)==0x00000000U){ // <- add additional break condition
                break;                                           // <- add additional break condition
            }                                                    // <- add additional break condition
            if ((Sop == 0x00000000U) &&
                ((BdStr & XEMACPS_TXBUF_USED_MASK)!=0x00000000U)){
                Sop = 1U;
            }
// code cut for clarity
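One point that may be worth checking before adopting the break as written: the Sop logic in the original function suggests that, for a frame spanning several BDs, only the first BD carries the used bit. If that assumption holds (it should be verified against the hardware documentation), an unconditional break would stop inside such a frame. The host-side sketch below models both variants on a plain array; scan_reads_early, guard_sop and the mask constants are hypothetical names, not driver API.

```c
#include <assert.h>
#include <stdint.h>

#define RING_SIZE 1024u        /* ring size from the report */
#define USED_MASK 0x80000000u  /* stand-in for XEMACPS_TXBUF_USED_MASK */
#define LAST_MASK 0x00008000u  /* stand-in for XEMACPS_TXBUF_LAST_MASK */

/* Hypothetical model of the scan with an early break on an unused BD.
 * guard_sop == 0: break on any BD with the used bit clear (as proposed).
 * guard_sop != 0: break only while no frame is in progress (Sop == 0).
 * Returns the number of status reads; *found gets the BD count. */
static uint32_t scan_reads_early(const uint32_t *status, uint32_t head,
                                 uint32_t hw_cnt, uint32_t bd_limit,
                                 int guard_sop, uint32_t *found)
{
    assert(head < RING_SIZE && hw_cnt > 0u && hw_cnt <= RING_SIZE);

    uint32_t limit = (bd_limit > hw_cnt) ? hw_cnt : bd_limit;
    /* Assumption: HwTail is the last BD of the work group. */
    uint32_t tail = (head + hw_cnt - 1u) % RING_SIZE;
    uint32_t cur = head, count = 0u, partial = 0u, sop = 0u, reads = 0u;

    while (count < limit) {
        uint32_t st = status[cur];
        reads++;
        if ((st & USED_MASK) == 0u && (!guard_sop || sop == 0u)) {
            break;                  /* first BD not yet completed */
        }
        if (sop == 0u && (st & USED_MASK) != 0u) {
            sop = 1u;
        }
        if (sop == 1u) {
            count++;
            partial++;
        }
        if (sop == 1u && (st & LAST_MASK) != 0u) {
            sop = 0u;
            partial = 0u;
        }
        cur = (cur + 1u) % RING_SIZE;
        if (cur == tail) {
            break;
        }
    }
    *found = count - partial;
    return reads;
}
```

For single-BD frames both variants behave the same in this model: with three completed BDs out of 512 outstanding, the scan stops after 4 reads and reports 3 BDs. For a completed three-BD frame whose used bit is set only on the first BD, the guarded variant still reports 3 BDs after 4 reads, while the unguarded one stops at the second BD and reports none.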
Thanks for reporting this. We'll reproduce the issue and look into it.