X310 fails with "x300 fw poke32 - reply timed out"

Open CJCombrink opened this issue 1 year ago • 8 comments

Issue Description

During runtime we sometimes get the following reported in the console:

SSSSU[ERROR] [X300] 192.168.40.2: x300 fw communication failure #1
EnvironmentError: IOError: x300 fw poke32 - reply timed out

Afterwards, all calls to tx_stream->send() time out and no data is transmitted (the send function returns 0 after 100 ms).

Setup Details

utils/uhd_usrp_probe --args addr=192.168.40.2
[INFO] [UHD] linux; GNU C++ version 10.3.1 20210422 (Red Hat 10.3.1-1); Boost_106600; UHD_4.2.0.HEAD-0-g46a70d85
[INFO] [X300] X300 initialization sequence...
[INFO] [X300] Maximum frame size: 8000 bytes.
[INFO] [X300] Radio 1x clock: 200 MHz
  _____________________________________________________
 /
|       Device: X-Series Device
|     _____________________________________________________
|    /
|   |       Mboard: X310
|   |   revision: 11
|   |   revision_compat: 7
|   |   product: 30818
...
|   |   FW Version: 6.0
|   |   FPGA Version: 38.0
|   |   FPGA git hash: 8daa80c
|   |   RFNoC capable: Yes

Expected Behavior

X310 should not stop sending data, or should recover and start sending data again.

Actual Behaviour

The error is reported and sending data stops completely.

Steps to reproduce the problem

The issue can be reproduced using the "tx_waveforms" example and iperf sending data to the device.

  1. Run the tx_waveforms example
    ./examples/tx_waveforms  --rate 10e6 --freq 1e6 --nsamps 100000000 --args="type=x300,addr=192.168.40.2"
    
  2. Send iperf data to the device
    iperf -c 192.168.40.2 -u -b 1000m -t 1 -p 1234
    
  3. Observe that the application never exits (--nsamps is never reached since tx_stream->send() returns 0).

Additional Information

Using iperf is just a convenient way to reproduce an issue that we see sporadically during "normal" operation.

Edit: After testing, it became clear that the send() function returns 0 once its timeout period has expired.
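One way to turn the "send() returns 0" symptom into an explicit signal is to count consecutive zero-sample returns. The sketch below is only an illustration: `StallDetector` and its threshold are hypothetical names that are not part of UHD, and it assumes that a single zero return can be a transient timeout while several in a row indicate the stalled state described above.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper (not part of UHD): tracks how many send() calls in a
// row returned 0 samples and reports a stall once a threshold is reached.
class StallDetector {
public:
    explicit StallDetector(std::size_t threshold) : threshold_(threshold) {
        // A threshold of 0 would report a stall on every call.
        assert(threshold > 0);
    }

    // Feed in the return value of each send() call.
    // Returns true when the stream should be considered stalled.
    bool update(std::size_t samples_sent) {
        if (samples_sent > 0) {
            consecutive_zero_ = 0;  // progress was made, reset the count
        } else {
            ++consecutive_zero_;    // another timed-out send()
        }
        return consecutive_zero_ >= threshold_;
    }

private:
    std::size_t threshold_;
    std::size_t consecutive_zero_ = 0;
};
```

In the streaming loop this would be fed the value returned by tx_stream->send(), for example `if (detector.update(nr_send)) { /* attempt recovery */ }`. The right threshold depends on the send timeout in use and is left as a tuning choice.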

CJCombrink avatar Jul 14 '22 13:07 CJCombrink

Is there a sensible way to detect this, and then recover? I have tested with the following and it seems to work, but is it correct or is there a better option?

if(nr_send == 0)
{
    tx_stream.reset();
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    tx_stream = usrp->get_tx_stream(stream_args);
}

CJCombrink avatar Jul 14 '22 15:07 CJCombrink

You have a valid issue that I'd love to see get fixed. In my experience, the issue more broadly means there's something going on with networking between the host computer and the X310. That said, if UHD could reset the USRP's networking as you note -- and I don't know if that is good code or not -- then streaming might be able to resume. -That- said, check the networking to make sure it is robust: try a direct connection if you're using a switch between the host computer and the USRP; try different cables -- ENET or DAC or fiber; try different adapters if ENET or fiber. Try a different NIC on the host computer, or a different computer with a similar NIC. It's likely that one of these checks will turn up something that is not working correctly.

michaelld avatar Jul 18 '22 18:07 michaelld

@michael-west @wordimont what do you think of this code change? Is there another way to reset the streaming to allow data to flow again when this issue happens?

michaelld avatar Jul 18 '22 19:07 michaelld

I don't know if there's a better way to detect and recover, but I'm not super familiar with what options the API provides. I'm curious if we can reproduce this or if it really is just an unreliable connection like you suggested.

@CJCombrink how quickly does this occur when running tx_waveforms with iperf?

wordimont avatar Jul 18 '22 19:07 wordimont

@wordimont It happens immediately after I run iperf.

CJCombrink avatar Jul 19 '22 05:07 CJCombrink

Any update on this perhaps?

CJCombrink avatar Aug 01 '22 11:08 CJCombrink

More findings: If we call get_tx_stream immediately after send() returns zero we get the following exception:

Error: EnvironmentError: IOError: Timed out getting recv buff for management transaction

(as per the code in my previous comment)

For it to actually work, I need a delay between the time that send() returns zero and the time I call the restart code:

if(nr_send == 0)
{
    std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    tx_stream.reset();
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    tx_stream = usrp->get_tx_stream(stream_args);
}

(Almost anything less than the 1-second sleep above results in the exception.) Edit: It appears that either one of the two delays shown can be the 1-second one, and the reset will then work.
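Since the required delay was found by trial and error, a slightly more defensive variant is to retry the stream re-creation a few times instead of relying on one fixed sleep. This is a minimal sketch under assumptions: `recreate_with_retry` is a hypothetical helper that is not part of UHD, the re-creation step is passed in as a callable, and it assumes the IOError shown above reaches the caller as an exception derived from std::exception.

```cpp
#include <cassert>
#include <chrono>
#include <exception>
#include <functional>
#include <thread>

// Hypothetical helper (not part of UHD). Sleeps for `settle`, then calls
// `recreate`; if `recreate` throws, it sleeps and tries again, up to
// `max_attempts` calls in total. Returns true once `recreate` completes
// without throwing, false if every attempt failed.
inline bool recreate_with_retry(const std::function<void()>& recreate,
                                std::chrono::milliseconds settle,
                                int max_attempts)
{
    assert(max_attempts > 0);
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
        // Give the device time to settle before touching the stream.
        std::this_thread::sleep_for(settle);
        try {
            recreate();
            return true;
        } catch (const std::exception&) {
            // Assumed to cover errors such as "Timed out getting recv buff
            // for management transaction"; wait and try again.
        }
    }
    return false;
}
```

With the code from this comment, the callable would be something like `[&] { tx_stream.reset(); tx_stream = usrp->get_tx_stream(stream_args); }` with a settle time of around 1 second. Whether this is the intended way to recover is still the open question in this issue.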

CJCombrink avatar Aug 01 '22 14:08 CJCombrink

Running iperf the way you describe will most likely crash the ZPU (I think). That will shut down your device, and "x300 fw poke32 - reply timed out" is then the expected result.

Now I realize that you are obviously not running iperf in normal operation, but I wonder if you have a network configuration that causes a lot of spurious traffic to slam into the X310. I'm not certain this is what's happening, or what such a network setup would look like, but there may be a difference between your setup and most other people's setup.

mbr0wn avatar Dec 12 '23 18:12 mbr0wn