uhd
X310 fails with "x300 fw poke32 - reply timed out"
Issue Description
During runtime we sometimes get the following reported in the console:
SSSSU[ERROR] [X300] 192.168.40.2: x300 fw communication failure #1
EnvironmentError: IOError: x300 fw poke32 - reply timed out
Afterwards, all calls to tx_stream->send() time out and no data is transmitted (send() returns 0 after 100 ms).
Setup Details
utils/uhd_usrp_probe --args addr=192.168.40.2
[INFO] [UHD] linux; GNU C++ version 10.3.1 20210422 (Red Hat 10.3.1-1); Boost_106600; UHD_4.2.0.HEAD-0-g46a70d85
[INFO] [X300] X300 initialization sequence...
[INFO] [X300] Maximum frame size: 8000 bytes.
[INFO] [X300] Radio 1x clock: 200 MHz
_____________________________________________________
/
| Device: X-Series Device
| _____________________________________________________
| /
| | Mboard: X310
| | revision: 11
| | revision_compat: 7
| | product: 30818
...
| | FW Version: 6.0
| | FPGA Version: 38.0
| | FPGA git hash: 8daa80c
| | RFNoC capable: Yes
Expected Behavior
X310 should not stop sending data, or should recover and start sending data again.
Actual Behaviour
The error is reported and sending data stops completely.
Steps to reproduce the problem
The issue can be reproduced using the "tx_waveforms" example and iperf sending data to the device.
- Run the tx_waveforms example
./examples/tx_waveforms --rate 10e6 --freq 1e6 --nsamps 100000000 --args="type=x300,addr=192.168.40.2"
- Send iperf data to the device
iperf -c 192.168.40.2 -u -b 1000m -t 1 -p 1234
- Observe that the application never exits (--nsamps is never reached since tx_stream->send() returns 0).
Additional Information
Using iperf is just a convenient way to reproduce an issue that we see sporadically during "normal" operation.
Edit: After testing, it became clear that the send() function returns 0 once its timeout period has expired.
Is there a sensible way to detect this, and then recover? I have tested with the following and it seems to work, but is it correct or is there a better option?
if(nr_send == 0)
{
    // Drop the stalled streamer, wait briefly, then request a new one.
    tx_stream.reset();
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    tx_stream = usrp->get_tx_stream(stream_args);
}
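A possible way to make the detection part less sensitive to a single slow call is to count consecutive zero returns before triggering the recovery. The sketch below is only an illustration: StallDetector and its threshold are hypothetical names that are not part of UHD, and it assumes send() is only called with a non-empty buffer (a zero-length send would also return 0).

```cpp
#include <cstddef>

// Hypothetical helper (NOT part of UHD): tracks how many send() calls in a
// row returned 0 samples and reports a stall once a threshold is reached.
class StallDetector {
public:
    explicit StallDetector(std::size_t max_consecutive_zero_sends)
        : limit_(max_consecutive_zero_sends) {}

    // Feed the return value of every send() call.
    // Returns true once `limit_` consecutive calls have returned 0.
    bool update(std::size_t samples_sent) {
        if (samples_sent > 0) {
            zero_count_ = 0; // data is flowing again, clear the counter
            return false;
        }
        ++zero_count_;
        return zero_count_ >= limit_;
    }

    std::size_t consecutive_zero_sends() const { return zero_count_; }

private:
    std::size_t limit_;
    std::size_t zero_count_ = 0;
};
```

In the send loop this could be used as `if (detector.update(nr_send)) { /* run the restart code */ }`. A threshold of 1 behaves like the plain `nr_send == 0` check above.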
You have a valid issue that I'd love to see fixed. In my experience, this error usually means something is wrong with the networking between the host computer and the X310. That said, if UHD could reset the USRP's networking as you describe (and I don't know whether that is good code or not), then streaming might be able to resume. Even so, check that the networking is robust: try a direct connection if there is a switch between the host computer and the USRP; try different cables (Ethernet, DAC, or fiber); try different adapters if you use Ethernet or fiber; try a different NIC on the host computer, or a different computer with a similar NIC. It's likely that one of these checks will turn up something that is not working correctly.
@michael-west @wordimont what do you think of this code change? Is there another way to reset the streaming to allow data to flow again when this issue happens?
I don't know if there's a better way to detect and recover, but I'm not super familiar with what options the API provides. I'm curious if we can reproduce this or if it really is just an unreliable connection like you suggested.
@CJCombrink how quickly does this occur when running tx_waveforms with iperf?
@wordimont It happens immediately after I run iperf.
Any update on this perhaps?
More findings:
If we call get_tx_stream() immediately after send() returns zero (as per the code in my previous comment), we get the following exception:
Error: EnvironmentError: IOError: Timed out getting recv buff for management transaction
For it to actually work, I need a delay between the time that send() returns zero and the time I call the restart code:
if(nr_send == 0)
{
    // Wait before touching the device; shorter waits trigger the exception.
    std::this_thread::sleep_for(std::chrono::milliseconds(1000));
    tx_stream.reset();
    std::this_thread::sleep_for(std::chrono::milliseconds(10));
    tx_stream = usrp->get_tx_stream(stream_args);
}
(Almost anything less than the 1-second sleep above results in the exception.) Edit: It appears that either of the two delays shown can be the 1-second one, and the reset will then work.
Running iperf the way you describe will most likely crash the ZPU (I think). That will shut down your device, and the "x300 fw poke32 - reply timed out" error is then the expected result.
Now, I realize that you are obviously not running iperf in normal operation, but I wonder if you have a network configuration that causes a lot of spurious traffic to slam into the X310. I'm not certain this is what's happening, or what such a network setup would look like, but there may be a difference between your setup and most other people's setups.