verilog-ethernet Strange issue with VCU118

Hi,

thanks for this super nice project! I have a design that extends your VCU118 1g example. It has worked perfectly for over a year now (on a VCU118 board). Last month, I bought 5 new VCU118 boards. One of the boards runs perfectly fine in 100% of cases, but the other 4 boards never responded to my UDP packets. The 4 boards seem to be OK in general, since I tried a version of my design that uses UART instead of the Ethernet MAC and it works well (but is too slow). Then I wrote your VCU118 1g loopback example to the 4 boards and observed that 2 of the boards respond correctly in the first 2-3 seconds directly after writing the bitstream. After that, they stop responding. I never got an answer from the remaining 2 boards. I swapped cables and switch ports to rule out any problem with the physical connection. Moreover, my Linux kernel reports a running 1g connection with RX flow control for all boards when I connect them directly (without a switch). So the Ethernet connection is up and negotiated at 1g. All in all, this raises the following two questions:

1.) How is it even possible that the very same bitstream file shows such different behavior across 6 boards?

2.) The description of the open Ultrascale+ GTY Reset issue #64 reminds me somewhat of my problem. Can they be related? Could it be that I need a "push-button reset"? I have never tried that because, due to Corona, I cannot enter my office right now, so I do not have physical access to the devices (I can cold-restart them via a USB-controllable power supply, though).

Do you have any idea? Do you need any additional information? Thanks and best, Nico

Feb 17 '21 22:02 np84

That's perplexing. This will be completely unrelated to #64 as the interface to the PHY chip is done using IOSERDES/bitslice primitives and not with the GTY transceivers. On the VCU118, the Xilinx SGMII IP core is used to interface with the PHY, so I would recommend isolating the problem to my code (UDP stack, MAC, etc.) vs. Xilinx code (SGMII), and if it's Xilinx code then ask about it on the Xilinx forums. It's also possible there is something screwy going on with the PHY chip - can you check the part numbers on the PHY chip to see if maybe the boards have different revisions? I previously had a problem with a USB serial chip that refused to work on one particular board, turns out that the chip was a different silicon revision and something had been changed with respect to flow control.

Feb 20 '21 07:02 alexforencich

Checking the part numbers on the PHY chips is a good idea. I will do that next week. Do you have any suggestions on how to isolate the problem? I have no experience with things like that. A few words about the current setup: right now, the 5 new boards are connected via USB JTAG and USB UART to a host machine. Four boards (including the one that always works) are connected via an Ethernet switch to the host. One board is connected directly via Ethernet to the host. I saw some UART-related things in your loopback example. Do you have something like a "debug version" of the example that provides additional information over UART, which could help me track the problem down to either the SGMII core or your code? Thanks!

Feb 20 '21 11:02 np84

Maybe I have found something:

The PHY chip of the board that always works says:
DP83867IS TI96I ASL4 G4

The other 4 boards, which do not work, all say:
DP83867IS TI95I C1KQ G4

Any idea what to do with this information?

Feb 22 '21 09:02 np84

Interesting! Start with the errata from TI, maybe they documented a change somewhere. Hopefully the fix is as simple as poking a register on the PHY chip via MDIO, and the code to do that is already in the example design.

Feb 22 '21 10:02 alexforencich

Update: I have reported the problem in the TI forum and it seems that they are investigating the issue. However, they are starting to ask things like "is the connection confirmed on the PHY side by reading register 0x1?". Of course, I do not know this. Is there any easy way to get this kind of information, i.e., interact with the PHY over UART or JTAG? I can of course try to build my own design that reads out such things from the PHY and sends them back to the host via UART. But I would really love to not "re-invent" the wheel here. Don't you have a debug-bitstream or something similar that can do this and that you can share with me?
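For context on TI's question: register 0x1 is the standard Clause 22 Basic Mode Status Register (BMSR), and its link-status bit latches low, so it is usually read twice to get the current state. Below is a minimal Python sketch of decoding a raw 16-bit BMSR value. The function name and the sample values are illustrative only; they are not readings from the boards discussed here.

```python
def decode_bmsr(value):
    """Decode a few IEEE 802.3 Clause 22 BMSR (register 0x1) status bits."""
    return {
        "extended_capability": bool(value & (1 << 0)),
        "jabber_detect": bool(value & (1 << 1)),
        "link_up": bool(value & (1 << 2)),          # latches low: read twice
        "autoneg_ability": bool(value & (1 << 3)),
        "remote_fault": bool(value & (1 << 4)),
        "autoneg_complete": bool(value & (1 << 5)),
    }


if __name__ == "__main__":
    # Illustrative values only, not taken from the boards in this thread.
    for raw in (0x796D, 0x7949):
        status = decode_bmsr(raw)
        print(f"0x{raw:04X}: link_up={status['link_up']}, "
              f"autoneg_complete={status['autoneg_complete']}")
```

How the raw value gets from the PHY to the host (UART, JTAG, or something else) is a separate problem; this only covers interpreting the value once it has been read.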

Feb 24 '21 13:02 np84

I don't have a canned solution for that, unfortunately. Best I can suggest is perhaps to use https://github.com/alexforencich/xfcp and then use wb_mdio_master.zip to connect that to the PHY. I'm not sure if the MDIO master module in this repo is the same as the one used in wb_mdio_master, so I included it for reference. I will note that this was used on a 10G board and there are some addressing differences between 1G and 10G MDIO, so some adjustments may be necessary. With that setup, you should be able to poke at the PHY from Python via the USB UART on the board.
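The addressing difference mentioned above is presumably the one between Clause 22 MDIO (5-bit PHY address plus 5-bit register address, typical for 1G PHYs) and Clause 45 MDIO (port address, device address, and a separate address cycle, typical for 10G). As a rough sketch of the frame layouts after the preamble, here is a small Python helper; the function names are made up for illustration and do not correspond to anything in xfcp or wb_mdio_master.

```python
def c22_frame(op, phy_addr, reg_addr, data=0):
    """Build the 32 bits of a Clause 22 MDIO frame (without preamble).

    For reads, the turnaround and data fields are driven by the PHY,
    so they are left as zero here.
    """
    if op not in ("read", "write"):
        raise ValueError("op must be 'read' or 'write'")
    if not (0 <= phy_addr < 32 and 0 <= reg_addr < 32):
        raise ValueError("Clause 22 addresses are 5 bits wide")
    st = 0b01                                   # start of frame, Clause 22
    opcode = 0b10 if op == "read" else 0b01
    ta = 0b10 if op == "write" else 0b00
    payload = (data & 0xFFFF) if op == "write" else 0
    return ((st << 30) | (opcode << 28) | (phy_addr << 23)
            | (reg_addr << 18) | (ta << 16) | payload)


def c45_frame(op, port_addr, dev_addr, value=0):
    """Build the 32 bits of a Clause 45 MDIO frame (without preamble).

    Clause 45 uses ST=00 and a separate 'address' cycle that selects
    the 16-bit register address before a read or write cycle.
    """
    opcodes = {"address": 0b00, "write": 0b01, "read_inc": 0b10, "read": 0b11}
    if op not in opcodes:
        raise ValueError("unknown Clause 45 operation")
    if not (0 <= port_addr < 32 and 0 <= dev_addr < 32):
        raise ValueError("Clause 45 port/device addresses are 5 bits wide")
    host_drives = op in ("address", "write")
    ta = 0b10 if host_drives else 0b00
    payload = (value & 0xFFFF) if host_drives else 0
    return ((opcodes[op] << 28) | (port_addr << 23)
            | (dev_addr << 18) | (ta << 16) | payload)


if __name__ == "__main__":
    # PHY address 3 is just an example value here.
    print(hex(c22_frame("write", 3, 1, 0xABCD)))
    print(hex(c45_frame("address", 0, 1, 0x0001)))
```

An MDIO master written for a 10G board may generate the second layout, in which case the start and opcode fields (and the extra address cycle) are the parts that would need adjusting for a 1G PHY.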

Feb 25 '21 07:02 alexforencich

Ok, thanks! I will try my best to build something based on that code.

Feb 25 '21 11:02 np84

Hi Alex,

your xfcp code is really great, thanks! I started to build an xfcp_mod_mdio based on your xfcp_mod_wb. I managed to create a new node and connect it to your XFCP switch. Right now, I am implementing indirect access to the DP83867IS registers. It should then be possible to read and write the registers via your pretty nice python interface.

But before that, I want to run a small loopback test: I write the content of addr_reg into data_reg (variable names match those from your xfcp_mod_wb code). When I provide an even address on the command line of xfcp_ctrl.py, I get exactly this address back, e.g., "python3.8 ./xfcp_ctrl.py -p /dev/ttyUSB1 --read 4 2 1" returns "02". This is perfectly fine. However, when I access an odd address, e.g., "python3.8 ./xfcp_ctrl.py -p /dev/ttyUSB1 --read 4 3 1", I get "00" as an answer. When I read more bytes, say via "python3.8 ./xfcp_ctrl.py -p /dev/ttyUSB1 --read 4 3 4", it returns "00 05 00 07". I guess that I am doing something wrong, but I do not understand what exactly.

The "configuration" that I use is

parameter COUNT_SIZE = 16,
parameter DATA_WIDTH = 16,
parameter ADDR_WIDTH = 16

I played with different values of COUNT_SIZE but the result is the same. Any ideas? Thanks, Nico

Apr 13 '21 15:04 np84

That's what I would expect. Basically, the way it works is the address you specify is always a byte address. I did this differently before (word address instead of byte address) but the code to support that was needlessly complicated and it caused a number of headaches. Anyway, if the interface width (DATA_WIDTH) is larger than a byte, then address 0 is the LSB, address 1 is the next higher byte, etc. So with a 16 bit width, when you read even addresses, you get the low order byte of the data, and when you read an odd address, you get the high order byte. So with your setup if you read from address 0x0002, you get 0x00[02] -> 0x02, and if you read from address 0x0003, you get 0x[00]03 -> 0x00. Now, if the STRB signal does not imply byte granularity operations, then reads are not affected (it still reads assuming byte granularity) but writes are zero-padded. If you want a 16-bit-word-based interface, then set STRB_WIDTH = 1 and make sure you only access even addresses and read/write bytes 2 at a time, or use the word access wrappers with even addresses.

Try reading from, say, address 0x901 or something like that and see what you get.

Now, the only thing here that could potentially be regarded as a bug is that the low order address bits are not masked off when incrementing - you read 4 bytes starting from address 3, and it issued reads with address set to 3, 5, and 7 instead of 3, 4 and 6. I should probably fix that, but the low order address bits are usually going to be ignored by downstream logic anyway.
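The byte-lane behaviour described above can be reproduced with a small Python model. This is only a toy model of the loopback test (data_reg <= addr_reg) under the assumption that the bus address advances by one word each time the byte lane wraps, without masking the low-order bits; it is not the actual xfcp_mod_wb logic, and the function name is made up.

```python
def loopback_read(start_addr, count, data_width=16):
    """Model a byte-addressed read from a loopback node.

    Each bus word echoes the (unmasked) bus address, and the low-order
    bits of the byte address select which byte lane is returned.
    """
    bytes_per_word = data_width // 8
    addr = start_addr                      # address presented on the bus
    lane = start_addr % bytes_per_word     # byte lane within the word
    result = []
    for _ in range(count):
        word = addr & ((1 << data_width) - 1)   # loopback: data = address
        result.append((word >> (8 * lane)) & 0xFF)
        lane += 1
        if lane == bytes_per_word:
            lane = 0
            addr += bytes_per_word         # low-order bits are not masked
    return result


if __name__ == "__main__":
    for start, count in ((2, 1), (3, 1), (3, 4), (0x901, 1)):
        data = loopback_read(start, count)
        print(f"read {count} byte(s) at 0x{start:04X}:",
              " ".join(f"{b:02x}" for b in data))
```

With these assumptions the model returns 02 for address 2, 00 for address 3, and 00 05 00 07 for four bytes starting at address 3, matching the observations above. For address 0x901 it returns the high byte, 09.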

Apr 13 '21 21:04 alexforencich

I finally managed to get the code running and now, I can read and write all the registers of the TI PHY via xfcp! After going through all the registers on all five devices, I can see that all devices established a valid ethernet link. However, only device 1 (the one that actually works) reports "SGMII Auto-Negotiation process complete" (bit 0 of register 0x0037). The other 4 devices (those which do not respond to UDP packets) report "SGMII Auto-Negotiation process not complete".

If I interpret this correctly, this means that the SGMII connection between the PHY and the MAC could not be established. Any ideas how we can fix this? I see in the PHY docs that SGMII Auto-Negotiation can be disabled---can this help? Thanks and best, Nico
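For anyone repeating this check: register 0x0037 lies outside the directly addressable Clause 22 range (0x00 to 0x1F), so on the DP83867 it is reached indirectly through REGCR (0x0D) and ADDAR (0x0E) with device address 0x1F. Here is a minimal Python sketch of that sequence. The `mdio_read` and `mdio_write` callables are hypothetical placeholders for whatever transport is available (for example an xfcp-based MDIO node), and `FakePhy` exists only so the sketch can run without hardware.

```python
REGCR = 0x0D            # register control register
ADDAR = 0x0E            # address/data register
DEVAD = 0x1F            # device address used for DP83867 extended registers
SGMII_ANEG_STS = 0x0037


def read_extended(mdio_read, mdio_write, phy_addr, reg_addr):
    """Read an extended register via the REGCR/ADDAR indirect sequence."""
    mdio_write(phy_addr, REGCR, DEVAD)            # function: address
    mdio_write(phy_addr, ADDAR, reg_addr)         # select extended register
    mdio_write(phy_addr, REGCR, 0x4000 | DEVAD)   # function: data, no increment
    return mdio_read(phy_addr, ADDAR)


def sgmii_aneg_complete(value):
    """Bit 0 of register 0x0037, as reported in this thread."""
    return bool(value & 0x0001)


class FakePhy:
    """Stand-in for real hardware so the sketch runs on its own."""

    def __init__(self, extended):
        self.extended = dict(extended)
        self.regcr = 0
        self.ext_addr = 0

    def write(self, phy_addr, reg, value):
        if reg == REGCR:
            self.regcr = value
        elif reg == ADDAR:
            if self.regcr & 0xC000 == 0:
                self.ext_addr = value
            else:
                self.extended[self.ext_addr] = value

    def read(self, phy_addr, reg):
        if reg == ADDAR and self.regcr & 0xC000:
            return self.extended.get(self.ext_addr, 0)
        return 0


if __name__ == "__main__":
    phy = FakePhy({SGMII_ANEG_STS: 0x0001})   # made-up register content
    raw = read_extended(phy.read, phy.write, 3, SGMII_ANEG_STS)
    print("SGMII auto-negotiation complete:", sgmii_aneg_complete(raw))
```

The PHY address 3 and the register content in the demo are example values, not measurements.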

Apr 19 '21 23:04 np84

TBH, I don't know what to tell you here, besides either the PHY or the Xilinx SGMII core is not doing what it is supposed to do. Could be a bug in the PHY, a misconfigured register in the PHY, a bug in the Xilinx SGMII core, or something completely different. Also, just because the chips report all the same register values doesn't mean that the register values are correct, it's possible that one of the default values is incorrect and perhaps TI changed something wrt. how the chip implements that feature. I had to do some guesswork to figure out what values to write with the MDIO init code, it's entirely possible that one of those values is not correct, or that more registers need to be initialized.

I think the only thing that we have learned so far is that the DP83867 PHY chip is more terrible than I originally thought and Xilinx should really just stick to using Marvell PHY chips on all of their dev boards.

Apr 20 '21 07:04 alexforencich

Ok, thanks. I already played around with the registers but had no success yet. Let us see if the TI people can help me here. I will also try to get some debug information from the Xilinx SGMII core. Any hint how to do this?

Apr 20 '21 08:04 np84

I investigated the status vector of the Xilinx SGMII core and it reports RUDI(INVALID), RXDISPERR, and RXNOTINTABLE errors. I am starting to believe that those devices have some serious problems and that I will not be able to fix this. I'm still in contact with Xilinx and TI, but I do not expect that they can solve it.
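If the status vector is captured as a raw number (for example via an ILA or a debug register), a small decoder makes it easier to compare boards. The Python sketch below assumes the low eight bits are laid out as I recall them from PG047; please verify the bit positions against the product guide for your core version before relying on them.

```python
# Assumed layout of status_vector[7:0]; check against PG047 before use.
STATUS_BITS = {
    0: "Link Status",
    1: "Link Synchronization",
    2: "RUDI(/C/)",
    3: "RUDI(/I/)",
    4: "RUDI(INVALID)",
    5: "RXDISPERR",
    6: "RXNOTINTABLE",
    7: "PHY Link Status",
}


def decode_status_vector(value):
    """Return the names of all flags set in the low byte of the vector."""
    return [name for bit, name in sorted(STATUS_BITS.items())
            if value & (1 << bit)]


if __name__ == "__main__":
    # 0x70 is a made-up sample with only the three error flags set.
    print(decode_status_vector(0x70))
```

RXDISPERR and RXNOTINTABLE are 8b/10b decoding errors on the receive side, so seeing them together with RUDI(INVALID) suggests corrupted data on the SGMII receive path rather than a pure configuration problem.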

Apr 25 '21 22:04 np84

Hi @np84, I am running into a similar problem as you and have seen your posts on the TI forum. Did you ever end up reaching any sort of conclusion as to what caused the issue?

Jun 21 '21 18:06 erikdanie

Hi @erikdanie, there was a discussion with TI engineers via e-mail. They revealed that the marking on the chip is not a revision/version number but a production date, so according to TI, all chips must be identical. They agreed with me that something must be faulty, most likely the SGMII connection between PHY and FPGA. Their final recommendation was A-B-A swap testing, but I of course do not want to disassemble my VCU118 boards. My workaround is to use one QSFP28 port instead of RJ45. This works like a charm! :)

Jun 21 '21 18:06 np84

I had the same problem, and it was solved after adding these constraints for the LVDS input pins:

set_property DIFF_TERM_ADV TERM_100 [get_ports phy_sgmii_rx_p]
set_property DIFF_TERM_ADV TERM_100 [get_ports phy_sgmii_rx_n]
set_property DIFF_TERM_ADV TERM_100 [get_ports phy_sgmii_clk_p]
set_property DIFF_TERM_ADV TERM_100 [get_ports phy_sgmii_clk_n]

I think that without termination, reflections on the lines degrade the signal quality. There is no termination for the LVDS signals on the PCB. I hope this helps.

Sep 15 '21 11:09 jalilisahar

Thanks a lot, @jalilisahar! Setting these constraints solved the issue for me, too. I had the same strange behavior as described by @np84. I tested 6 VCU118 boards and 1 board didn't work. Btw., the Xilinx product guide PG047 (Eth PCS/PMA SGMII) recommends setting even more constraints on the LVDS pins. In my case, adding the termination properties was already enough.

Apr 29 '22 15:04 SHaBark

I'm so glad that it helped. @SHaBark Good Luck!

May 10 '22 04:05 jalilisahar

@jalilisahar I have been using @alexforencich's design combined with MATLAB AXI manager for a few months with two FPGAs. It had been working fine so far. Recently, I scaled up to 5 FPGAs and one had a frequent ping timeout issue. Around 2% of packets were lost. The other 4 were fine. Your solution worked like magic. Thanks a ton!

May 09 '24 05:05 navidaadit

I completely forgot about this; I'll take a look at all of the constraints files as it's likely both the VCU108 and VCU118 need this change.

May 09 '24 17:05 alexforencich

@navidaadit I'm happy to hear this. Good luck!

May 15 '24 01:05 jalilisahar

@jalilisahar @SHaBark @alexforencich I still face some issues with the Ethernet design on the VCU118. Out of my 5 FPGAs, sometimes one random FPGA pings slowly and often times out when I call it from MATLAB. Reprogramming the FPGA resolves this issue. It looks like this: if I get unlucky during programming, the problem appears intermittently and repeatedly; if I get lucky during programming, it never fails at all. It seems to depend entirely on how the initial programming goes. Any clue what can go wrong or what I can test? Thanks!

Aug 22 '24 11:08 navidaadit

@navidaadit When I used the VCU118, I had no problems after setting the LVDS constraints. I really don't know what could cause this on your boards, but there might be some hardware issues with them. I don't use this board anymore, but if I remember anything useful, I will share it with you.

Aug 27 '24 05:08 jalilisahar