PCIe issues on ADRV2CRR-FMC

Open smunaut opened this issue 3 years ago • 2 comments

Issue

Trying to bring up PCIe (gen3 4x and gen3 8x) on this board yielded some unexpected issues and it took some time to find a sequence that works.

I'm documenting here the observations, the theory about what I think the problems are and workarounds.

Test Setup

First, a description of the setup:

  • ADRV2CRR-FMC carrier with ADRV9009-ZU11EG plugged in
  • Asus H510-K motherboard with Intel i3-10105
    • PCIe1 is the 16x slot which is connected directly to the CPU
    • PCIe3 is the 1x slot which comes from the H510 chipset
  • The "PCIe extender" I refer to in some tests below uses a USB3 A-A cable abused to transport PCIe signals from a 1x pcie->usb3 stub inside the computer to an external PCIe 16x slot (mechanical 16x, electrically it's 1x).
  • The "external PCIe switch" I also refer to in some tests below is based on PI7C9X2G404. It has three 1x downstream slots and is connected via the same method as the extender above.

Initial Observation ( Feb 22 )

  • Card plugged in mobo PCIe1 (16x)

    • Only detected if the bitstream is loaded after the system has booted and a rescan is triggered
    • Not detected if booted with the bitstream already loaded
    • Doesn't actually work (trying to show the ID reports \xff....)
    • Link speed reported as 2.5GT/s (gen1)
    • Link width is correct
  • Card plugged in mobo PCIe3 (1x) via a 'usb cable extender'

    • Needs to limit link speed to gen2. In gen3, tons of errors are reported.
    • Card detected fine and ID report is correct, dma_test works -> It works !
    • Link speeds reported as 5GT/s (expected, limited on purpose in bios)
    • Link width is 1x (expected ... that slot is 1x only)
  • Card plugged in an external PCIe switch, the PCIe switch connected to mobo PCIe3

    • PCIe switch shows up correctly with expected link speed
    • Card behavior is the same as when it's plugged in PCIe1 directly (the mobo x16 slot): i.e. not working
    • (note: The PCIe presence jumper needs to be set to 1x or the pcie switch doesn't see the card)
  • Card plugged in an external PCIe switch, the PCIe switch connected to mobo PCIe1

    • Same behavior as when the switch is plugged into PCIe3
  • Card plugged in mobo PCIe1 (16x) via a 'usb cable extender' (limiting to 1x)

    • Same behavior as direct connection, except link width is 1x

Theories about problems ( Feb 23 )

Following more testing the next day, I think there are several separate problems, which is why the symptoms are weird and the results of the different cases make little sense.

  • If the card isn't detected at boot, the bios seems to not bother to configure the PCIe root port. So I have to manually write the LinkControl register in linux to set it up properly to get gen3 support and get it to train properly.

  • When doing a pcie rescan, even if it detects the device ... linux is dumb as a brick and will not configure the memory windows through the upstream switches/root ports, they remain [disabled]. The solution is to delete the pcie root port where the card is plugged in, and then do a rescan. When linux re-adds the bridge, it properly configures it for the downstream devices.

  • For some reason, once the PCIe core has trained once ... it cannot go through a reboot cycle, that will lock it up. So if I configure it before the machine is started, then boot, it's fine. Or if I boot with the card unconfigured and do it all once in linux, that works too. But if I get the card up and try a warm boot, it will never be seen ever again.
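
The remove-then-rescan workaround from the second point can be sketched as follows. This is a minimal sketch, not the exact commands used in these tests: it assumes the standard Linux sysfs PCI interface (`/sys/bus/pci/devices/<BDF>/remove` and `/sys/bus/pci/rescan`) and root privileges. The function name, the `sysfs` parameter (there only so it can be tried against a scratch directory) and the root port address in the usage note are all hypothetical.

```python
from pathlib import Path


def remove_and_rescan(root_port_bdf: str, sysfs: Path = Path("/sys")) -> None:
    """Remove a root port from the PCI tree, then rescan the bus.

    root_port_bdf is the full address of the bridge the card sits behind,
    e.g. "0000:00:01.0" (hypothetical, look it up with lspci -t).
    """
    remove = sysfs / "bus/pci/devices" / root_port_bdf / "remove"
    rescan = sysfs / "bus/pci/rescan"

    # Writing "1" to 'remove' detaches the bridge and everything below it.
    remove.write_text("1")

    # Writing "1" to the bus-level 'rescan' re-enumerates the tree, so the
    # bridge is re-added together with the devices found behind it.
    rescan.write_text("1")
```

Usage would be something like `remove_and_rescan("0000:00:01.0")` once the bitstream is loaded and the link has trained.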

smunaut avatar Feb 24 '22 09:02 smunaut

So I tried tracing the various states it goes through during several events:

Initial configuration ( with PC already booted )

  • ltssm=00 (Detect.Quiet) [ initial ]
  • ltssm=01 (Detect.Active)
  • ltssm=02 (Polling.Active)
  • ltssm=04 (Polling.Configuration)
  • ltssm=05 (Configuration.Linkwidth.Start)
  • ltssm=06 (Configuration.Linkwidth.Accept)
  • ltssm=08 (Configuration.Lanenum.Wait)
  • ltssm=07 (Configuration.Lanenum.Accept)
  • ltssm=09 (Configuration.Complete)
  • ltssm=0a (Configuration.Idle)
  • ltssm=10 (L0)
  • ltssm=0b (Recovery.RcvrLock)
  • ltssm=0d (Recovery.RcvrCfg)
  • ltssm=0c (Recovery.Speed)
  • ltssm=0b (Recovery.RcvrLock)
  • ltssm=28 (Recovery_Equalization_Phase0)
  • ltssm=29 (Recovery_Equalization_Phase1)
  • ltssm=2a (Recovery_Equalization_Phase2)
  • ltssm=2b (Recovery_Equalization_Phase3)
  • ltssm=0b (Recovery.RcvrLock)
  • ltssm=0d (Recovery.RcvrCfg)
  • ltssm=0e (Recovery.Idle)
  • ltssm=10 (L0)
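
For reference, a trace like the one above can be produced from raw polled values with a small decoder. This is only a sketch: the code-to-name table contains just the LTSSM codes that appear in this thread, and how the raw `ltssm` value is sampled from the core is not shown here, so the input list is assumed to come from whatever status readout is available. The function names are made up for illustration.

```python
# LTSSM codes as they appear in this thread (not a complete table).
LTSSM_STATES = {
    0x00: "Detect.Quiet",
    0x01: "Detect.Active",
    0x02: "Polling.Active",
    0x03: "Polling.Compliance",
    0x04: "Polling.Configuration",
    0x05: "Configuration.Linkwidth.Start",
    0x06: "Configuration.Linkwidth.Accept",
    0x07: "Configuration.Lanenum.Accept",
    0x08: "Configuration.Lanenum.Wait",
    0x09: "Configuration.Complete",
    0x0A: "Configuration.Idle",
    0x0B: "Recovery.RcvrLock",
    0x0C: "Recovery.Speed",
    0x0D: "Recovery.RcvrCfg",
    0x0E: "Recovery.Idle",
    0x10: "L0",
    0x20: "Disabled",
    0x28: "Recovery_Equalization_Phase0",
    0x29: "Recovery_Equalization_Phase1",
    0x2A: "Recovery_Equalization_Phase2",
    0x2B: "Recovery_Equalization_Phase3",
}


def decode(code):
    """Return the state name for a code, or a marker for unknown codes."""
    return LTSSM_STATES.get(code, f"Unknown(0x{code:02x})")


def transitions(samples):
    """Collapse repeated polled samples into the list of distinct transitions."""
    out = []
    for code in samples:
        if not out or out[-1] != code:
            out.append(code)
    return out


def format_trace(samples):
    """Render polled samples in the same 'ltssm=xx (Name)' style used here."""
    return [f"ltssm={c:02x} ({decode(c)})" for c in transitions(samples)]
```

Collapsing repeats matters because a polling loop will see the same state many times in a row; only the changes are interesting.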

Trigger retrain ( set retrain bit on the root port bridge )

  • ltssm=10 (L0) [ initial ]
  • ltssm=0b (Recovery.RcvrLock)
  • ltssm=0d (Recovery.RcvrCfg)
  • ltssm=0e (Recovery.Idle)
  • ltssm=10 (L0)

Disable link : ( set disable link bit on the root port bridge )

  • ltssm=10 (L0) [ Initial ]
  • ltssm=0b (Recovery.RcvrLock)
  • ltssm=0d (Recovery.RcvrCfg)
  • ltssm=0e (Recovery.Idle)
  • ltssm=20 (Disabled)

Enable link : ( clear disable link bit on the root port bridge )

  • ltssm=20 (Disabled) [ Initial ]
  • ltssm=00 (Detect.Quiet)
  • ltssm=01 (Detect.Active)
  • ltssm=02 (Polling.Active)
  • ltssm=04 (Polling.Configuration)
  • ltssm=05 (Configuration.Linkwidth.Start)
  • ltssm=06 (Configuration.Linkwidth.Accept)
  • ltssm=08 (Configuration.Lanenum.Wait)
  • ltssm=07 (Configuration.Lanenum.Accept)
  • ltssm=09 (Configuration.Complete)
  • ltssm=0a (Configuration.Idle)
  • ltssm=10 (L0)
  • ltssm=0b (Recovery.RcvrLock)
  • ltssm=0d (Recovery.RcvrCfg)
  • ltssm=0c (Recovery.Speed)
  • ltssm=0b (Recovery.RcvrLock)
  • ltssm=28 (Recovery_Equalization_Phase0)
  • ltssm=29 (Recovery_Equalization_Phase1)
  • ltssm=2a (Recovery_Equalization_Phase2)
  • ltssm=2b (Recovery_Equalization_Phase3)
  • ltssm=0b (Recovery.RcvrLock)
  • ltssm=0d (Recovery.RcvrCfg)
  • ltssm=0e (Recovery.Idle)
  • ltssm=10 (L0)
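
The retrain, disable and enable operations above all poke the Link Control register of the root port. A sketch of how the corresponding `setpci` invocations could be built is shown below. It assumes the standard PCIe layout (Link Control at offset 0x10 in the PCI Express capability, Link Disable at bit 4, Retrain Link at bit 5) and setpci's `value:mask` write syntax; the helper names and the bridge address in the usage note are hypothetical, and the exact commands used for these traces may have differed.

```python
from typing import List

LNKCTL_OFFSET = 0x10          # Link Control, offset inside the PCIe capability
LNKCTL_LINK_DISABLE = 1 << 4  # Link Disable bit
LNKCTL_RETRAIN_LINK = 1 << 5  # Retrain Link bit


def setpci_lnkctl(bdf: str, value: int, mask: int) -> List[str]:
    """Build a setpci argv that changes only the masked Link Control bits."""
    reg = f"CAP_EXP+{LNKCTL_OFFSET:x}.w={value:04x}:{mask:04x}"
    return ["setpci", "-s", bdf, reg]


def retrain(bdf: str) -> List[str]:
    """Set the Retrain Link bit on the bridge."""
    return setpci_lnkctl(bdf, LNKCTL_RETRAIN_LINK, LNKCTL_RETRAIN_LINK)


def disable_link(bdf: str) -> List[str]:
    """Set the Link Disable bit on the bridge."""
    return setpci_lnkctl(bdf, LNKCTL_LINK_DISABLE, LNKCTL_LINK_DISABLE)


def enable_link(bdf: str) -> List[str]:
    """Clear the Link Disable bit on the bridge."""
    return setpci_lnkctl(bdf, 0, LNKCTL_LINK_DISABLE)
```

The returned list can be passed to `subprocess.run(...)` as root, e.g. `subprocess.run(retrain("00:01.0"), check=True)` for a root port at the hypothetical address `00:01.0`. Using a mask keeps the other Link Control bits untouched.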

Reboot

This can end up in one of two scenarios :

  • ltssm=0b (Recovery.RcvrLock)
    • status=0 (Link Down), phy_down=0 (PHY Link Up), phy_status=3 (Link up, DL initialization completed), rate=2 (8.0 GT/s), width=2 (4-Lane link), ltssm=0b (Recovery.RcvrLock)

or :

  • ltssm=00 (Detect.Quiet)
    • status=0 (Link Down), phy_down=1 (PHY Link Down), phy_status=0 (No receivers detected), rate=0 (2.5 GT/s), width=0 (1-Lane link), ltssm=00 (Detect.Quiet)

In both cases, I can't get it to change state ever again, no matter what bits I try to poke on the root bridge (setting it to sleep, disabling/re-enabling link, request retrain, ...)
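
To compare the two stuck states programmatically, the status lines can be parsed into numeric fields. This is a rough sketch that assumes the `name=hexvalue (description)` text format shown above; the function names are invented for illustration, and "trained" is taken to mean `ltssm` equal to 0x10 (L0), as in the successful traces.

```python
import re

# Matches 'name=hexvalue' pairs; the parenthesised descriptions are ignored.
FIELD_RE = re.compile(r"(\w+)=([0-9a-fA-F]+)")


def parse_status(line):
    """Extract name=value pairs from a status line as a dict of ints."""
    return {name: int(value, 16) for name, value in FIELD_RE.findall(line)}


def in_l0(fields):
    """True when the parsed status reports the L0 state (ltssm=10)."""
    return fields.get("ltssm") == 0x10
```

Neither of the two post-reboot status lines satisfies `in_l0`, which matches the observation that the link never comes back.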

smunaut avatar Mar 01 '22 13:03 smunaut

Another interesting result: I tried commenting out https://github.com/enjoy-digital/litepcie/blob/master/litepcie/phy/usppciephy.py#L97

The idea is that I can imagine some of the logic wants to see a clock while reset is asserted.

And that seems to reliably allow the FPGA to be detected through a reboot !

It's not all perfect though, because when I do that, it seems I can no longer dynamically reload a bitstream once the machine is booted :/ The PCIe core then follows this sequence during a "dynamic" reload :

  • ltssm=01 (Detect.Active)
  • ltssm=02 (Polling.Active)
  • ltssm=03 (Polling.Compliance)

And I can't get it out of that state (even manually asserting pcie_rst_n on the core just makes it go through the same sequence). If I reboot the machine, it will train and work just fine though.

This might have something to do with the bios programming something differently depending on whether the card is detected at boot, which then prevents a dynamic reload.

smunaut avatar Mar 01 '22 14:03 smunaut

Closing, as I don't think the remaining weirdness is LitePCIe related and removing the clock gating on reset fixed most of the issues.

smunaut avatar Sep 27 '22 19:09 smunaut