
Docker Device Passthrough - map pfn RAM range req uncached-minus

Open MattW2 opened this issue 3 years ago • 16 comments

Trying to get a Coral Mini PCIe running on an Unraid Server.

  • Intel Xeon E3-1246 v3 CPU
  • Unraid 6.9.1 Host - Linux Kernel 5.10.21
  • Mini PCIe to PCIe adapter I am using: Ableconn PEX-MP117 Mini PCI-E to PCI-E Adapter Card
  • Card appears to be correctly identified by the host as a Coral Edge TPU, in its own IOMMU group
  • From the terminal, running the command below returns the device node, suggesting the card is correctly installed:
root@Tower:~# ls /dev/apex_0
/dev/apex_0
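
A slightly stronger check than `ls` is to confirm the node is actually a character device, which is what the apex driver creates for the TPU. A minimal sketch; `check_apex` is a hypothetical helper name, and `/dev/apex_0` is the driver's default node:

```shell
#!/bin/sh
# check_apex: report whether a device node exists and is a character
# device (the apex driver exposes the TPU as a char device).
check_apex() {
  node="${1:-/dev/apex_0}"
  if [ -c "$node" ]; then
    echo "present"
  else
    echo "missing"
  fi
}

check_apex /dev/apex_0
```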

I'm passing the card (/dev/apex_0) to a Docker container (Frigate). I see the errors below in my system log, and the Coral doesn't appear to be passed through correctly:

Mar 24 12:52:10 Tower kernel: x86/PAT: frigate.detecto:29004 map pfn RAM range req uncached-minus for [mem 0x6f1c4c000-0x6f1c4ffff], got write-back
Mar 24 12:53:13 Tower kernel: apex 0000:06:00.0: RAM did not enable within timeout (12000 ms)
Mar 24 12:53:25 Tower kernel: apex 0000:06:00.0: RAM did not enable within timeout (12000 ms)
Mar 24 12:53:25 Tower kernel: apex 0000:06:00.0: Error in device open cb: -110

Any idea what may be going on here and what next steps I may take to get this working?

MattW2 avatar Mar 25 '21 13:03 MattW2

@MattW2 Although this is possible, please keep in mind that we don't support VMs. A couple of things I would check first:

  1. Can it be used on the host machine outside of the Docker container?
  2. When you run it, do you pass the device through? Something like this:
--device /dev/apex_0:/dev/apex_0

Namburger avatar Mar 25 '21 21:03 Namburger

> @MattW2 Although this is possible, please keep in mind that we don't support VMs. A couple of things I would check first:
>
>   1. Can it be used on the host machine outside of the Docker container?
>   2. When you run it, do you pass the device through? Something like this:
> --device /dev/apex_0:/dev/apex_0

Thank you for the advice. No luck yet getting this to work, but here is what I have tried.

To see if another application or the host itself is using the device, I ran lsof | grep /dev/apex_0 and didn't find anything.

Yes, I am passing the device using exactly that docker line; here is my docker command. Ignore all the references to Blue Iris, I'm just using that as a shared location to store video files.

/usr/local/emhttp/plugins/dynamix.docker.manager/scripts/docker create \
--name='frigate' \
--net='br0' \
--ip='192.168.1.30' \
--cpuset-cpus='2,3' \
--privileged=true \
-e TZ="America/New_York" \
-e HOST_OS="Unraid" \
-e 'TCP_PORT_5000'='5000' \
-e 'TCP_PORT_1935'='1935' \
-e 'FRIGATE_RTSP_PASSWORD'='enterpassword' \
-v '/mnt/user/appdata/frigate/config/':'/config':'rw' \
-v '/mnt/user/Blue Iris/frigate_video/clips/':'/media/frigate/clips':'rw' \
-v '/mnt/user/Blue Iris/frigate_video/recordings/':'/media/frigate/recordings':'rw' \
-v '/mnt/user/Blue Iris/frigate_video/clips/':'/clips':'rw' \
-v '/etc/localtime':'/etc/localtime':'rw' \
--device='/dev/dri/renderD128' \
--device /dev/apex_0:/dev/apex_0 \
--shm-size=1024m \
--mount type=tmpfs,destination=/tmp/cache,tmpfs-size=1000000000 \
'blakeblackshear/frigate:stable-amd64'

I've tried a few variants of passing the device, including --device='/dev/apex_0' and --device=/dev/apex_0, all with the same result. I get this error in the host system log:

Mar 28 13:28:31 Tower kernel: eth0: renamed from veth7219499
Mar 28 13:28:45 Tower kernel: apex 0000:03:00.0: RAM did not enable within timeout (12000 ms)
Mar 28 13:28:45 Tower kernel: apex 0000:03:00.0: Error in device open cb: -110

The eth0 message is new and makes me wonder if there is a conflict with my Ethernet card. Let's see if removing it gets me anywhere.

Update: Removed the Ethernet card, no difference. Ended up moving the Coral m-PCIe card with adapter to another box (Windows) and it works perfectly. So now I'm just about out of ideas and may wait for the USB version to be available again and give that a try.

MattW2 avatar Mar 28 '21 17:03 MattW2

@Namburger , any advice on how I might identify the application that could be using the device on the host?

MattW2 avatar Mar 28 '21 23:03 MattW2

I am experiencing this exact same issue. I am running mine in Docker on Ubuntu inside Proxmox (PCIe passthrough is working and everything). Docker for some reason does not like it.

HeedfulCrayon avatar May 15 '21 09:05 HeedfulCrayon

I have the same issue, but it had been working for some weeks. I had to reboot after a system crash (maybe caused by the device?) and then it stopped working. The Coral PCIe device seems to be installed properly and is recognized by the system as apex_0, but no luck making it work again. No updates were applied before the system crash and reboot.

koldogut avatar Jul 13 '21 22:07 koldogut

Having the same issue as @koldogut. Everything was working fine in the docker container for a bit, then after a restart of the VM it wouldn't work anymore.

bowen-song avatar Jan 21 '22 01:01 bowen-song

In case anyone is looking for the fix, here's what worked for me on my machine.

On Proxmox, in /etc/default/grub, make sure you include pcie_aspm=off in the default boot line: GRUB_CMDLINE_LINUX_DEFAULT="pcie_aspm=off"

This seems to stop Proxmox from letting the PCI device enter a sleep state. After editing the file, make sure to run update-grub.

I've also included the same pcie_aspm=off in my Ubuntu VM, but I'm not sure if this was necessary.

Rebooted proxmox + VM and things loaded in as expected.
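
To confirm the kernel actually honored the flag after rebooting, the ASPM state can be read back from `lspci -vv` output. A minimal sketch; `aspm_state` is a hypothetical helper, and the "ASPM Disabled" / "ASPM L1 Enabled" wording is the pciutils LnkCtl line format:

```shell
#!/bin/sh
# aspm_state: pull the ASPM enable/disable state out of `lspci -vv`
# output fed on stdin (the LnkCtl line, e.g. "LnkCtl: ASPM Disabled; ...").
aspm_state() {
  grep -o 'ASPM [A-Za-z0-9 ]*abled' | head -n 1
}

# Example (root needed for full LnkCtl output; 06:00.0 is a placeholder
# address, use the one lspci reports for your Coral):
#   sudo lspci -vv -s 06:00.0 | aspm_state
```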

bowen-song avatar Jan 21 '22 03:01 bowen-song

@bowen-song I have that in my setup as well, and it mostly works, but there is an occasional reboot that causes the TPU to no longer work. Normally just rebooting the VM will get it working again, but it is still frustrating.

HeedfulCrayon avatar Jan 21 '22 16:01 HeedfulCrayon

> In case anyone is looking for the fix, here's what worked for me on my machine.
>
> On Proxmox, in /etc/default/grub, make sure you include pcie_aspm=off in the default boot line: GRUB_CMDLINE_LINUX_DEFAULT="pcie_aspm=off"
>
> This seems to stop Proxmox from letting the PCI device enter a sleep state. After editing the file, make sure to run update-grub.
>
> I've also included the same pcie_aspm=off in my Ubuntu VM, but I'm not sure if this was necessary.
>
> Rebooted proxmox + VM and things loaded in as expected.

Thank you! This seems to have solved it for me as well :D

Stockhauz avatar Apr 20 '23 21:04 Stockhauz

> In case anyone is looking for the fix, here's what worked for me on my machine.
>
> On Proxmox, in /etc/default/grub, make sure you include pcie_aspm=off in the default boot line: GRUB_CMDLINE_LINUX_DEFAULT="pcie_aspm=off"
>
> This seems to stop Proxmox from letting the PCI device enter a sleep state. After editing the file, make sure to run update-grub.
>
> I've also included the same pcie_aspm=off in my Ubuntu VM, but I'm not sure if this was necessary.
>
> Rebooted proxmox + VM and things loaded in as expected.

Thanks a lot, this did the trick. I only added this on the Proxmox host, not in the VM.

schmiegelt avatar Jun 05 '23 15:06 schmiegelt

I'm running into a similar situation. The Edge TPU was working great under Unraid for a year or so, but I recently switched to Ubuntu 23; now things work fine for a while (like 4-12 hours), and then suddenly I get these every time Frigate attempts to connect to the TPU (or maybe every time Docker tries to map the TPU; it's unclear which):

apex 0000:01:00.0: Error in device open cb: -110
apex 0000:01:00.0: RAM did not enable within timeout (12000 ms)

I've confirmed that ASPM is disabled for the Coral card. Running through these steps does seem to resolve the problem, at least temporarily:

echo 1 | sudo tee /sys/bus/pci/devices/[deviceid]/remove
sleep 1
echo 1 | sudo tee /sys/bus/pci/rescan

So it seems like maybe it's getting into some sort of unrecoverable state and needs to be reset/power cycled. Though after the reset I'm also seeing just one of these in the logs:

x86/PAT: frigate.detecto:664214 map pfn RAM range req uncached-minus for [mem 0x112964000-0x112967fff], got write-back

but then everything seems to be okay after that. 🤷‍♂️
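
The remove/rescan steps above can be wrapped in a small guard-railed helper. A sketch only: `reset_apex` is a hypothetical name, and the address 0000:01:00.0 is just an example; use the full address `lspci -D` reports for your TPU.

```shell
#!/bin/sh
# reset_apex: recover a wedged TPU by removing its PCI device and
# rescanning the bus (the remove/rescan steps above). Run as root;
# this briefly takes the device away from anything using it.
reset_apex() {
  addr="$1"
  dev="/sys/bus/pci/devices/$addr"
  if [ ! -e "$dev" ]; then
    echo "no such PCI device: $addr" >&2
    return 1
  fi
  echo 1 > "$dev/remove"
  sleep 1
  echo 1 > /sys/bus/pci/rescan
}

# Example: reset_apex 0000:01:00.0
```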

Sammy1Am avatar Jul 29 '23 19:07 Sammy1Am

Same issue. It had been working, and suddenly it's not enabling RAM on the Coral PCIe device.

txwireless avatar Jul 30 '23 03:07 txwireless

I had the same error message, and it was a reproducible issue for me related to temperature. I had been using the PCIe TPU reliably with a single core. Once I added the Dual TPU B+M adapter key and started leveraging both cores, temperatures would climb to 100 °C and then the errors would start.

roycamp avatar Sep 08 '23 16:09 roycamp

I am seeing the same issue; it might work fine for days and then suddenly I get:

[18087.152851] apex 0000:06:00.0: RAM did not enable within timeout (12000 ms)
[18087.159873] apex 0000:06:00.0: Error in device open cb: -110

After this, the only way to recover is a power cycle or a PCI remove/rescan.

This is on Proxmox/Debian 12.

Any idea how to stop this from happening? It is very annoying.
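
Short of fixing the root cause, one workaround is to automate the remove/rescan recovery mentioned above. A minimal detection helper as a sketch: `find_failed_apex` is a hypothetical name, the log format is copied from the messages quoted in this thread, and wiring the output to cron and the /sys/bus/pci remove/rescan is left to taste.

```shell
#!/bin/sh
# find_failed_apex: scan kernel log text on stdin and print the PCI
# address of any apex device that hit the "RAM did not enable" failure,
# so a periodic job can feed the address to a remove + rescan recovery.
find_failed_apex() {
  sed -n 's/.*apex \([0-9a-fA-F:.]*\): RAM did not enable.*/\1/p' | sort -u
}

# Example: dmesg | find_failed_apex
```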

s0129 avatar Nov 01 '23 22:11 s0129