obstacle-tower-env
GCP tutorial suggests using T4 GPU to save costs, but fails when using T4 GPU
Update: GCP tutorial suggests using T4 GPU to save costs, but fails when using T4 GPU (error below)
Hi, I am following the tutorial "Training an Obstacle Tower agent using Dopamine and the Google Cloud Platform".
I am getting the error below. I believe the relevant line is (EE) NVIDIA(GPU-0): UseDisplayDevice "None" is not supported with GRID, but I'm not sure of the root cause.
I was trying to use the T4 GPU to save $$. I will try again with the default GPU.
After typing:
sudo /usr/bin/X :0 &
export DISPLAY=:0
I get this error:
X.Org X Server 1.19.2
Release Date: 2017-03-02
X Protocol Version 11, Revision 0
Build Operating System: Linux 4.9.0-8-amd64 x86_64 Debian
Current Operating System: Linux tensorflow-1-vm 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64
Kernel command line: BOOT_IMAGE=/boot/vmlinuz-4.9.0-8-amd64 root=UUID=995b3d50-0ab0-4faa-8296-ab743ab0fde7 ro net.ifnames=0 biosdevname=0 console=ttyS0,38400n8 elevator=noop scsi_mod.use_blk_mq=Y
Build Date: 03 November 2018 03:09:11AM
xorg-server 2:1.19.2-1+deb9u5 (https://www.debian.org/support)
Current version of pixman: 0.34.0
Before reporting problems, check http://wiki.x.org
to make sure that you have the latest version.
Markers: (--) probed, (**) from config file, (==) default setting,
(++) from command line, (!!) notice, (II) informational,
(WW) warning, (EE) error, (NI) not implemented, (??) unknown.
(==) Log file: "/var/log/Xorg.0.log", Time: Thu Feb 14 01:06:15 2019
(==) Using config file: "/etc/X11/xorg.conf"
(==) Using system config directory "/usr/share/X11/xorg.conf.d"
(EE)
Fatal server error:
(EE) no screens found(EE)
Contents of /var/log/Xorg.0.log:
[ 385.871] (II) Module "ramdac" already built-in
[ 385.877] (**) NVIDIA(0): Depth 24, (--) framebuffer bpp 32
[ 385.877] (==) NVIDIA(0): RGB weight 888
[ 385.877] (==) NVIDIA(0): Default visual is TrueColor
[ 385.877] (==) NVIDIA(0): Using gamma correction (1.0, 1.0, 1.0)
[ 385.877] (**) NVIDIA(0): Option "UseDisplayDevice" "None"
[ 385.877] (**) NVIDIA(0): Enabling 2D acceleration
[ 385.877] (**) NVIDIA(0): Option "UseDisplayDevice" set to "none"; enabling NoScanout
[ 385.877] (**) NVIDIA(0): mode
[ 385.877] (II) Loading sub module "glxserver_nvidia"
[ 385.877] (II) LoadModule: "glxserver_nvidia"
[ 385.877] (II) Loading /usr/lib/xorg/modules/extensions/libglxserver_nvidia.so
[ 385.882] (II) Module glxserver_nvidia: vendor="NVIDIA Corporation"
[ 385.882] compiled for 4.0.2, module version = 1.0.0
[ 385.882] Module class: X.Org Server Extension
[ 385.882] (II) NVIDIA GLX Module 410.72 Wed Oct 17 20:11:21 CDT 2018
[ 386.482] (EE) NVIDIA(GPU-0): UseDisplayDevice "None" is not supported with GRID
[ 386.482] (EE) NVIDIA(GPU-0): displayless
[ 386.482] (EE) NVIDIA(GPU-0): Failed to select a display subsystem.
[ 386.563] (EE) NVIDIA(0): Failing initialization of X screen 0
[ 386.563] (II) UnloadModule: "nvidia"
[ 386.563] (II) UnloadSubModule: "glxserver_nvidia"
[ 386.563] (II) Unloading glxserver_nvidia
[ 386.563] (II) UnloadSubModule: "wfb"
[ 386.563] (II) UnloadSubModule: "fb"
[ 386.563] (EE) Screen(s) found, but none have a usable configuration.
[ 386.563] (EE)
Fatal server error:
[ 386.563] (EE) no screens found(EE)
[ 386.563] (EE)
Please consult the The X.Org Foundation support
at http://wiki.x.org
for help.
[ 386.563] (EE) Please also check the log file at "/var/log/Xorg.0.log" for additional information.
[ 386.563] (EE)
[ 386.564] (EE) Server terminated with error (1). Closing log file.
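The log above is long, and the lines that matter are the ones marked (EE). As a minimal sketch, a small helper can print only those lines. The function name xorg_errors is hypothetical; the default path is the log file named in the X server output above.

```shell
#!/bin/sh
# Print only the error-level (EE) lines from an X server log.
# xorg_errors is a hypothetical helper name, not part of any tool.
xorg_errors() {
    # Default to the log path reported by the X server output above.
    log="${1:-/var/log/Xorg.0.log}"
    # -F treats "(EE)" as a literal string rather than a regex group.
    grep -F '(EE)' "$log"
}

# Example: xorg_errors /var/log/Xorg.0.log
```

On the log in this thread, the first NVIDIA line this prints is the UseDisplayDevice "None" message, which appears to be what triggers the later "no screens found" failure.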
OK, the problem is with the T4 GPU: I've been able to get it running with the default GPU.
It would be good to figure this out, as the T4 is a third of the price.
@ervteng Do you know about using different GPUs in this scenario?
I've been able to use both T4 and P4 GPUs for training Unity environments (including Obstacle Tower). @Sohojoe do you have the /etc/X11/xorg.conf for the problematic machine?
here you go:
# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig: version 410.72
Section "ServerLayout"
Identifier "Layout0"
Screen 0 "Screen0"
InputDevice "Keyboard0" "CoreKeyboard"
InputDevice "Mouse0" "CorePointer"
EndSection
Section "Files"
EndSection
Section "InputDevice"
# generated from default
Identifier "Mouse0"
Driver "mouse"
Option "Protocol" "auto"
Option "Device" "/dev/psaux"
Option "Emulate3Buttons" "no"
Option "ZAxisMapping" "4 5"
EndSection
Section "InputDevice"
# generated from default
Identifier "Keyboard0"
Driver "kbd"
EndSection
Section "Monitor"
Identifier "Monitor0"
VendorName "Unknown"
ModelName "Unknown"
HorizSync 28.0 - 33.0
VertRefresh 43.0 - 72.0
Option "DPMS"
EndSection
Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BoardName "Tesla T4"
BusID "PCI:0:4:0"
EndSection
Section "Screen"
Identifier "Screen0"
Device "Device0"
Monitor "Monitor0"
DefaultDepth 24
Option "UseDisplayDevice" "None"
SubSection "Display"
Virtual 1280 1024
Depth 24
EndSubSection
EndSection
These are the options it gives me:
I've been getting the same error too. I am using a T4 and have completed all the previous steps. Here is my xorg.conf file:
# nvidia-xconfig: X configuration file generated by nvidia-xconfig
# nvidia-xconfig: version 410.72
Section "ServerLayout"
Identifier "Layout0"
Screen 0 "Screen0"
InputDevice "Keyboard0" "CoreKeyboard"
InputDevice "Mouse0" "CorePointer"
EndSection
Section "Files"
EndSection
Section "InputDevice"
# generated from default
Identifier "Mouse0"
Driver "mouse"
Option "Protocol" "auto"
Option "Device" "/dev/psaux"
Option "Emulate3Buttons" "no"
Option "ZAxisMapping" "4 5"
EndSection
Section "InputDevice"
# generated from default
Identifier "Keyboard0"
Driver "kbd"
EndSection
Section "Monitor"
Identifier "Monitor0"
VendorName "Unknown"
ModelName "Unknown"
HorizSync 28.0 - 33.0
VertRefresh 43.0 - 72.0
Option "DPMS"
EndSection
Section "Device"
Identifier "Device0"
Driver "nvidia"
VendorName "NVIDIA Corporation"
BoardName "Tesla T4"
BusID "0:4:0"
Option "AllowEmptyInitialConfiguration"
EndSection
Section "Screen"
Identifier "Screen0"
Device "Device0"
Monitor "Monitor0"
DefaultDepth 24
Option "UseDisplayDevice" "None"
SubSection "Display"
Virtual 1280 1024
Depth 24
EndSubSection
EndSection
Any suggestions on this? I have run into this issue as well.
I found a solution that works for me:
delete or comment out (with "#") the ServerLayout and Screen sections in the /etc/X11/xorg.conf file.
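The edit described above can be sketched as a sed command that prefixes every line of the two sections with "#". This is only an illustration, assuming the sections look like the nvidia-xconfig output posted earlier in this thread; the function name comment_out_sections is hypothetical, and a .bak copy of the file is kept next to the original.

```shell
#!/bin/sh
# Comment out the ServerLayout and Screen sections of an xorg.conf-style file.
# comment_out_sections is a hypothetical helper name.
comment_out_sections() {
    conf="$1"
    # Each range runs from the Section header to the next EndSection line.
    # "SubSection"/"EndSubSection" lines do not match the range boundaries,
    # so the nested Display subsection is commented out with its parent.
    # -i.bak edits in place and keeps the original as "$conf.bak".
    sed -i.bak \
        -e '/^[[:space:]]*Section[[:space:]]*"ServerLayout"/,/^[[:space:]]*EndSection/ s/^/#/' \
        -e '/^[[:space:]]*Section[[:space:]]*"Screen"/,/^[[:space:]]*EndSection/ s/^/#/' \
        "$conf"
}

# Example (from a root shell, since the file is usually root-owned):
#   comment_out_sections /etc/X11/xorg.conf
```

After the edit, X would be started again with the commands from the original post (sudo /usr/bin/X :0 & followed by export DISPLAY=:0).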
Same issue and same solution for a Tesla V100.
For me, removing only Option "UseDisplayDevice" "none" from the "Screen" section also does the trick.
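The narrower fix, touching only the UseDisplayDevice line, might look like the sketch below. It comments the option out rather than deleting it, so the change is easy to revert. The function name disable_use_display_device is hypothetical, and the pattern assumes the option is written on its own line as in the xorg.conf files posted above.

```shell
#!/bin/sh
# Comment out any Option "UseDisplayDevice" line in an xorg.conf-style file.
# disable_use_display_device is a hypothetical helper name.
disable_use_display_device() {
    conf="$1"
    # Keep the leading indentation (\1) and put "# " in front of the option (\2).
    # The value after the option name is not inspected, so "None" and "none"
    # are both covered. -i.bak keeps the original as "$conf.bak".
    sed -i.bak \
        's/^\([[:space:]]*\)\(Option[[:space:]]*"UseDisplayDevice".*\)$/\1# \2/' \
        "$conf"
}

# Example (from a root shell, since the file is usually root-owned):
#   disable_use_display_device /etc/X11/xorg.conf
```

This leaves the ServerLayout and Screen sections in place, which may be preferable if other settings in them (such as the Virtual 1280 1024 size) are still wanted.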
@zhenghongzhi @juge2 guys you've helped us so much! thank you!