neural-style-docker

Remote server install failure

Open wboykinm opened this issue 6 years ago • 2 comments

Following the remote-launch outline laid out in @albarji's blog post . . .

  1. Booting a remote p2.xlarge server with Ubuntu 16.04.2 LTS (GNU/Linux 4.4.0-1020-aws x86_64)
  2. Cloning the repo
  3. Running the install script

. . . I get this:

./scripts/install-nvidia.sh
Reading package lists... Done
Building dependency tree       
Reading state information... Done
Note, selecting 'libc6-dev' instead of 'libc-dev'
gcc is already the newest version (4:5.3.1-1ubuntu1).
make is already the newest version (4.1-6).
libc6-dev is already the newest version (2.23-0ubuntu9).
0 upgraded, 0 newly installed, 0 to remove and 128 not upgraded.
--2018-01-11 15:31:19--  http://us.download.nvidia.com/XFree86/Linux-x86_64/361.42/NVIDIA-Linux-x86_64-361.42.run
Resolving us.download.nvidia.com (us.download.nvidia.com)... 192.229.211.70, 2606:2800:21f:3aa:dcf:37b:1ed6:1fb
Connecting to us.download.nvidia.com (us.download.nvidia.com)|192.229.211.70|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 86760004 (83M) [application/octet-stream]
Saving to: ‘/tmp/NVIDIA-Linux-x86_64-361.42.run.1’

NVIDIA-Linux-x86_64-361.42.run.1             100%[=============================================================================================>]  82.74M   140MB/s    in 0.6s    

2018-01-11 15:31:19 (140 MB/s) - ‘/tmp/NVIDIA-Linux-x86_64-361.42.run.1’ saved [86760004/86760004]

Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 361.42...................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

ERROR: Unable to load the kernel module 'nvidia.ko'.  This happens most frequently when this kernel module was built against the wrong or improperly configured kernel sources,
       with a version of gcc that differs from the one used to build the target kernel, or if a driver such as rivafb, nvidiafb, or nouveau is present and prevents the NVIDIA
       kernel module from obtaining ownership of the NVIDIA graphics device(s), or no NVIDIA GPU installed in this system is supported by this NVIDIA Linux graphics driver
       release.
       
       Please see the log entries 'Kernel module load error' and 'Kernel messages' at the end of the file '/var/log/nvidia-installer.log' for more information.


ERROR: Installation has failed.  Please see the file '/var/log/nvidia-installer.log' for details.  You may find suggestions on fixing installation problems in the README
       available on the Linux driver download page at www.nvidia.com.

--2018-01-11 15:31:53--  https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
Resolving github.com (github.com)... 192.30.253.113, 192.30.253.112
Connecting to github.com (github.com)|192.30.253.113|:443... connected.
HTTP request sent, awaiting response... 502 Bad Gateway
2018-01-11 15:31:53 ERROR 502: Bad Gateway.

dpkg: error processing archive /tmp/nvidia-docker*.deb (--install):
 cannot access archive: No such file or directory
Errors were encountered while processing:
 /tmp/nvidia-docker*.deb
sudo: nvidia-docker: command not found

This looks like a driver mismatch. Unfortunately I can't test this locally (wrong GPU), so I'm left guessing whether the image needs rebuilding or whether I need to change my EC2 config somehow. It looks like the driver version in the script needs a bump.
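To narrow down which of the causes listed in the error applies, the installer log it points to can be inspected directly. The following is only a sketch: the helper name `show_installer_errors` and the `CONTEXT_LINES` variable are made up for illustration, and the default log path and section names are taken from the error message above.

```shell
#!/usr/bin/env bash
# Print the installer log sections that the error message refers to.
# show_installer_errors and CONTEXT_LINES are hypothetical names for this sketch.
show_installer_errors() {
  local log="${1:-/var/log/nvidia-installer.log}"
  # -A prints N lines of context after each matching line.
  grep -E -A "${CONTEXT_LINES:-20}" 'Kernel module load error|Kernel messages' "$log"
}

# Follow-up checks to run on the instance itself (commented out here,
# since they only make sense on the real machine):
#   show_installer_errors            # may need sudo to read the log
#   lsmod | grep -i nouveau          # is a conflicting driver loaded?
#   uname -r                         # kernel the module must be built against
#   gcc --version                    # compiler used to build the module
#   lspci | grep -i nvidia           # is the GPU visible at all?
```

Each commented check maps to one of the possibilities the installer names: a conflicting driver such as nouveau, a kernel or gcc mismatch, or a GPU the driver release does not support.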

wboykinm avatar Jan 11 '18 16:01 wboykinm

UPDATE: I bumped the driver to the [apparently] current version, and it threw the same error as above.
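Separately from the driver error, the log above shows a second, unrelated failure: the nvidia-docker `.deb` download returned `502 Bad Gateway`, and `dpkg` then ran anyway against a file that was never saved. A minimal sketch of how a script could retry transient failures and stop at the first unrecoverable one is below. The `retry` helper and the `RETRY_DELAY` variable are hypothetical and not part of the repo's script; the URL in the usage comment is the one from the log.

```shell
#!/usr/bin/env bash
# Abort on the first failing command, unset variable, or failed pipeline stage,
# so later steps such as dpkg never run against a missing download.
set -euo pipefail

# retry N CMD...: run CMD up to N times, pausing between failed attempts.
# Hypothetical helper for illustration only.
retry() {
  local attempts=$1
  shift
  local i
  for ((i = 1; i <= attempts; i++)); do
    if "$@"; then
      return 0
    fi
    echo "attempt $i/$attempts failed: $*" >&2
    if ((i < attempts)); then
      sleep "${RETRY_DELAY:-5}"
    fi
  done
  return 1
}

# Usage (commented out so the sketch has no network side effects):
#   retry 3 wget -P /tmp https://github.com/NVIDIA/nvidia-docker/releases/download/v1.0.1/nvidia-docker_1.0.1-1_amd64.deb
#   sudo dpkg -i /tmp/nvidia-docker*.deb
```

With `set -e` in effect, a download that still fails after the last attempt ends the script there instead of producing the misleading `dpkg` and `command not found` errors further down.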

wboykinm avatar Jan 11 '18 19:01 wboykinm

Hey @wboykinm! It's been a while since I last used that script to deploy this container, so I'm afraid it's pretty outdated. My recommendation right now would be to create an instance from one of the AMIs provided by NVIDIA, which already come prepared with the appropriate driver and nvidia-toolkit versions.

I use the AMI named "NVIDIA CUDA Toolkit 7.5 on Amazon Linux" and that one works pretty well. The only things you need to install manually after creating the instance are docker and nvidia-docker. After that you should be ready to run the container!

albarji avatar Jan 16 '18 22:01 albarji