
repoConfig to override /etc/apt/sources.list is not working

Open withoutnickname opened this issue 4 years ago • 6 comments

k8s: 1.18.10, self-hosted, without Internet access
workers: Ubuntu 18.04.4

The nvidia-driver-daemonset pod fails during the package update (because the cluster is private):

Checking NVIDIA driver packages...
Updating the package cache...
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease  Could not connect to archive.ubuntu.com:80 (91.189.88.152), connection timed out Could not connect to archive.ubuntu.com:80 (91.189.88.142), connection timed out
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease  Unable to connect to archive.ubuntu.com:http:
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-security/InRelease  Unable to connect to archive.ubuntu.com:http:
W: Some index files failed to download. They have been ignored, or old ones used instead.
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.4.0-80-generic
Installing Linux kernel headers...
E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/l/linux/linux-headers-5.4.0-80_5.4.0-80.90_all.deb  Could not connect to archive.ubuntu.com:80 (91.189.88.152), connection timed out Could not connect to archive.ubuntu.com:80 (91.189.88.142), connection timed out
E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/l/linux/linux-headers-5.4.0-80-generic_5.4.0-80.90_amd64.deb  Unable to connect to archive.ubuntu.com:http:
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?

To work around this, we tried to specify our custom deb sources in the operator's ClusterPolicy via:

    repoConfig:
      configMapName: repo-config
      destinationDir: /etc/apt/sources.list.d

but it does not work as expected: default sources.list stays in use as well. By the way, after manually updating the default sources.list to our custom values, the container proceeds successfully.

If we try destinationDir: /etc/apt/ instead, all other files in that directory are removed.

Please consider adding something like subPath for RepoConfig.

withoutnickname avatar Sep 07 '21 10:09 withoutnickname

but it does not work as expected: default sources.list stays in use as well.

Can you provide driver logs for this case? It is expected behavior for the default sources.list file to remain unchanged, which means you will still see similar warning messages in the driver logs. However, if your repo-config file is configured correctly, the driver container should be able to pull the packages from your local mirror and proceed without failure.
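
To tell expected noise from a real failure in a captured driver log, note that apt prefixes warnings with W: and errors with E:. The sketch below counts both and returns non-zero only when E: lines are present. triage_apt_log is a hypothetical helper name, and the log file is assumed to have been saved beforehand (for example from kubectl logs for the nvidia-driver-ctr container).

```shell
# Hypothetical helper: summarize apt messages in a saved driver log.
# "W:" lines are apt warnings, "E:" lines are apt errors.
triage_apt_log() {
    log_file="$1"
    # grep -c prints 0 but exits non-zero when nothing matches, hence "|| true".
    warnings=$(grep -c '^W: ' "$log_file" || true)
    errors=$(grep -c '^E: ' "$log_file" || true)
    echo "apt warnings: $warnings, apt errors: $errors"
    # Succeed only if apt reported no errors.
    [ "$errors" -eq 0 ]
}
```

Usage, assuming the log was saved as driver.log: `triage_apt_log driver.log || echo "apt could not fetch or install packages"`.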

cdesiniotis avatar Sep 07 '21 16:09 cdesiniotis

Hi @cdesiniotis! Logs are the same (from nvidia-driver-ctr container):

Creating directory NVIDIA-Linux-x86_64-470.57.02
Verifying archive integrity... OK
Uncompressing NVIDIA Accelerated Graphics Driver for Linux-x86_64 470.57.02.......................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................................

WARNING: Unable to determine the default X library path. The path /tmp/null/lib will be used, but this path was not detected in the ldconfig(8) cache, and no directory exists at this path, so it is likely that libraries installed there will not be found by the loader.


WARNING: You specified the '--no-kernel-module' command line option, nvidia-installer will not install a kernel module as part of this driver installation, and it will not remove existing NVIDIA kernel modules not part of an earlier NVIDIA driver installation.  Please ensure that an NVIDIA kernel module matching this driver version is installed separately.


========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 470.57.02 for Linux kernel version 5.4.0-80-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease  Could not connect to archive.ubuntu.com:80 (91.189.88.142), connection timed out Could not connect to archive.ubuntu.com:80 (91.189.88.152), connection timed out
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease  Unable to connect to archive.ubuntu.com:http:
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-security/InRelease  Unable to connect to archive.ubuntu.com:http:
W: Some index files failed to download. They have been ignored, or old ones used instead.
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.4.0-80-generic
Installing Linux kernel headers...
E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/l/linux/linux-headers-5.4.0-80_5.4.0-80.90_all.deb  Could not connect to archive.ubuntu.com:80 (91.189.88.152), connection timed out Could not connect to archive.ubuntu.com:80 (91.189.88.142), connection timed out
E: Failed to fetch http://archive.ubuntu.com/ubuntu/pool/main/l/linux/linux-headers-5.4.0-80-generic_5.4.0-80.90_amd64.deb  Unable to connect to archive.ubuntu.com:http:
E: Unable to fetch some archives, maybe run apt-get update or try with --fix-missing?
Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...

and the pod restarts constantly. The ConfigMap is created and specified in the operator with valid data; we tested the same entries manually by updating the default sources.list.

withoutnickname avatar Sep 09 '21 09:09 withoutnickname

Can you confirm that /etc/apt/sources.list.d/ gets created inside the driver container and your custom repo file can be found in that directory?

Edit: Can you also confirm that your repo list file ends with extension .list? What version of the gpu operator are you running?
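
The extension matters because apt only reads files in sources.list.d whose names end in .list (or .sources for the deb822 format) and contain only letters, digits, underscores, hyphens, and periods; other files are ignored. A minimal sketch of that check follows. check_sources_dir is a hypothetical helper name, and it only inspects file names, not file contents.

```shell
# Hypothetical helper: report which files in a sources.list.d directory
# apt would read, based on the file-name rules from sources.list(5).
check_sources_dir() {
    dir="${1:-/etc/apt/sources.list.d}"
    for f in "$dir"/*; do
        # Skip an unmatched glob and anything that is not a regular file.
        # The glob also skips dotfiles such as a ConfigMap volume's ..data link.
        [ -f "$f" ] || continue
        name=$(basename "$f")
        case "$name" in
            *[!A-Za-z0-9_.-]*) echo "IGNORED (invalid characters): $name" ;;
            *.list|*.sources)  echo "read: $name" ;;
            *)                 echo "IGNORED (wrong extension): $name" ;;
        esac
    done
}
```

Usage inside the driver container: `check_sources_dir /etc/apt/sources.list.d`.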

cdesiniotis avatar Sep 09 '21 19:09 cdesiniotis

Hi @cdesiniotis! Sorry for the late response. Yes, I confirm both. A file named sources.list is created in /etc/apt/sources.list.d/. We followed the instructions.

gpu-operator:v1.8.0

withoutnickname avatar Sep 16 '21 12:09 withoutnickname

I am also experiencing this issue with v1.9.1 as well as the master branch of this repo.

The ConfigMap for the repo config simply isn't getting mounted by the driver installation pod.

jcstryker avatar Mar 08 '22 21:03 jcstryker

@jcstryker Could you provide some more detail? Are there any driver logs you can provide?

I brought up a disconnected environment with a repository mirror, but was not able to reproduce any issue with GPU Operator v1.9.1.

For reference, I did the following:

... create repo-config ConfigMap in the gpu-operator namespace ...
$ cat custom-repo.list
deb [arch=amd64] http://<mirror-ip>/ubuntu/mirror/archive.ubuntu.com/ubuntu focal main universe
deb [arch=amd64] http://<mirror-ip>/ubuntu/mirror/archive.ubuntu.com/ubuntu focal-updates main universe
deb [arch=amd64] http://<mirror-ip>/ubuntu/mirror/archive.ubuntu.com/ubuntu focal-security main universe
$ kubectl create ns gpu-operator
$ kubectl create cm repo-config -n gpu-operator --from-file=custom-repo.list

... modify the default values for the driver pod ...
$ vi values.yaml
...
driver:
  ...
  env:
    - name: http_proxy
      value: <proxy-ip>
    - name: https_proxy
      value: <proxy-ip>
    - name: HTTP_PROXY
      value: <proxy-ip>
    - name: HTTPS_PROXY
      value: <proxy-ip>
  ...
  repoConfig:
    configMapName: "repo-config"
  ...
 
$ helm install gpu-operator gpu-operator-v1.9.1.tgz -n gpu-operator -f values.yaml --wait
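
A malformed entry in custom-repo.list would only show up once the driver pod runs apt, so it can help to sanity-check the file before creating the ConfigMap. The sketch below assumes the one-line format used above (deb or deb-src, an optional bracketed option list, a URI, a suite, then components). validate_sources_list is a hypothetical helper name, and it checks only the shape of each line, not whether the mirror is reachable.

```shell
# Hypothetical helper: check that every entry in a one-line-style .list file
# has the shape: deb|deb-src [options] URI suite [components...]
validate_sources_list() {
    awk '
        BEGIN { bad = 0 }
        { sub(/#.*/, "") }            # drop comments
        NF == 0 { next }              # skip blank lines
        $1 != "deb" && $1 != "deb-src" {
            print FILENAME ":" FNR ": unknown entry type: " $1
            bad = 1
            next
        }
        {
            i = 2
            if ($i ~ /^\[/) {         # skip an option list such as [arch=amd64]
                while (i <= NF && $i !~ /\]$/) i++
                i++
            }
            if (NF - i + 1 < 2) {     # need at least a URI and a suite
                print FILENAME ":" FNR ": missing URI or suite"
                bad = 1
            }
        }
        END { exit bad }
    ' "$1"
}
```

Usage: `validate_sources_list custom-repo.list && kubectl create cm repo-config -n gpu-operator --from-file=custom-repo.list`.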

The custom-repo.list file gets properly mounted inside the driver pod at /etc/apt/sources.list.d/

$ kubectl exec -it -n gpu-operator -c nvidia-driver-ctr nvidia-driver-daemonset-tbhmm -- ls -ltr /etc/apt/sources.list.d/
total 8
-rw-r--r-- 1 root root  81 Jan  7 01:27 cuda.list
-rw-r--r-- 1 root root 311 Mar  9 01:01 custom-repo.list

And from the driver logs, the driver pod is able to download the necessary kernel packages.

========== NVIDIA Software Installer ==========

Starting installation of NVIDIA driver version 470.82.01 for Linux kernel version 5.4.0-104-generic

Stopping NVIDIA persistence daemon...
Unloading NVIDIA driver kernel modules...
Unmounting NVIDIA driver rootfs...
Checking NVIDIA driver packages...
Updating the package cache...
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal/InRelease  503  Service Unavailable [IP: <proxy-ip> 80]
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-updates/InRelease  503  Service Unavailable [IP: <proxy-ip> 80]
W: Failed to fetch http://archive.ubuntu.com/ubuntu/dists/focal-security/InRelease  503  Service Unavailable [IP: <proxy-ip> 80]
W: Failed to fetch https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/InRelease  Reading from proxy failed - select (115: Operation now in progress) [IP: <proxy-ip> 80]
W: Some index files failed to download. They have been ignored, or old ones used instead.
Resolving Linux kernel version...
Proceeding with Linux kernel version 5.4.0-104-generic
Installing Linux kernel headers...
Installing Linux kernel module files...
Generating Linux kernel version string...
Compiling NVIDIA driver kernel modules...

@jcstryker Are you following a similar procedure?

@withoutnickname sorry for the long delay on this issue. Do you have any updates?

cdesiniotis avatar Mar 09 '22 01:03 cdesiniotis