
An error occurred while attempting an offline installation

Open skyhhjmk opened this issue 8 months ago • 6 comments

Which version of KubeKey has the issue?

kk version: &version.Info{Major:"3", Minor:"0", GitVersion:"v3.0.7", GitCommit:"e755baf67198d565689d7207378174f429b508ba", GitTreeState:"clean", BuildDate:"2023-01-18T01:57:24Z", GoVersion:"go1.19.2", Compiler:"gc", Platform:"linux/amd64"}

What is your OS environment?

ubuntu22.04

KubeKey config file

apiVersion: kubekey.kubesphere.io/v1alpha2
kind: Cluster
metadata:
  name: hhjmk-kube
spec:
  hosts:
  - {name: master, address: 10.111.0.1, internalAddress: 10.111.0.1, user: root, privateKeyPath: "/root/pri-key"}
  - {name: harbor, address: 10.111.0.100, internalAddress: 10.111.0.100, user: root, privateKeyPath: "/root/pri-key"}
  - {name: node1, address: 10.111.0.2, internalAddress: 10.111.0.2, user: root, privateKeyPath: "/root/pri-key"}
  roleGroups:
    etcd:
    - master
    control-plane: 
    - master
    worker:
    - node1
    registry:
    - harbor
  controlPlaneEndpoint:
    ## Internal loadbalancer for apiservers 
    # internalLoadbalancer: haproxy

    domain: lb.kubesphere.local
    address: ""
    port: 6443
  kubernetes:
    version: v1.21.14
    clusterName: hhjmk.kube
    autoRenewCerts: true
    containerManager: docker
  etcd:
    type: kubekey
  network:
    plugin: calico
    kubePodsCIDR: 10.233.64.0/18
    kubeServiceCIDR: 10.233.0.0/18
    ## multus support. https://github.com/k8snetworkplumbingwg/multus-cni
    multusCNI:
      enabled: false
  registry:
    type: harbor
    auths:
      "dockerhub.kubekey.local":
        username: admin
        password: [I delete this]
    privateRegistry: "dockerhub.kubekey.local"
    namespaceOverride: "kubesphereio"
      #privateRegistry: ""
      #namespaceOverride: ""
    registryMirrors: []
    insecureRegistries: []
  addons: []



---
apiVersion: installer.kubesphere.io/v1alpha1
kind: ClusterConfiguration
metadata:
  name: ks-installer
  namespace: kubesphere-system
  labels:
    version: v3.3.2
spec:
  persistence:
    storageClass: ""
  authentication:
    jwtSecret: ""
  zone: ""
  local_registry: ""
  namespace_override: ""
  # dev_tag: ""
  etcd:
    monitoring: false
    endpointIps: localhost
    port: 2379
    tlsEnable: true
  common:
    core:
      console:
        enableMultiLogin: true
        port: 30880
        type: NodePort
    # apiserver:
    #  resources: {}
    # controllerManager:
    #  resources: {}
    redis:
      enabled: false
      volumeSize: 2Gi
    openldap:
      enabled: false
      volumeSize: 2Gi
    minio:
      volumeSize: 20Gi
    monitoring:
      # type: external
      endpoint: http://prometheus-operated.kubesphere-monitoring-system.svc:9090
      GPUMonitoring:
        enabled: false
    gpu:
      kinds:
      - resourceName: "nvidia.com/gpu"
        resourceType: "GPU"
        default: true
    es:
      # master:
      #   volumeSize: 4Gi
      #   replicas: 1
      #   resources: {}
      # data:
      #   volumeSize: 20Gi
      #   replicas: 1
      #   resources: {}
      logMaxAge: 7
      elkPrefix: logstash
      basicAuth:
        enabled: false
        username: ""
        password: ""
      externalElasticsearchHost: ""
      externalElasticsearchPort: ""
  alerting:
    enabled: false
    # thanosruler:
    #   replicas: 1
    #   resources: {}
  auditing:
    enabled: false
    # operator:
    #   resources: {}
    # webhook:
    #   resources: {}
  devops:
    enabled: false
    # resources: {}
    jenkinsMemoryLim: 8Gi
    jenkinsMemoryReq: 4Gi
    jenkinsVolumeSize: 8Gi
  events:
    enabled: false
    # operator:
    #   resources: {}
    # exporter:
    #   resources: {}
    # ruler:
    #   enabled: true
    #   replicas: 2
    #   resources: {}
  logging:
    enabled: false
    logsidecar:
      enabled: true
      replicas: 2
      # resources: {}
  metrics_server:
    enabled: false
  monitoring:
    storageClass: ""
    node_exporter:
      port: 9100
      # resources: {}
    # kube_rbac_proxy:
    #   resources: {}
    # kube_state_metrics:
    #   resources: {}
    # prometheus:
    #   replicas: 1
    #   volumeSize: 20Gi
    #   resources: {}
    #   operator:
    #     resources: {}
    # alertmanager:
    #   replicas: 1
    #   resources: {}
    # notification_manager:
    #   resources: {}
    #   operator:
    #     resources: {}
    #   proxy:
    #     resources: {}
    gpu:
      nvidia_dcgm_exporter:
        enabled: false
        # resources: {}
  multicluster:
    clusterRole: none
  network:
    networkpolicy:
      enabled: false
    ippool:
      type: none
    topology:
      type: none
  openpitrix:
    store:
      enabled: false
  servicemesh:
    enabled: false
    istio:
      components:
        ingressGateways:
        - name: istio-ingressgateway
          enabled: false
        cni:
          enabled: false
  edgeruntime:
    enabled: false
    kubeedge:
      enabled: false
      cloudCore:
        cloudHub:
          advertiseAddress:
            - ""
        service:
          cloudhubNodePort: "30000"
          cloudhubQuicNodePort: "30001"
          cloudhubHttpsNodePort: "30002"
          cloudstreamNodePort: "30003"
          tunnelNodePort: "30004"
        # resources: {}
        # hostNetWork: false
      iptables-manager:
        enabled: true
        mode: "external"
        # resources: {}
      # edgeService:
      #   resources: {}
  terminal:
    timeout: 600

A clear and concise description of what happened.

I ran into an error while attempting an offline installation. This is my third attempt at reinstalling, and a new error has appeared. The previous error seemed to be caused by trying to delete files on a read-only filesystem: I noticed that the ISO file was mounted read-only. The final error output was:

This is a simple check of your environment.
Before installation, ensure that your machines meet all requirements specified at
https://github.com/kubesphere/kubekey#requirements-and-recommendations

Continue this installation? [yes/no]: yes
05:26:04 UTC success: [LocalHost]
05:26:04 UTC [UnArchiveArtifactModule] Check the KubeKey artifact md5 value
05:26:39 UTC success: [LocalHost]
05:26:39 UTC [UnArchiveArtifactModule] UnArchive the KubeKey artifact
05:26:39 UTC skipped: [LocalHost]
05:26:39 UTC [UnArchiveArtifactModule] Create the KubeKey artifact Md5 file
05:26:39 UTC skipped: [LocalHost]
05:26:39 UTC [RepositoryModule] Get OS release
05:26:39 UTC success: [master]
05:26:39 UTC success: [harbor]
05:26:39 UTC success: [node1]
05:26:39 UTC [RepositoryModule] Sync repository iso file to all nodes
05:26:39 UTC message: [master]
reset tmp dir failed: reset tmp dir failed: Failed to exec command: sudo -E /bin/bash -c "if [ -d /tmp/kubekey ]; then rm -rf /tmp/kubekey ;fi && mkdir -m 777 -p /tmp/kubekey"

When I tried to run these commands manually, I saw many "cannot remove" messages. The command I ran manually was:

sudo -E /bin/bash -c "if [ -d /tmp/kubekey ]; then rm -rf /tmp/kubekey ;fi && mkdir -m 777 -p /tmp/kubekey"

While inspecting the mount points, I noticed:

/tmp/kubekey/ubuntu-22.04-amd64.iso (deleted) on /tmp/kubekey/iso type iso9660 (ro,relatime,nojoliet,check=s,map=n,blocksize=2048,iocharset=utf8)

I eventually resolved this error with:

sudo umount /tmp/kubekey/iso
sudo rm -rf /tmp/kubekey
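The two manual commands above can be folded into one idempotent cleanup script. A minimal sketch, assuming the paths from the mount output (`/tmp/kubekey/iso` for the iso9660 mount); `KK_TMP` is a variable introduced here for illustration, and on a real node the script needs root privileges (e.g. via sudo):

```shell
#!/bin/sh
set -e
# Clean up a kubekey tmp dir that still has the repository ISO mounted.
# KK_TMP is illustrative; the paths match the mount output shown above.
KK_TMP="${KK_TMP:-/tmp/kubekey}"

# `rm -rf` fails on the read-only iso9660 contents while the mount is
# active, so unmount it first if it is actually mounted.
if mountpoint -q "$KK_TMP/iso"; then
    umount "$KK_TMP/iso"
fi

# Now the directory can be removed and recreated the way KubeKey expects.
rm -rf "$KK_TMP"
mkdir -m 777 -p "$KK_TMP"
```

This is the same sequence KubeKey's own `reset tmp dir` step runs, just with the missing `umount` added in front.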

Relevant log output

This is a simple check of your environment.
Before installation, ensure that your machines meet all requirements specified at
https://github.com/kubesphere/kubekey#requirements-and-recommendations

Continue this installation? [yes/no]: yes
13:51:45 UTC success: [LocalHost]
13:51:45 UTC [UnArchiveArtifactModule] Check the KubeKey artifact md5 value
13:52:25 UTC success: [LocalHost]
13:52:25 UTC [UnArchiveArtifactModule] UnArchive the KubeKey artifact
13:52:25 UTC skipped: [LocalHost]
13:52:25 UTC [UnArchiveArtifactModule] Create the KubeKey artifact Md5 file
13:52:25 UTC skipped: [LocalHost]
13:52:25 UTC [RepositoryModule] Get OS release
13:52:25 UTC success: [master]
13:52:25 UTC success: [harbor]
13:52:25 UTC success: [node1]
13:52:25 UTC [RepositoryModule] Sync repository iso file to all nodes
13:52:30 UTC success: [node1]
13:52:30 UTC success: [master]
13:52:30 UTC success: [harbor]
13:52:30 UTC [RepositoryModule] Mount iso file
13:52:30 UTC success: [node1]
13:52:30 UTC success: [master]
13:52:30 UTC success: [harbor]
13:52:30 UTC [RepositoryModule] New repository client
13:52:30 UTC success: [node1]
13:52:30 UTC success: [master]
13:52:30 UTC success: [harbor]
13:52:30 UTC [RepositoryModule] Backup original repository
13:52:30 UTC message: [master]
backup repository failed: Failed to exec command: sudo -E /bin/bash -c "mv /etc/apt/sources.list /etc/apt/sources.list.kubekey.bak"
mv: cannot stat '/etc/apt/sources.list': No such file or directory: Process exited with status 1
13:52:31 UTC failed: [master]
13:52:31 UTC success: [node1]
13:52:31 UTC success: [harbor]
13:52:31 UTC rollback: [harbor]
13:52:31 UTC rollback: [master]
13:52:31 UTC rollback: [node1]
error: Pipeline[CreateClusterPipeline] execute failed: Module[RepositoryModule] exec failed:
failed: [master] [BackupOriginalRepository] exec failed after 1 retires: backup repository failed: Failed to exec command: sudo -E /bin/bash -c "mv /etc/apt/sources.list /etc/apt/sources.list.kubekey.bak"
mv: cannot stat '/etc/apt/sources.list': No such file or directory: Process exited with status 1
root@kube-offline-ctl:~#

Additional information


skyhhjmk avatar Apr 11 '25 14:04 skyhhjmk

One thing to add: I checked /etc/apt/sources.list on the node, and it indeed does not exist — probably because a previous installation failed unexpectedly.
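Since the backup step renames (rather than copies) the file, the quickest way past this error is to put something back at that path. A hedged sketch, assuming the `.kubekey.bak` name from the log above; `SRC` is a variable introduced for illustration, and it defaults to a scratch path here so the sketch is safe to dry-run — on the affected node set `SRC=/etc/apt/sources.list` and run as root:

```shell
#!/bin/sh
set -e
# Restore sources.list so KubeKey's `mv` backup step can succeed.
# SRC is illustrative; defaulting to a scratch path makes a dry run harmless.
SRC="${SRC:-$(mktemp -d)/sources.list}"

if [ -f "$SRC.kubekey.bak" ]; then
    # A previous run already renamed the file: move the backup back.
    mv "$SRC.kubekey.bak" "$SRC"
elif [ ! -f "$SRC" ]; then
    # Nothing to restore: an empty file is enough for `mv` to succeed.
    touch "$SRC"
fi
```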

skyhhjmk avatar Apr 11 '25 15:04 skyhhjmk

Without the logs from the earlier runs, it is hard to tell what went wrong. Judging from the code, one of these steps may have failed and the mounted file was then not unmounted properly: https://github.com/kubesphere/kubekey/blob/63d81438f50f88d9115aa677c7749a9d59e481f6/cmd/kk/pkg/bootstrap/os/module.go#L276-L340

redscholar avatar Apr 16 '25 06:04 redscholar

Thank you very much for the code snippet. Seeing the InstallPackage keyword reminded me of another symptom: when --with-packages is added, the node running the script sets up a local package repository, and the other nodes download and install packages from it. I noticed that apt hung at the final stage, but I could not find the exact cause. It looked like this (output reproduced by manually installing packages on a test server to simulate the situation):

Setting up xxx ...
Processing triggers for xxx ...
Scanning processes...
Scanning linux images...

Running kernel seems to be up-to-date.

No services need to be restarted.

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor (qemu) binaries on this host.

After the last line appeared, the shell prompt root@kube:~# never came back — it just hung there. I kept waiting, but nothing happened until the SSH connection dropped. Since I was running the command inside screen, I reattached after waiting a whole day and it was still stuck. If I terminate the process with Ctrl+C, the next run fails with the error caused by the mount not being cleaned up properly.

The workaround was to clean up the files and unmount the directory manually, then install without the --with-packages flag.
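For what it's worth, on Ubuntu 22.04 the "Scanning processes... / Scanning linux images..." lines come from needrestart, which apt runs as a post-install hook and which can wait for input when it thinks services or the kernel need restarting. One hedged guess at an unattended-mode workaround — the exported variables are standard debconf/needrestart knobs, not anything KubeKey-specific, and the apt-get line is left commented out as illustration:

```shell
#!/bin/sh
# Force apt and its needrestart hook into non-interactive mode before
# installing packages, so an unattended run cannot block on a prompt.
export DEBIAN_FRONTEND=noninteractive  # suppress debconf questions
export NEEDRESTART_MODE=a              # 'a': restart services automatically

# Example install, matching the packages in the log above
# (commented out; run as root on the real node):
# apt-get -y install ipvsadm ipset ebtables

echo "NEEDRESTART_MODE=$NEEDRESTART_MODE"
```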

skyhhjmk avatar Apr 16 '25 06:04 skyhhjmk

I encountered the same problem: when using --with-packages, it gets stuck.

Preparing to unpack .../ipvsadm_1.31-1build2_amd64.deb ...
Unpacking ipvsadm (1:1.31-1build2) ...........]
Setting up ipvsadm (1:1.31-1build2) ..........]
Setting up ebtables (2.0.11-4build2) .........]
Setting up libipset13:amd64 (7.15-1build1) ...]
Setting up ipset (7.15-1build1) ...######.....]
Processing triggers for man-db (2.12.0-4build2) ...
Processing triggers for libc-bin (2.39-0ubuntu8.4) ...
Scanning processes...
Scanning linux images...

Pending kernel upgrade!
Running kernel version:
  6.8.0-57-generic
Diagnostics:
  The currently running kernel version is not the
expected kernel version 6.8.0-58-generic.

Restarting the system to load the new kernel will
not be handled automatically, so you should
consider rebooting.

No services need to be restarted.

No containers need to be restarted.

No user sessions are running outdated binaries.

No VM guests are running outdated hypervisor
 (qemu) binaries on this host.

gengmenglong avatar Apr 25 '25 08:04 gengmenglong

> I encountered the same problem, when use --with-packages, it's Stuck.
A temporary workaround is available: if you no longer wish to wait, terminate the process with Ctrl+C, manually unmount the volume and clean up the leftovers as I mentioned, and then run the command again without the --with-packages parameter.

skyhhjmk avatar Apr 25 '25 08:04 skyhhjmk

Oh, thanks. I think this may be related to "Pending kernel upgrade", but I can't verify it: after running it your way, the program worked normally, and when I reused the --with-packages parameter everything went smoothly. Solution: https://askubuntu.com/questions/1349884/how-to-disable-pending-kernel-upgrade-message
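For anyone landing here later, the linked answer boils down to turning off needrestart's kernel-upgrade hint. A hedged sketch of that change — the `conf.d` drop-in location and the `$nrconf{kernelhints}` knob come from needrestart's own config format; `CONF_DIR` defaults to a scratch directory here so the sketch is safe to dry-run, and on a real node it would be `/etc/needrestart/conf.d` (run as root):

```shell
#!/bin/sh
set -e
# Disable needrestart's "Pending kernel upgrade" hint via a conf.d drop-in.
# CONF_DIR is illustrative; on the node it would be /etc/needrestart/conf.d.
CONF_DIR="${CONF_DIR:-$(mktemp -d)}"
mkdir -p "$CONF_DIR"

# -1 tells needrestart not to print kernel-upgrade hints at all.
printf '%s\n' '$nrconf{kernelhints} = -1;' > "$CONF_DIR/no-kernel-hints.conf"
cat "$CONF_DIR/no-kernel-hints.conf"
```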

gengmenglong avatar Apr 25 '25 09:04 gengmenglong