
[SURE-8882] extend our testing to ssh helm chart downloads with keys

Open kkaempf opened this issue 1 year ago • 3 comments

Acceptance Criteria

  • [ ] write a test for Helm chart downloads via SSH as in https://github.com/rancher/fleet/blob/43cf0a41330c57c3d1b853e00ab66ab2c1899d6a/internal/bundlereader/loaddirectory.go#L242C2-L270C3

SURE-8882

Issue description:

The customer upgraded from Rancher 2.8.2 to Rancher 2.8.5 and some of their upstream fleet jobs are getting this error:

time=2024-07-30 15:16:18.000000 level=fatal msg="error downloading 'ssh://[email protected]/xxxxx/fleet-platform.git?sshkey=redacted': /usr/bin/git exited with 128: Cloning into '/tmp/getter624252719/temp'...\nNo user exists for uid 1000\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"

Troubleshooting steps:

The customer tried changing the credentials and still gets the same error. They are able to clone the repository locally using the same credentials supplied to Fleet. This also happens on most of the configured repositories, not just one or two git repos. They are able to exec into the gitjob pod and manually clone the repo with success. Checked from inside the GitJob pod:

 > kubectl exec -n cattle-fleet-system gitjob-7889c69f49-5kq8r -it -- cat /etc/passwd
root:x:0:0:root:/root:/bin/bash
sshd:x:499:486:SSH daemon:/var/lib/sshd:/usr/sbin/nologin
gitjob:x:1000:1000::/home/gitjob:/bin/bash

I reviewed two of their GitRepo manifests (working/not-working) and they are literally pointing to the same repo; the only difference is the path used. The customer downgraded from 0.9.5 to 0.9.0 and the problem repos started to sync again.

Repro steps:

unable to repro in-house

Workaround:

Is a workaround available and implemented? yes/no What is the workaround: Downgrade fleet

Actual behavior:

After upgrade, some gitrepos fail with error:

time=2024-07-30 15:16:18.000000 level=fatal msg="error downloading 'ssh://[email protected]/xxxxx/fleet-platform.git?sshkey=redacted': /usr/bin/git exited with 128: Cloning into '/tmp/getter624252719/temp'...\nNo user exists for uid 1000\r\nfatal: Could not read from remote repository.\n\nPlease make sure you have the correct access rights\nand the repository exists.\n"

Expected behavior: All gitrepos continue to sync with no error

Files, logs, traces:  

Additional notes:

kkaempf avatar Aug 16 '24 15:08 kkaempf

We're facing the same issue here. SSH doesn't work at all when pulling helm charts. It might be related to https://github.com/rancher/fleet/commit/326ad93d73c70a48b4ed2e1bee4fe955d83491f4. A quick Google search suggests that OpenSSH depends on the entry existing at /etc/passwd for some obscure reason.

Amusingly, using SSH for the GitRepo itself does work (the initcontainer runs fleet gitcloner and exits with success). It only seems to be an issue when running fleet apply.

We're using Rancher 2.9.3 with fleet 0.10.4

cienijr avatar Nov 06 '24 03:11 cienijr

> We're facing the same issue here. SSH doesn't work at all when pulling helm charts. It might be related to 326ad93. A quick Google search suggests that OpenSSH depends on the entry existing at /etc/passwd for some obscure reason.
>
> Amusingly, using SSH for the GitRepo itself does work (the initcontainer runs fleet gitcloner and exits with success). It only seems to be an issue when running fleet apply.

If downloading a helm chart via ssh fails, that would be a separate issue. The fleet apply CLI uses go-getter to download charts. Cloning the git repository is done by fleet gitcloner and it uses go-git.

manno avatar Nov 15 '24 10:11 manno

I looked up the source code for go-getter, and its support for fetching from Git repos indeed relies on running git commands directly, which would trigger the OpenSSH error related to the missing /etc/passwd entry.

From what I could gather, go-git has a Git implementation of its own and it uses crypto/ssh for transport instead of OpenSSH. I'm pretty sure that it does not perform this validation, which seems to be the culprit for this weird requirement.

That's probably why this problem happens when fleet downloads a chart through Git+SSH but not when it fetches a GitRepo using Git+SSH - in our case, even when using the same repo and same credentials.
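
One way to check for this condition from inside a container is to test whether the current uid has a passwd entry, since the OpenSSH client exits with "No user exists for uid N" when that lookup fails. This is only a minimal sketch, assuming a POSIX shell and awk are available in the image. `has_passwd_entry` is a hypothetical helper name, and it reads /etc/passwd directly rather than going through NSS, so it may disagree with OpenSSH on systems that resolve users some other way.

```shell
# has_passwd_entry UID [FILE]
# Succeeds if FILE (default: /etc/passwd) contains an entry whose third
# field (the numeric uid) equals UID. Hypothetical helper for illustration.
has_passwd_entry() {
    awk -F: -v uid="$1" '$3 == uid { found = 1 } END { exit !found }' \
        "${2:-/etc/passwd}"
}

# Check the uid this shell is running as.
has_passwd_entry "$(id -u)" ||
    echo "no passwd entry for uid $(id -u); OpenSSH-based git clones may fail"
```

Running this in the container that executes fleet apply (not the gitjob controller pod checked earlier) would show whether the uid used there is missing from /etc/passwd.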

(I apologize if any of my messages come across as confusing or maybe rude, as English is not my native language)

cienijr avatar Nov 15 '24 17:11 cienijr

Hello, any update on this issue?? We are getting the same error in Fleet version v0.10.4

antonio-qualtio avatar May 13 '25 06:05 antonio-qualtio

/backport v2.11.4

manno avatar Jul 09 '25 13:07 manno

QA Template

Reproducing the issue

  1. Create a private git repository that can be accessed using an SSH key and can contain any helm chart to deploy.
  2. Create a secret that contains the key to access this repository.
    kubectl -n fleet-local create secret generic ssh --type "kubernetes.io/ssh-auth" \
        --from-file=ssh-privatekey=./../../keys/deploy
    
    In this example, we've created a secret named ssh.
  3. Create a public git repository that contains a fleet.yaml file. The fleet.yaml has to reference the private git repository like so:
     helm:
         chart: git::ssh://git@github.com/p-se/fleet-devel.git//issues/SURE-8882/foo?ref=master
    
    where //issues/SURE-8882/foo?ref=master is optional and provides a path inside the git repository and the branch to be used to clone from.
  4. Have the GitRepo resource point to the public git repository with the fleet.yaml file. For example:
    kind: GitRepo
    apiVersion: fleet.cattle.io/v1alpha1
    metadata:
      name: sure-8882
      namespace: fleet-local
    spec:
      repo: git@github.com:p-se/fleet-devel.git
      branch: master
      paths:
        - issues/SURE-8882/repro
      clientSecretName: ssh
    
  5. Apply the GitRepo resource onto the cluster.
  6. You should see the No user exists for uid 1000 message in the status of the GitRepo.
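
To make the chart reference from step 3 easier to read, here is a small sketch that splits such a URL into repository, sub-directory and ref. It assumes the shape `[git::]<scheme>://<repo>[//<subdir>][?ref=<ref>]` and that `ref` is the first query parameter. `parse_chart_url` and the example.com URL are made up for illustration, and this is not how go-getter itself parses the string.

```shell
# parse_chart_url URL
# Splits a go-getter style git URL into repo, subdir and ref.
# Hypothetical helper; assumes the URL has a scheme and that "ref",
# if present, is the first query parameter.
parse_chart_url() {
    url=${1#git::}                  # drop the forced "git::" prefix, if any
    ref=
    case $url in
        *\?ref=*)
            ref=${url##*\?ref=}     # text after "?ref="
            ref=${ref%%"&"*}        # ignore any further query parameters
            url=${url%%\?*}         # strip the whole query string
            ;;
    esac
    scheme=${url%%://*}             # e.g. "ssh"
    rest=${url#*://}                # host and path after the scheme
    subdir=
    case $rest in
        *//*)
            subdir=${rest#*//}      # path inside the repository
            rest=${rest%%//*}       # repository part only
            ;;
    esac
    printf 'repo=%s://%s\nsubdir=%s\nref=%s\n' "$scheme" "$rest" "$subdir" "$ref"
}

parse_chart_url 'git::ssh://git@example.com/org/repo.git//charts/foo?ref=main'
# prints:
#   repo=ssh://git@example.com/org/repo.git
#   subdir=charts/foo
#   ref=main
```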

Notes

The public repository can also be private, but doesn't have to be. If both are private, you can point the GitRepo resource and the chart reference at different paths of a single git repository for testing. That's where the provided examples come from.

p-se avatar Jul 10 '25 11:07 p-se

Verified on Rancher v2.12-6d2a9d53b44f309eac233e0f21fd9d5e806b056d-head with Fleet 0.13.0-beta.3.

Added test case: https://app.qase.io/case/FLEET-184


Reproduced issue before fix:

Image

Testing steps:

  • Set up 2 repos: 1 public, used with the ssh secret and pointing to the private one, and 1 private with the actual chart

Public repo: https://github.com/fleetqa/fleet-qa-examples-public

kind: GitRepo
apiVersion: fleet.cattle.io/v1alpha1
metadata:
  name: sure-8882
  namespace: fleet-local
spec:
  repo: git@github.com:fleetqa/fleet-qa-examples-public.git
  branch: main
  paths:
    - sure-8882/reproducer
  clientSecretName: ssh

and fleet.yaml pointing to the ssh chart:

 helm:
     chart: git::ssh://git@github.com/fleetqa/fleet-qa-examples.git//helmchart-configmap?ref=main

Private repo with the actual chart (configmap chart): https://github.com/fleetqa/fleet-qa-examples/tree/main/helmchart-configmap

apiVersion: v1
kind: ConfigMap
metadata:
  name: foo
  namespace: default
data:
  key: value

apiVersion: v2
name: foo
description: A Helm chart for Kubernetes

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "1.16.0"

  • Create Deploy Key on private repository

  • Upload secret via UI (passing deploy key as ssh)

apiVersion: v1
data:
  ssh-privatekey: " add private secret"
kind: Secret
metadata:
  creationTimestamp: null
  name: ssh
  namespace: fleet-local
type: kubernetes.io/ssh-auth

  • Add repo:

https://github.com/fleetqa/fleet-qa-examples-public.git main sure-8882/reproducer helm ssh key: ssh

  • Gitrepo is deployed: Image

mmartin24 avatar Jul 14 '25 06:07 mmartin24

Does this work for a private repo for the GitRepo as well as the repo path for the helm chart in fleet.yaml? I was getting the uid 1000 error on Rancher 2.11.3. Now retesting on Rancher 2.11.4, following the above QA, I'm getting the error:

Bcrypt_pbkdf: empty password:Job Failed. failed: 1/1time="2025-08-04T19:42:42Z" level=fatal msg="failed to create auth from options for repo=\"[email protected]:myorg/myrepo.git\" branch=\"main\" revision=\"\" path=\"/workspace\": bcrypt_pbkdf: empty password"

krumware avatar Aug 04 '25 19:08 krumware

Scratch that, it's working. Thanks!

In case others stumble on this, a reminder:

  • ssh keys must be generated in PEM format.
  • ssh keys must NOT use a passphrase
  • ssh keys in PEM format do NOT support ed25519, which is what github recommends
  1. ssh-keygen -t rsa -b 4096 -m pem -C "[email protected]"
  2. leave passphrase empty
  3. create kubernetes secret from private key
kubectl -n fleet-local create secret generic ssh --type "kubernetes.io/ssh-auth" \
    --from-file=ssh-privatekey=path/to/privatekey
  4. cat path/to/publickey - add to github.
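
A rough pre-flight check for the points above, as a sketch only: it looks at the key file's header lines and mirrors this checklist, not Fleet's actual validation. `check_fleet_key` is a hypothetical helper name. It relies on two facts about key files: `ssh-keygen -m pem` writes RSA keys with a `BEGIN RSA PRIVATE KEY` header, and passphrase-protected keys in that format carry a `Proc-Type: 4,ENCRYPTED` line.

```shell
# check_fleet_key FILE
# Returns 0 only for an unencrypted RSA private key in PEM format.
# Hypothetical helper; it inspects headers only and does not parse the key.
check_fleet_key() {
    header=$(head -n 1 "$1")
    case $header in
        '-----BEGIN RSA PRIVATE KEY-----')
            if grep -q '^Proc-Type: 4,ENCRYPTED' "$1"; then
                echo 'PEM RSA key is passphrase-protected'
                return 1
            fi
            echo 'unencrypted RSA key in PEM format'
            return 0
            ;;
        '-----BEGIN OPENSSH PRIVATE KEY-----')
            echo 'OpenSSH-format key, not PEM (see ssh-keygen -m pem)'
            return 1
            ;;
        *)
            echo 'unrecognized private key header'
            return 1
            ;;
    esac
}
```

Usage would look like `check_fleet_key path/to/privatekey` before creating the secret.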

krumware avatar Aug 04 '25 21:08 krumware

Thanks @krumware - do we need to fix our documentation ? 🤔

kkaempf avatar Aug 05 '25 08:08 kkaempf

I think maybe doing a pass on that portion of the docs would be helpful. (Maybe it's covered elsewhere, but most troubleshooting and searching took me straight to the fleet and gitrepo reference pages.)

My struggles were:

  • Finding a clear example of a workflow where both a gitrepo.yaml and a fleet.yaml are authenticated. The documentation tends to lean on at least one of these being unauthenticated. For gitops newbies it creates a challenge when internal repos will most likely always be private.
  • The (necessary) re-use of the word "repository" creates confusion because the authentication for a git repository and a helm repository are different. I wish there was a way to more obviously separate the two.
  • It was challenging to discover that the fleet yaml's authentication method is inherited from the method used by the gitrepo yaml.
  • It's still unclear whether different authentication mechanisms can be used for a gitrepo yaml versus a fleet yaml. (It seems, for example, that a user would need to use the same ssh key for potentially physically separated repository servers, an area where enhancement might be coming with the upstream secrets support.)
  • Despite warning sections indicating that certain legacy algorithms are required, and that passwords aren't supported, those notes didn't feel cohesive with the rest of the documentation and were not obvious.
  • It was pointed out in the Rancher slack (thank you!) that go-getter urls are being used. But that really did not stand out in the documentation, which made things feel a little bit too automagic, so there was lots of experimentation needed to get to a working configuration.
  • The error messages in the Rancher CI UI, such as the bcrypt_pbkdf empty password, were not found in any troubleshooting documentation. I was using a password-protected key, but could not find any way to pass that in, so it embarrassingly took me way too long to find the remark about passwords not being supported.
  • Finally, documentation on Github recommends against using deprecated algorithms which are required for fleet (at this point). So that unfortunately adds to the complexity for adopters, who then need to troubleshoot why their generated key for a private github repository doesn't work.

I hope that's helpful!

krumware avatar Aug 06 '25 14:08 krumware

@kakabisht 👆🏻

kkaempf avatar Aug 07 '25 12:08 kkaempf