
PackageServer can't connect to the gRPC server

Open cmoulliard opened this issue 5 years ago • 11 comments

Issue

The OLM PackageServer cannot connect to the gRPC server created from an image built using operator-registry.

The following CatalogSource has been deployed successfully on Kubernetes 1.15:

apiVersion: operators.coreos.com/v1alpha1
kind: CatalogSource
metadata:
  name: prometheus-manifests
spec:
  displayName: Prometheus Operator
  publisher: Snowdrop
  sourceType: grpc
  image: quay.io/cmoulliard/olm-index:0.1.0

but no packagemanifests are created within the namespace demo

When I look at the PackageServer running as a pod in the olm namespace, I see this error:

W0212 20:57:16.234189       1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {prometheus-manifests.demo.svc:50051 0  <nil>}. Err :connection error: desc = "transport: Error while dialing dial tcp 10.109.228.141:50051: i/o timeout". Reconnecting...
I0212 20:57:16.234271       1 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc000029dd0, TRANSIENT_FAILURE
I0212 20:57:17.239568       1 balancer_conn_wrappers.go:127] pickfirstBalancer: HandleSubConnStateChange: 0xc000029dd0, CONNECTING
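
The `i/o timeout` in the warning above means the TCP connect never completed before the deadline, which is a different symptom from a refused connection. A minimal sketch in Python for telling the two apart from inside any pod or node; `probe_tcp` is a hypothetical helper written for this thread, not part of OLM or operator-registry:

```python
import socket


def probe_tcp(host: str, port: int, timeout: float = 1.0) -> str:
    """Attempt a plain TCP connect and classify the outcome.

    Returns "open", "timeout", "refused", or "error: <details>".
    """
    try:
        # create_connection resolves the name and connects within `timeout` seconds.
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except socket.timeout:
        # No answer at all: packets are probably being dropped on the way.
        return "timeout"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except OSError as exc:
        # DNS failures, unreachable networks, and similar.
        return f"error: {exc}"
```

For example, `probe_tcp("prometheus-manifests.demo.svc", 50051)` run from a pod in the olm namespace would be expected to return "timeout" if it hits the same problem as the PackageServer, while the same call against the Service's clusterIP from the node should return "open", given that grpcurl works there.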

A Service resource has been created to access it:

kind: Service
apiVersion: v1
metadata:
  name: prometheus-manifests
  namespace: demo
  ownerReferences:
    - apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: prometheus-manifests
      uid: 0b863564-d8c0-471a-b3c7-63c1c1433153
      controller: false
      blockOwnerDeletion: false
spec:
  ports:
    - name: grpc
      protocol: TCP
      port: 50051
      targetPort: 50051
  selector:
    olm.catalogSource: prometheus-manifests
  clusterIP: 10.107.184.132
  type: ClusterIP
  sessionAffinity: None

Here is the pod resource created for the grpc server

kind: Pod
apiVersion: v1
metadata:
  name: prometheus-manifests-kdnmf
  generateName: prometheus-manifests-
  namespace: demo
  selfLink: /api/v1/namespaces/demo/pods/prometheus-manifests-kdnmf
  uid: a57ef224-9729-44c0-a591-c885bb1695e7
  resourceVersion: '15873'
  creationTimestamp: '2020-02-12T20:56:56Z'
  labels:
    olm.catalogSource: prometheus-manifests
  ownerReferences:
    - apiVersion: operators.coreos.com/v1alpha1
      kind: CatalogSource
      name: prometheus-manifests
      uid: 0b863564-d8c0-471a-b3c7-63c1c1433153
      controller: false
      blockOwnerDeletion: false
spec:
  volumes:
    - name: default-token-rzx22
      secret:
        secretName: default-token-rzx22
        defaultMode: 420
  containers:
    - name: registry-server
      image: 'quay.io/cmoulliard/olm-index:0.1.0'
      ports:
        - name: grpc
          containerPort: 50051
          protocol: TCP
      resources:
        limits:
          cpu: 100m
          memory: 100Mi
        requests:
          cpu: 10m
          memory: 50Mi
      volumeMounts:
        - name: default-token-rzx22
          readOnly: true
          mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      livenessProbe:
        exec:
          command:
            - grpc_health_probe
            - '-addr=localhost:50051'
        initialDelaySeconds: 10
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      readinessProbe:
        exec:
          command:
            - grpc_health_probe
            - '-addr=localhost:50051'
        initialDelaySeconds: 5
        timeoutSeconds: 1
        periodSeconds: 10
        successThreshold: 1
        failureThreshold: 3
      terminationMessagePath: /dev/termination-log
      terminationMessagePolicy: File
      imagePullPolicy: IfNotPresent
  restartPolicy: Always
  terminationGracePeriodSeconds: 30
  dnsPolicy: ClusterFirst
  nodeSelector:
    beta.kubernetes.io/os: linux
  serviceAccountName: default
  serviceAccount: default
  nodeName: k8s-115
  securityContext: {}
  schedulerName: default-scheduler
  tolerations:
    - operator: Exists
  priority: 0
  enableServiceLinks: true
status:
  phase: Running

If I ssh to the vm running the cluster, I can use the grpcurl tool

[root@k8s-115 ~]# grpcurl -plaintext 10.107.184.132:50051 list api.Registry
api.Registry.GetBundle
api.Registry.GetBundleForChannel
api.Registry.GetBundleThatReplaces
api.Registry.GetChannelEntriesThatProvide
api.Registry.GetChannelEntriesThatReplace
api.Registry.GetDefaultBundleThatProvides
api.Registry.GetLatestChannelEntriesThatProvide
api.Registry.GetPackage
api.Registry.ListPackages

but ListPackages returns nothing:

 grpcurl -plaintext 10.107.184.132:50051 api.Registry.ListPackages
[root@k8s-115 ~]# 

Additional info

kubernetes cluster: 1.15
olm version: 0.14.1
image index: quay.io/cmoulliard/olm-index:0.1.0
operator-registry: master

cmoulliard avatar Feb 12 '20 21:02 cmoulliard

Just to try to summarize:

You have a catalog deployed into a namespace different from the one the PackageServer is deployed into. The PackageServer is dialing prometheus-manifests.demo.svc at IP 10.109.228.141, whereas the Service is set up on IP 10.107.184.132 and works just fine when you hit it with grpcurl.

So it appears the PackageServer is trying to reach the pod directly rather than through the Service. If the catalog were running in the same namespace as the PackageServer, as is the default, that would work.
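
The address mismatch can be checked mechanically by pulling the dialed endpoint out of the PackageServer warning and comparing it with the Service's clusterIP. A rough sketch in Python; the function names are made up for illustration, and the regular expression assumes the `dial tcp <ip>:<port>` wording seen in the log in this issue:

```python
import re

# Matches the "dial tcp 10.109.228.141:50051" fragment of the gRPC warning.
DIAL_RE = re.compile(r"dial tcp (\d{1,3}(?:\.\d{1,3}){3}):(\d+)")


def dialed_endpoint(log_line):
    """Return (ip, port) the client tried to dial, or None if not found."""
    match = DIAL_RE.search(log_line)
    if match is None:
        return None
    return match.group(1), int(match.group(2))


def matches_service(log_line, cluster_ip, port):
    """True if the dialed endpoint equals the Service's clusterIP and port."""
    return dialed_endpoint(log_line) == (cluster_ip, port)
```

Applied to the warning in this issue, `dialed_endpoint` gives `("10.109.228.141", 50051)`, and `matches_service(line, "10.107.184.132", 50051)` is false, which is the discrepancy described above.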

Does the Catalog Operator correctly access the catalog?

I wonder if you could set up the PackageServer in the same namespace, and whether that would solve your problem. I don't know if that would be an official solution, but it might be a workaround.

flickerfly avatar Mar 04 '20 19:03 flickerfly

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar May 03 '20 20:05 stale[bot]

seeing the same issue on the OKD4 cluster I just installed today following https://medium.com/@craig_robinson/openshift-4-4-okd-bare-metal-install-on-vmware-home-lab-6841ce2d37eb

W0504 17:39:03.726845 1 clientconn.go:1120] grpc: addrConn.createTransport failed to connect to {community-operators.openshift-marketplace.svc:50051 0 }. Err :connection error: desc = "transport: Error while dialing dial tcp 172.30.183.136:50051: i/o timeout". Reconnecting...

gmarcy avatar May 04 '20 22:05 gmarcy

noticed this in case it helps

$ kubectl -n openshift-marketplace get event
...
14m    Warning   Unhealthy   pod/community-operators-5b7f9bb9bf-b2v9v   Readiness probe failed: timeout: failed to connect service "localhost:50051" within 1s
105s   Warning   Unhealthy   pod/community-operators-5b7f9bb9bf-b2v9v   Liveness probe failed: timeout: failed to connect service "localhost:50051" within 1s
15m    Warning   Unhealthy   pod/community-operators-5b7f9bb9bf-b2v9v   Readiness probe failed: command timed out
14m    Warning   Unhealthy   pod/community-operators-5b7f9bb9bf-b2v9v   Liveness probe failed: command timed out
...

gmarcy avatar May 05 '20 00:05 gmarcy

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jul 04 '20 01:07 stale[bot]

Seeing the same issue, with the readiness and liveness probes failing to contact localhost for the pods.

Kampe avatar Jul 15 '20 05:07 Kampe

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Sep 13 '20 07:09 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Nov 13 '20 10:11 stale[bot]

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

stale[bot] avatar Jan 12 '21 12:01 stale[bot]

Any update on the above issue? I'm facing the same issue on a Mac M1 with a go-grpc Consul setup.

ghost avatar Mar 11 '21 15:03 ghost

+1 here. Added the operatorhub.io CatalogSource in the openshift-marketplace namespace inside an OCP cluster v4.18.11.

Startup probe failed: timeout: failed to connect to service ":50051" within 1s

The service name is missing. Is there a specific name to use?

Thank you for your help

rectacoda avatar May 14 '25 13:05 rectacoda