
feat: Add package registry to eck

tehbooom opened this pull request 4 months ago • 29 comments

Resolves #8925

Adding the Elastic Package Registry (EPR) to ECK has been a highly requested feature.

EPR does not have any resource references (such as an elasticsearchRef), since it does not require a license or a connection to any other application.

The following was implemented for EPR:

  • Defaults to TLS
  • Sets the default container image to docker.elastic.co/package-registry/distribution
  • Users can set their own images
  • Users can update the config following the reference
  • Kibana can reference the EPR in the same way as Elasticsearch and Enterprise Search
  • If Kibana references EPR and TLS is enabled, the controller populates xpack.fleet.registryUrl and sets the environment variable NODE_EXTRA_CA_CERTS to the path of EPR's CA, which is mounted into the Kibana Pod
  • If a user provides their own NODE_EXTRA_CA_CERTS with a mount, the controller combines the certs, appending EPR's CA to the user-specified CA (see the sketch after this list)
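
For illustration, here is a minimal sketch of the cert-combining behaviour described in the last bullet. The function name combineCABundles is hypothetical and not the controller's actual code; it only shows the idea of concatenating the two PEM bundles into the single file that NODE_EXTRA_CA_CERTS points at:

// combineCABundles illustrates the idea: the EPR CA is appended to the
// user-provided bundle so that a single NODE_EXTRA_CA_CERTS file trusts both.
func combineCABundles(userBundle, eprCA []byte) []byte {
	combined := make([]byte, 0, len(userBundle)+len(eprCA)+1)
	combined = append(combined, userBundle...)
	// Make sure the user bundle ends with a newline before the next PEM block.
	if len(combined) > 0 && combined[len(combined)-1] != '\n' {
		combined = append(combined, '\n')
	}
	return append(combined, eprCA...)
}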

This was tested, with and without setting NODE_EXTRA_CA_CERTS, using the manifest below:

apiVersion: epr.k8s.elastic.co/v1alpha1
kind: ElasticPackageRegistry
metadata:
  name: registry
spec:
  version: 9.1.2
  count: 1
  podTemplate:
    spec:
      containers:
      - name: package-registry
        image: docker.elastic.co/package-registry/distribution:lite-9.1.2
---
apiVersion: elasticsearch.k8s.elastic.co/v1
kind: Elasticsearch
metadata:
  name: elasticsearch
spec:
  version: 9.1.2
  nodeSets:
  - name: default
    count: 1
---
apiVersion: kibana.k8s.elastic.co/v1
kind: Kibana
metadata:
  name: kibana
spec:
  version: 9.1.2
  count: 1
  elasticsearchRef:
    name: elasticsearch
  packageRegistryRef:
    name: registry
  config:
    telemetry.optIn: false
    xpack.fleet.isAirGapped: true
    xpack.fleet.agents.elasticsearch.hosts: ["https://elasticsearch-es-http.default.svc:9200"]
    xpack.fleet.agents.fleet_server.hosts: ["https://fleet-server-agent-http.default.svc:8220"]
    xpack.fleet.packages:
      - name: system
        version: latest
      - name: elastic_agent
        version: latest
      - name: fleet_server
        version: latest
    xpack.fleet.agentPolicies:
      - name: Fleet Server on ECK policy
        id: eck-fleet-server
        namespace: default
        monitoring_enabled:
          - logs
          - metrics
        unenroll_timeout: 900
        package_policies:
        - name: fleet_server-1
          id: fleet_server-1
          package:
            name: fleet_server
  podTemplate:
    spec:
      containers:
      - name: kibana
        env:
        - name: NODE_EXTRA_CA_CERTS
          value: /custom/user/ca-bundle.crt
        volumeMounts:
        - name: custom-ca
          mountPath: /custom/user
          readOnly: true
      volumes:
      - name: custom-ca
        secret:
          secretName: user-custom-ca-secret
---
apiVersion: v1
kind: Secret
metadata:
  name: user-custom-ca-secret
  namespace: default
type: Opaque
data:
  ca-bundle.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0tLS0tCk1JSUZtVENDQTRHZ0F3SUJBZ0lVYjVrK2d6V3A5YjljWTV4bkhUcWZNdHFHUXIwd0RRWUpLb1pJaHZjTkFRRUwKQlFBd1hERUxNQWtHQTFVRUJoTUNXRmd4RlRBVEJnTlZCQWNNREVSbFptRjFiSFFnUTJsMGVURWNNQm9HQTFVRQpDZ3dUUkdWbVlYVnNkQ0JEYjIxd1lXNTVJRXgwWkRFWU1CWUdBMVVFQXd3UGRHVnpkQzVsYkdGemRHbGpMbU52Ck1CNFhEVEkxTURneU1ERTRNakl3T0ZvWERUTTFNRGd4T0RFNE1qSXdPRm93WERFTE1Ba0dBMVVFQmhNQ1dGZ3gKRlRBVEJnTlZCQWNNREVSbFptRjFiSFFnUTJsMGVURWNNQm9HQTFVRUNnd1RSR1ZtWVhWc2RDQkRiMjF3WVc1NQpJRXgwWkRFWU1CWUdBMVVFQXd3UGRHVnpkQzVsYkdGemRHbGpMbU52TUlJQ0lqQU5CZ2txaGtpRzl3MEJBUUVGCkFBT0NBZzhBTUlJQ0NnS0NBZ0VBMHljTGVySWR3LzdpbGlKMzVBUEZ4bUx6TFRnNWRhUStWSUttS2lNbStlTTYKanJOY3lnbGphNVFEbHYvMStGUm5hamhrRTBobHoycXEzTjk0U1pYN3M2eHBnQUVzMGVQQ3VaZVBNU2VUYlYyRgp0YlIxNnFuM0JjenVxN3laOXZwdHR3MmJRdkJkY3JzZFU4T2RYUWhGNFd4QUFwODRKYWlMNmkzMlA2K2VPODBwCmh3Z1kwS0F1bzZoZC8zaFpNME14M2MwRmJmU0JHaTUyOHZKODYzUDRXZlEwMWdtUUxVbGl0UlhhTUhiaDRXSm0KOU45c0psUXpnbkNuQjZ6YkZjZ2gweWxrakd0UzBIZEo3eSs3dmE0Q1BqdkxlWGpwTnZuQzRjTmlocnp4Wmw5bQphM0ZVdVpiU0lRekE2ZFlkdkdrT2V3OTJEek1BaTdldU14UDdyYVhRejZmc1N6U1V4N1RjQWl5M2E5VU9Fdi9rCk5NV3VTbDlUMHRRSkhJSzJMc0t0MlVKWVVHWk4wOWU2SUVSTlJOL0FIUjVDbTlhcVQ1Q2ZyQW9JVVhNdUg2S1oKN1JCZFFockRxL2xEQk54bWs5dW44V2lic0NSVnkvVXRJQ3lOSytxbGpGUWZEd01hNkRkd3BjcnpnTWZnU3RTawpLek1LRUJla2N0Q0Q4dHNmTjZYem5USmNBYUJETzFlQWZyT0Z2NG1PTXJqVG90OEYvK3pxN0dXNTlqWTRvdFhMCkY3TnpadFl0eWsvbDRvb2hUZUFuM1ptd1BDMGJFQ1FkTmpTVkZ6ZXJCamE4ZjhacGpKRzNjUllyVmh6YUNsRWMKRU5wbFRHcldVaUVwRDdnTnNlNWNDSnZpQU12NHdwait2QTVVNlA3Z0MxUUtKV2hWS3BVYWcvTmtTSUFCRmtrQwpBd0VBQWFOVE1GRXdIUVlEVlIwT0JCWUVGTWdldEVJajZtRWdsZURGNkVNdUY4NXVnYzdZTUI4R0ExVWRJd1FZCk1CYUFGTWdldEVJajZtRWdsZURGNkVNdUY4NXVnYzdZTUE4R0ExVWRFd0VCL3dRRk1BTUJBZjh3RFFZSktvWkkKaHZjTkFRRUxCUUFEZ2dJQkFEOFU3dm1yWmhHTUZiV2YzRDZlNy84TUwzWEhLRk5TNy9UeWF3U2tvdGVSTVdFbgp1RWhQK2dmbkdUT2ZITFlQeHl5eEJ4U041T29sZHRJclo5dnhBc2dlYWJzSkJaenhQVHpxU09VN3h3b09LcTlRCmdKRUYxL0ZmemFlR1V5dVE2S1ZaZ0QvZ1JPSW42Ri9OUGlzM1pvbUpPOStuVWdTTnNiUm9RYmdPUGdPV3Q3Z1gKVEhuOHJpdUp2OXRPNFBRN09Sa3pubDJYbERlcE9xNVpwSUtkcVl0Rm5MUjF3SllyREZESmt0Q3h6MzFob0FrZwpSVjlSU1BSMFFxZ1JQeFNpNGpXdkNGUk5XTUFJc0NadGJsWExRRUljWGI1YnlsWXV2a3psTTJ4dHlHK3FaRFhMCnFoZDVNeFZIUkpqTzE1VEdpZXFRcUpMVkZyVElhTHFoaXZpQ1pUbDJoVkYxVlpPVG05MU5aeE53M25RL3JyeDgKK2VQV2xTWlZKWXc3SDRkWkx5WTFjRUxLT0YrZDJybVNSZ2pWaHZycUZ3R1M3MUQzYkV4Y0dSakNrOHNQWEZyRwpsOFRzY05RMXBPSGVuNlJhOFhVdGtxU1doZllFb3owZjBEem4wYmt4c2VWaCttS1BHV3QxcHdlemVFTFVwaHE3CmwwSVRLeis1b1lqYWVHTDRia25kcWlpemwzWkc2N0lYL3VyR0dQVUxkLzU1NEtRMFFPMS92S3Y2dE1YMWc0dVMKWHdWc0pzQjlrTUIwRFFxbDhRYmg0UEJ2ZW9RRTZvL3BycXRtWjR1RWdDMCt1cm5paDlCY1FweFNKOUljR1kxTQpBQzRBcG5Pem1CYTFhUVBMcDRaRFIxQXpFK1hXWDd2WWNWYUxleUJxRzRja3dwbUtOUnhpcnJjS2NaMkYKLS0tLS1FTkQgQ0VSVElGSUNBVEUtLS0tLQo=
---
apiVersion: agent.k8s.elastic.co/v1alpha1
kind: Agent
metadata:
  name: fleet-server
spec:
  version: 9.1.2
  kibanaRef:
    name: kibana
  elasticsearchRefs:
  - name: elasticsearch
  mode: fleet
  fleetServerEnabled: true
  policyID: eck-fleet-server
  deployment:
    replicas: 1
    podTemplate:
      spec:
        serviceAccountName: fleet-server
        automountServiceAccountToken: true
        resources:
          requests:
            cpu: 200m
            memory: 1Gi
          limits:
            cpu: 1
            memory: 2Gi
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: fleet-server
  namespace: default
rules:
- apiGroups: [""]
  resources:
  - pods
  - namespaces
  - nodes
  verbs:
  - get
  - watch
  - list
- apiGroups: ["apps"]
  resources:
    - replicasets
  verbs:
    - get
    - watch
    - list
- apiGroups: ["batch"]
  resources:
    - jobs
  verbs:
    - get
    - watch
    - list
- apiGroups: ["coordination.k8s.io"]
  resources:
  - leases
  verbs:
  - get
  - create
  - update
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: fleet-server
  namespace: default
---
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: fleet-server
  namespace: default
subjects:
- kind: ServiceAccount
  name: fleet-server
  namespace: default
roleRef:
  kind: ClusterRole
  name: fleet-server
  apiGroup: rbac.authorization.k8s.io

tehbooom avatar Aug 20 '25 20:08 tehbooom

:white_check_mark: Snyk checks have passed. No issues have been found so far.

Status | Scanner | Critical | High | Medium | Low | Total
:white_check_mark: | Open Source Security | 0 | 0 | 0 | 0 | 0 issues
:white_check_mark: | Licenses | 0 | 0 | 0 | 0 | 0 issues


prodsecmachine avatar Aug 20 '25 20:08 prodsecmachine

🔍 Preview links for changed docs

github-actions[bot] avatar Aug 20 '25 20:08 github-actions[bot]

For some reason the container does not seem to listen on port 8080 and is killed by the kubelet. I have the same problem when I try to run the e2e test TestElasticPackageRegistryStandalone:

Containers:
  package-registry:
    Container ID:   containerd://964dc18c8a1b461f7bba941dfc688af4375ce6e00a247f3326cd614622b74b89
    Image:          docker.elastic.co/package-registry/distribution:9.0.5
    Image ID:       docker.elastic.co/package-registry/distribution@sha256:15edf005ee2cb3a9611e1dae535c506134de98f6e90eaaa8419a3161b4c7b858
    Port:           8080/TCP
    Host Port:      0/TCP
    State:          Running
      Started:      Tue, 26 Aug 2025 12:41:15 +0200
    Last State:     Terminated
      Reason:       Error
      Exit Code:    2
      Started:      Tue, 26 Aug 2025 12:38:55 +0200
      Finished:     Tue, 26 Aug 2025 12:41:15 +0200
    Ready:          False
    Restart Count:  1
[...]
  Normal   Killing    2m3s                  kubelet            Container package-registry failed startup probe, will be restarted
  Warning  Unhealthy  3s (x4 over 2m23s)    kubelet            Startup probe failed: Get "https://10.31.86.16:8080/health": dial tcp 10.31.86.16:8080: connect: connection refused

barkbay avatar Aug 26 '25 10:08 barkbay

@barkbay it takes a long time (several minutes) for the EPR to start. Can it be your issue?

jeanfabrice avatar Aug 26 '25 12:08 jeanfabrice

@barkbay it takes a long time (several minutes) for the EPR to start. Can it be your issue?

I'm using the default startup probe set by the controller in this PR:

    startupProbe:
      failureThreshold: 3
      httpGet:
        path: /health
        port: 8080
        scheme: HTTPS
      initialDelaySeconds: 120
      periodSeconds: 10
      successThreshold: 1
      timeoutSeconds: 5

I can try to increase or remove it.

barkbay avatar Aug 26 '25 12:08 barkbay

@barkbay it takes a long time (several minutes) for the EPR to start. Can it be your issue?

Good catch @jeanfabrice! It took something like 5 minutes for the packages to be "loaded"; the default probe is probably a bit too optimistic, so the container is killed prematurely.

barkbay avatar Aug 26 '25 12:08 barkbay

@barkbay it takes a long time (several minutes) for the EPR to start. Can it be your issue?

Good catch @jeanfabrice! It took something like 5 minutes for the packages to be "loaded"; the default probe is probably a bit too optimistic, so the container is killed prematurely.

It 100% is the startup probe time. I had a lot of issues with this and at one point had it set at 10 minutes. I think 7 minutes might be a good in between? The problem arises when a user runs the production image, which is around 10+ GB and takes upwards of 10 minutes to start.

tehbooom avatar Aug 26 '25 12:08 tehbooom

@barkbay With the startup taking anywhere from 2 to 10+ minutes, do we want to keep the initial delay at 120 seconds but increase the failure threshold and timeoutSeconds to account for the 10+ minutes?

Something like this, where the maximum total wait is 10 minutes (120 s initial delay + 16 failures × 30 s timeout = 600 seconds):

// startupProbe is the startup probe for the packageregistry container
func startupProbe(useTLS bool) corev1.Probe {
	scheme := corev1.URISchemeHTTP
	if useTLS {
		scheme = corev1.URISchemeHTTPS
	}
	return corev1.Probe{
		FailureThreshold:    16,
		InitialDelaySeconds: 120,
		PeriodSeconds:       10,
		SuccessThreshold:    1,
		TimeoutSeconds:      30,
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Port:   intstr.FromInt(HTTPPort),
				Path:   "/health",
				Scheme: scheme,
			},
		},
	}
}

tehbooom avatar Aug 26 '25 13:08 tehbooom

It 100% is the startup probe time. I had a lot of issues with this and at one point had it set at 10 minutes. I think 7 minutes might be a good in between?

I think my question would be "why do we need a startup probe"? Maybe a readiness probe is enough?

barkbay avatar Aug 26 '25 13:08 barkbay

It 100% is the startup probe time. I had a lot of issues with this and at one point had it set at 10 minutes. I think 7 minutes might be a good in between?

I think my question would be "why do we need a startup probe"? Maybe a readiness probe is enough?

Ahh yes, I guess the application has already started at that point, so a startup probe doesn't make sense here.
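
For reference, a minimal sketch of what a readiness-probe-only default could look like, assuming the same HTTPPort constant and TLS handling as the startup probe above (the thresholds here are illustrative, not the values from this PR):

// readinessProbe is an illustrative readiness probe for the package-registry
// container: the Pod simply stays NotReady while packages load, and the
// kubelet never restarts the container for failing it.
func readinessProbe(useTLS bool) corev1.Probe {
	scheme := corev1.URISchemeHTTP
	if useTLS {
		scheme = corev1.URISchemeHTTPS
	}
	return corev1.Probe{
		FailureThreshold:    3,
		InitialDelaySeconds: 30,
		PeriodSeconds:       10,
		SuccessThreshold:    1,
		TimeoutSeconds:      5,
		ProbeHandler: corev1.ProbeHandler{
			HTTPGet: &corev1.HTTPGetAction{
				Port:   intstr.FromInt(HTTPPort),
				Path:   "/health",
				Scheme: scheme,
			},
		},
	}
}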

tehbooom avatar Aug 26 '25 13:08 tehbooom

@tehbooom 👋 please let me know when you need another review, thanks!

barkbay avatar Sep 08 '25 06:09 barkbay

@tehbooom 👋 please let me know when you need another review, thanks!

@barkbay

Added EPR to ECK diagnostics here.

I still need to update our documentation. I know we changed how we do documentation, so if you could point me in the right direction to where the ECK docs live, that would be great. This PR is ready for another review. Thanks!

tehbooom avatar Sep 08 '25 13:09 tehbooom

I still need to update our documentation. I know we changed how we do documentation, so if you could point me in the right direction to where the ECK docs live, that would be great. This PR is ready for another review. Thanks!

Documentation has moved here: https://github.com/elastic/docs-content. Please note that the main branch of that repo is published immediately, so do not merge the doc PR until the feature has been released.

I'll try to take another look at your PR this week; I'm struggling to keep up with the pace of PRs opened in this repo 😅

barkbay avatar Sep 15 '25 06:09 barkbay

buildkite test this -f p=gke,E2E_TAGS=epr

barkbay avatar Sep 16 '25 09:09 barkbay

buildkite test this -f p=gke,E2E_TAGS=epr

barkbay avatar Sep 16 '25 10:09 barkbay

E2E tests are still failing with:

  Warning  Failed               12m   kubelet            Failed to pull image "docker.elastic.co/package-registry/distribution:9.1.2": failed to pull and unpack image "docker.elastic.co/package-registry/distribution:9.1.2": failed to extract layer sha256:57bbe197467b8b19ba0705f05ee41860ff3bb44020ed5986df96bfed4614e630: write /var/lib/containerd/io.containerd.snapshotter.v1.overlayfs/snapshots/396/fs/packages/package-storage/security_detection_engine-8.6.8.zip: no space left on device

We also have to find a solution for this. IIUC https://github.com/elastic/package-registry/pull/1335 would be the way to go? I'll also try to check if we can increase the disk size on GKE nodes, but that would still mean we may have to skip this test on other providers (AWS, Azure, Kind...).

IIUC the image currently requires ~14Gi:

docker.elastic.co/package-registry/distribution                                       9.1.2                                             d127b26dc3000       13.8GB

barkbay avatar Sep 16 '25 10:09 barkbay

buildkite test this -f p=gke,E2E_TAGS=epr

pebrc avatar Sep 25 '25 10:09 pebrc

buildkite test this -f p=gke,E2E_TAGS=epr

tehbooom avatar Oct 16 '25 12:10 tehbooom

Main blocker to merge this imo is the lack of UBI images for the package registry.

This blocker has been addressed in https://github.com/elastic/package-registry/pull/1451, which is now merged.

naemono avatar Nov 05 '25 14:11 naemono

Main blocker to merge this imo is the lack of UBI images for the package registry.

This blocker has been addressed in elastic/package-registry#1451, which is now merged.

In this PR we are using the Package Registry distribution images. To support UBI there we would also need to update https://github.com/elastic/package-storage-infra/blob/13bf4e9ba03c028b16ed37772cd0d1afaa45af4f/.buildkite/scripts/build_distributions.sh.

jsoriano avatar Nov 05 '25 17:11 jsoriano

Main blocker to merge this imo is the lack of UBI images for the package registry.

This blocker has been addressed in elastic/package-registry#1451, which is now merged.

In this PR we are using the Package Registry distribution images. To support UBI there we would also need to update https://github.com/elastic/package-storage-infra/blob/13bf4e9ba03c028b16ed37772cd0d1afaa45af4f/.buildkite/scripts/build_distributions.sh.

The beginnings of the PR needed for this are here, @jsoriano.

naemono avatar Nov 13 '25 19:11 naemono

Definitely seeing some issues testing on OpenShift, but it doesn't seem like it's OCP-specific:

{"log.level":"error","@timestamp":"2025-11-20T17:44:02.025Z","log.logger":"manager.eck-operator","message":"Reconciler error","service.version":"9.3.0-SNAPSHOT+","service.type":"eck","ecs.version":"1.4.0","controller":"packageregistry-controller","object":{"name":"registry","namespace":"elastic"},"namespace":"elastic","name":"registry","reconcileID":"8a30e396-91e2-4a86-a9b3-79368a4032a6","error":"services \"registry-epr-http\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>","errorCauses":[{"error":"services \"registry-epr-http\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>"}],"error.stack_trace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/root/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:474\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/root/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:421\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func1.1\n\t/root/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:296"}

naemono avatar Nov 20 '25 19:11 naemono

buildkite test this -f p=gke,E2E_TAGS=epr

naemono avatar Nov 20 '25 20:11 naemono

Definitely seeing some issues testing on OpenShift, but it doesn't seem like it's OCP-specific:

{"log.level":"error","@timestamp":"2025-11-20T17:44:02.025Z","log.logger":"manager.eck-operator","message":"Reconciler error","service.version":"9.3.0-SNAPSHOT+","service.type":"eck","ecs.version":"1.4.0","controller":"packageregistry-controller","object":{"name":"registry","namespace":"elastic"},"namespace":"elastic","name":"registry","reconcileID":"8a30e396-91e2-4a86-a9b3-79368a4032a6","error":"services \"registry-epr-http\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>","errorCauses":[{"error":"services \"registry-epr-http\" is forbidden: cannot set blockOwnerDeletion if an ownerReference refers to a resource you can't set finalizers on: , <nil>"}],"error.stack_trace":"sigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).reconcileHandler\n\t/root/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:474\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).processNextWorkItem\n\t/root/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:421\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller[...]).Start.func1.1\n\t/root/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:296"}

Nope, it's ocp specific: https://github.com/elastic/cloud-on-k8s/pull/8800/commits/49b1e56493bcb567a61bbeec229598edfba2c6b3. (was missing packageregistries/finalizers RBAC permissions)
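
For context, the OwnerReferencesPermissionEnforcement admission plugin (enabled on OpenShift) only allows setting blockOwnerDeletion on an ownerReference if the caller can update the owner's finalizers subresource. Below is a hedged sketch of the kind of rule that was missing, expressed as a PolicyRule for illustration; the resource plural and API group are inferred from the error and the manifest above, not copied from the actual fix:

import rbacv1 "k8s.io/api/rbac/v1"

// Hypothetical sketch: allow the operator to update the finalizers
// subresource of the EPR custom resource, so that owned objects (such as the
// registry-epr-http Service) can carry blockOwnerDeletion: true.
var eprFinalizersRule = rbacv1.PolicyRule{
	APIGroups: []string{"epr.k8s.elastic.co"},
	Resources: []string{"packageregistries/finalizers"},
	Verbs:     []string{"update"},
}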

naemono avatar Nov 20 '25 20:11 naemono

Nope, it's ocp specific: 49b1e56. (was missing packageregistries/finalizers RBAC permissions)

And more fun on ocp:

{"log.level":"info","@timestamp":"2025-11-20T20:49:24.212Z","log.logger":"manager.eck-operator","message":"would violate PodSecurity \"restricted:latest\": runAsNonRoot != true (pod or container \"package-registry\" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container \"package-registry\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")","service.version":"3.3.0-rc1-SNAPSHOT+","service.type":"eck","ecs.version":"1.4.0","controller":"packageregistry-controller","object":{"name":"registry","namespace":"elastic"},"namespace":"elastic","name":"registry","reconcileID":"2521d32a-cdc5-4c36-ba90-64011f78d67b"}

naemono avatar Nov 20 '25 20:11 naemono

And more fun on ocp:

And we don't set runAsUser/runAsGroup for any CRD either:

containers[0].runAsUser: Invalid value: 1000: must be in the ranges: [1000730000, 1000739999]

I'm fixing all of these issues....
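
For illustration, a minimal sketch of a container securityContext that satisfies both constraints: the restricted PodSecurity profile wants runAsNonRoot and a RuntimeDefault seccomp profile, and OpenShift's restricted-v2 SCC wants runAsUser/runAsGroup left unset so a UID from the namespace's allowed range can be assigned. This is a sketch of the general approach, not the exact change in this PR:

// defaultSecurityContext is a hypothetical default compatible with the
// "restricted" PodSecurity profile and OpenShift's restricted-v2 SCC.
// runAsUser/runAsGroup are intentionally left unset.
func defaultSecurityContext() *corev1.SecurityContext {
	runAsNonRoot := true
	allowPrivilegeEscalation := false
	return &corev1.SecurityContext{
		RunAsNonRoot:             &runAsNonRoot,
		AllowPrivilegeEscalation: &allowPrivilegeEscalation,
		SeccompProfile: &corev1.SeccompProfile{
			Type: corev1.SeccompProfileTypeRuntimeDefault,
		},
		Capabilities: &corev1.Capabilities{
			Drop: []corev1.Capability{"ALL"},
		},
	}
}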

naemono avatar Nov 20 '25 22:11 naemono

Nope, it's ocp specific: 49b1e56. (was missing packageregistries/finalizers RBAC permissions)

And more fun on ocp:

{"log.level":"info","@timestamp":"2025-11-20T20:49:24.212Z","log.logger":"manager.eck-operator","message":"would violate PodSecurity \"restricted:latest\": runAsNonRoot != true (pod or container \"package-registry\" must set securityContext.runAsNonRoot=true), seccompProfile (pod or container \"package-registry\" must set securityContext.seccompProfile.type to \"RuntimeDefault\" or \"Localhost\")","service.version":"3.3.0-rc1-SNAPSHOT+","service.type":"eck","ecs.version":"1.4.0","controller":"packageregistry-controller","object":{"name":"registry","namespace":"elastic"},"namespace":"elastic","name":"registry","reconcileID":"2521d32a-cdc5-4c36-ba90-64011f78d67b"}

All of the issues running in OpenShift/OCP-style clusters have been resolved and verified. I'm waiting to verify the UBI images specifically once they are built and pushed, and then this should be getting closer to a mergeable state.

naemono avatar Nov 24 '25 16:11 naemono

All of the issues running in OpenShift/OCP-style clusters have been resolved and verified. I'm waiting to verify the UBI images specifically once they are built and pushed, and then this should be getting closer to a mergeable state.

The UBI images seem to run without issue:

                  openshift.io/scc: restricted-v2
                  packageregistry.k8s.elastic.co/config-hash: 2422330696
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
                  security.openshift.io/validated-scc-subject-type: user
Status:           Running
SeccompProfile:   RuntimeDefault
IP:               10.129.2.98
IPs:
  IP:           10.129.2.98
Controlled By:  ReplicaSet/registry-epr-858f669ff
Containers:
  package-registry:
    Container ID:    cri-o://a1e3ce5cf092d7b636a9d24b08ef6bd2d93e45685dcb3d01b4a6bf872a51db79
    Image:           docker.elastic.co/package-registry/distribution:lite-ubi

I think the one final change is to ensure that in an OCP environment we use the UBI images by default. This seems to differ from the standard stack images, which are UBI-based by default from 9.x forward. I'll make the changes and verify.

naemono avatar Nov 24 '25 17:11 naemono

All of the issues running in OpenShift/OCP-style clusters have been resolved and verified. I'm waiting to verify the UBI images specifically once they are built and pushed, and then this should be getting closer to a mergeable state.

The UBI images seem to run without issue:

                  openshift.io/scc: restricted-v2
                  packageregistry.k8s.elastic.co/config-hash: 2422330696
                  seccomp.security.alpha.kubernetes.io/pod: runtime/default
                  security.openshift.io/validated-scc-subject-type: user
Status:           Running
SeccompProfile:   RuntimeDefault
IP:               10.129.2.98
IPs:
  IP:           10.129.2.98
Controlled By:  ReplicaSet/registry-epr-858f669ff
Containers:
  package-registry:
    Container ID:    cri-o://a1e3ce5cf092d7b636a9d24b08ef6bd2d93e45685dcb3d01b4a6bf872a51db79
    Image:           docker.elastic.co/package-registry/distribution:lite-ubi

I think the one final change is to ensure that in an OCP environment we use the UBI images by default. This seems to differ from the standard stack images, which are UBI-based by default from 9.x forward. I'll make the changes and verify.

The suffix should handle this when --ubi-only is set. I believe this is how we normally handle this in other controllers.

https://github.com/elastic/cloud-on-k8s/pull/8800/files#diff-52e0749d4ea9659ff8934fe1491cc88fc5508988f026b1ca8a0704e3a75da924R107-R111
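
For illustration, a minimal sketch of the suffix idea, with hypothetical names and an assumed tag scheme (the distribution images publish UBI variants such as lite-ubi, so the exact mapping may differ from this):

// eprImage is a hypothetical helper: when the operator runs with --ubi-only,
// a "-ubi" suffix is appended to the image tag, e.g. "lite-9.1.2" -> "lite-9.1.2-ubi".
func eprImage(repo, tag string, ubiOnly bool) string {
	if ubiOnly {
		tag += "-ubi"
	}
	return repo + ":" + tag
}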

naemono avatar Dec 08 '25 02:12 naemono

Edit: Never mind, it's because I used short-lived certificates to test certificate rotation, and it seems that rotation requires the Pod to be recreated. @tehbooom Could you confirm that certificates are not hot-reloaded?

@barkbay I believe that the certificates are not hot-reloaded; quoting the documentation:

The NODE_EXTRA_CA_CERTS environment variable is only read when the Node.js process is first launched.

Looking also at the code, it seems to me that the env var and the contents of the path (if the former is set) are read only once (code link).

pkoutsovasilis avatar Dec 16 '25 13:12 pkoutsovasilis