
Ceph multisite realm pull fails but works if run manually

Open sazzle2611 opened this issue 5 years ago • 13 comments

Is this a bug report or feature request?

  • Bug Report

Deviation from expected behavior: When adding a pull endpoint to the realm resource it fails to pull the realm

Expected behavior: It successfully pulls the realm

How to reproduce it (minimal and precise):

  • An older Rook Ceph cluster at one data centre acting as the master, ready for multisite replication, with operator image rook/ceph:v1.2.4 and ceph image ceph/ceph:v14.2.7-20200206
  • A new Rook Ceph cluster at a different data centre with operator image rook/ceph:v1.4.5 and ceph image ceph/ceph:v14.2.11-20200819
  • Both Kubernetes clusters use Istio gateways to route traffic to the rados gateways
  • A secret wilxite-keys created containing the access and secret key for the user wilxite-system-user on the master cluster

File(s) to submit:

apiVersion: ceph.rook.io/v1
kind: CephObjectRealm
metadata:
  name: wilxite
  namespace: rook-ceph
spec:
  pull:
    endpoint: https://rgw.masterdomain.com
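
For completeness, the wilxite-keys secret referenced above would be created along these lines. This is a hedged sketch: the access-key/secret-key field names follow the Rook multisite docs, but verify them against the docs for your Rook version.

```yaml
# Sketch of the pull-credentials secret; field names assumed from Rook docs
apiVersion: v1
kind: Secret
metadata:
  name: wilxite-keys
  namespace: rook-ceph
stringData:
  access-key: <access-key for wilxite-system-user>
  secret-key: <secret-key for wilxite-system-user>
```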

Operator log error

2020-10-14 02:16:37.477756 E | ceph-object-realm-controller: failed to reconcile: realm pull failed for reason: . request failed: (2202) Unknown error 2202: exit status 154

Running the command manually from the toolbox

radosgw-admin realm pull --url=https://rgw.masterdomain.com --access-key=<access-key> --secret=<secret-key>
{
    "id": "4471102a-e708-4715-a3bc-40b7129ebd9d",
    "name": "wilxite",
    "current_period": "1d49d488-5c2b-496d-ab36-5daf8ac940bb",
    "epoch": 2
}

Environment:

  • OS (e.g. from /etc/os-release): Centos 7
  • Kernel (e.g. uname -a): 3.10.0-1127.19.1.el7.x86_64
  • Kubernetes version (use kubectl version): v1.19.2
  • Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): On-premise set up with kubeadm

What I don't understand is how the operator can complete this operation at all: when I set up multisite manually I had to add the endpoint to the zone of the secondary cluster, and there doesn't seem to be anywhere to add this. I'm guessing the operator just uses the internal Kubernetes service IP or service name, but that isn't accessible from the master zone, or doesn't that matter?

I also tried this with Ceph v15 on the new cluster, but got the same result.

I am now a little stuck: after setting multisite up manually and removing the pull endpoint from the realm, the cephobjectstore has a failed status, so I can't update it

status:
  bucketStatus:
    details: 'failed to create object user "rook-ceph-internal-s3-user-checker-dcd9a10e-33d2-4b81-914b-d9489e054995".
      error code 1 for object store "tango": failed to create s3 user: exit status
      22'
    health: Failure
    lastChecked: "2020-10-14T02:52:16Z"
  info:
    endpoint: http://rook-ceph-rgw-tango.rook-ceph.svc:80
  phase: Failure

sazzle2611 avatar Oct 14 '20 03:10 sazzle2611

Update

When I added the realm pull endpoint back after pulling the realm manually, the operator added the internal Kubernetes IP to the zone endpoints. This showed up on the master when running 'radosgw-admin zonegroup get'.

This broke replication; running 'radosgw-admin sync status' on the master returned this error

2020-10-14 03:36:14.280 7f7d1c9726c0  0 data sync zone:7bc93d77 ERROR: failed to fetch datalog info
      data sync source: 7bc93d77-ed1c-4c3b-965f-6e3f137af69b (tango)
                        failed to retrieve sync info: (5) Input/output error

I then ran these commands on the secondary cluster to reset the zone to the proper values, and all is good again

radosgw-admin zone modify --rgw-zone=tango --rgw-realm=wilxite --rgw-zonegroup=wilxite --access-key=<access-key> --secret=<secret-key> --endpoints https://rgw.secondarydomain.com
radosgw-admin period update --commit

I believe this would all be fixed by adding an extra field to the CephObjectZone resource for manually setting the secondary zone's endpoint.
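
As a sketch of what that could look like: the customEndpoints field name below is purely hypothetical (it does not exist in the Rook versions discussed here), and the pool specs are elided placeholders.

```yaml
# Hypothetical CephObjectZone with an explicit external endpoint;
# customEndpoints is an illustrative name, not an existing Rook field
apiVersion: ceph.rook.io/v1
kind: CephObjectZone
metadata:
  name: tango
  namespace: rook-ceph
spec:
  zoneGroup: wilxite
  metadataPool: ...
  dataPool: ...
  # Endpoint(s) the operator would register for this zone instead of
  # the internal Kubernetes service IP
  customEndpoints:
    - https://rgw.secondarydomain.com
```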

Please let me know if there's any more info you need as I would love to have this working properly and to stop having to set things up manually.

Thanks

sazzle2611 avatar Oct 14 '20 04:10 sazzle2611

This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.

github-actions[bot] avatar Jan 26 '21 20:01 github-actions[bot]

This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.

github-actions[bot] avatar Feb 03 '21 20:02 github-actions[bot]

I ran into the exact same issue mentioned above. The operator is using the internal ClusterIP for k8s services as Zone and Zone Group endpoint values. This won't work unless you've made the cluster services CIDR network routable and accessible between clusters. If I manually fix the zone and zone group endpoints to externally accessible URLs then everything starts working.

walkamongus avatar Jun 30 '21 18:06 walkamongus

@alimaredia can you take a look?

travisn avatar Jun 30 '21 18:06 travisn

If I manually fix the zone and zone group endpoints to externally accessible URLs then everything starts working.

Only until you restart the operator, then the operator adds the internal endpoint again:

            "name": "s3-secondary",
            "endpoints": [
                "https://s3.secondary.example.com",
                "http://10.244.86.61:80"
            ],  

and the sync is broken again with 'failed to retrieve sync info: (5) Input/output error'

mwennrich avatar Sep 21 '21 09:09 mwennrich

The original bug is against Rook v1.2. Can we at least verify that this is present on the currently supported Rook versions, v1.6 or above?

BlaineEXE avatar Sep 24 '21 17:09 BlaineEXE

Running v1.7.1 here and can confirm this is still an issue for us.

bt-lemery avatar Sep 28 '21 22:09 bt-lemery

We are now on v1.7.1 as well and yep same issue

sazzle2611 avatar Sep 29 '21 23:09 sazzle2611

Thanks @bt-lemery and @sazzle2611. We've been discussing how to best approach the fix for this issue.

BlaineEXE avatar Sep 30 '21 14:09 BlaineEXE

There is an example of this failure in this CI run:

rgw-multisite-testing-realm-pull-error.zip

[update] I'll leave this here, but we don't believe this is showing the same issue.

BlaineEXE avatar Oct 14 '21 17:10 BlaineEXE

Discussed this a bit with @alimaredia. If there's a need to expose zone endpoints, I'd recommend pointing zone endpoints at the object store's load balancer and exposing that instead.

Then the object store is free to add/remove rgws without having to modify the multisite configuration each time.

cbodley avatar Oct 28 '21 20:10 cbodley

Running v1.7.5, we also have the same issue. It would be nice if we could edit the service, or modify it in the CephObjectStore spec, to use type: LoadBalancer.
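
In the meantime, a manually created LoadBalancer Service in front of the RGW pods can serve as a workaround. This is a hedged sketch: the selector labels (app=rook-ceph-rgw, rook_object_store=<store name>) follow Rook's labelling conventions but should be verified with 'kubectl get pods --show-labels' on your cluster, and the ports must match your store's gateway port.

```yaml
# Sketch of an external LoadBalancer Service for the "tango" RGWs;
# selector labels and ports are assumptions, verify on your cluster
apiVersion: v1
kind: Service
metadata:
  name: rook-ceph-rgw-tango-external
  namespace: rook-ceph
spec:
  type: LoadBalancer
  selector:
    app: rook-ceph-rgw
    rook_object_store: tango
  ports:
    - name: http
      port: 80
      targetPort: 80  # match the port in the CephObjectStore gateway spec
```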

sebedh avatar Nov 18 '21 13:11 sebedh