Ceph multisite realm pull fails but works if run manually
Is this a bug report or feature request?
- Bug Report
Deviation from expected behavior: When adding a pull endpoint to the realm resource it fails to pull the realm
Expected behavior: It successfully pulls the realm
How to reproduce it (minimal and precise):
- An older Rook Ceph cluster set up at one data centre as the master, ready for multisite replication: operator image `rook/ceph:v1.2.4`, Ceph image `ceph/ceph:v14.2.7-20200206`.
- A new Rook Ceph cluster at a different data centre: operator image `rook/ceph:v1.4.5`, Ceph image `ceph/ceph:v14.2.11-20200819`.
- Both Kubernetes clusters use Istio gateways to route traffic to the RADOS gateways.
- A secret `wilxite-keys` created containing the access and secret key for the user `wilxite-system-user` on the master cluster.
File(s) to submit:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectRealm
metadata:
  name: wilxite
  namespace: rook-ceph
spec:
  pull:
    endpoint: https://rgw.masterdomain.com
```
Operator log error:

```
2020-10-14 02:16:37.477756 E | ceph-object-realm-controller: failed to reconcile: realm pull failed for reason: . request failed: (2202) Unknown error 2202: exit status 154
```
Running the command manually from the toolbox succeeds:

```console
$ radosgw-admin realm pull --url=https://rgw.masterdomain.com --access-key=<access-key> --secret=<secret-key>
{
    "id": "4471102a-e708-4715-a3bc-40b7129ebd9d",
    "name": "wilxite",
    "current_period": "1d49d488-5c2b-496d-ab36-5daf8ac940bb",
    "epoch": 2
}
```
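For context, after a manual realm pull the usual follow-up (per the standard Ceph multisite setup flow) is roughly the following; the zone name `tango` and the endpoint URL are from my setup:

```console
# Make the pulled realm the default, create the secondary zone with an
# externally reachable endpoint, then commit a new period.
radosgw-admin realm default --rgw-realm=wilxite
radosgw-admin zone create --rgw-zonegroup=wilxite --rgw-zone=tango \
    --access-key=<access-key> --secret=<secret-key> \
    --endpoints=https://rgw.secondarydomain.com
radosgw-admin period update --commit
```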
Environment:
- OS (e.g. from /etc/os-release): CentOS 7
- Kernel (e.g. `uname -a`): 3.10.0-1127.19.1.el7.x86_64
- Kubernetes version (use `kubectl version`): v1.19.2
- Kubernetes cluster type (e.g. Tectonic, GKE, OpenShift): on-premise, set up with kubeadm
What I don't understand is how the operator can complete this operation at all: when I set up multisite manually I had to add an endpoint to the zone on the secondary cluster, but there doesn't seem to be anywhere to specify this in the Rook resources. I'm guessing the operator just uses the internal Kubernetes service IP or service name, but that isn't accessible from the master zone. Or does that not matter?
I also tried this with Ceph v15 on the new cluster, but got the same result.
I am now a little bit stuck: after setting multisite up manually and removing the pull endpoint from the realm, the CephObjectStore has a failed status, so I can't update it.
```yaml
status:
  bucketStatus:
    details: 'failed to create object user "rook-ceph-internal-s3-user-checker-dcd9a10e-33d2-4b81-914b-d9489e054995".
      error code 1 for object store "tango": failed to create s3 user: exit status
      22'
    health: Failure
    lastChecked: "2020-10-14T02:52:16Z"
  info:
    endpoint: http://rook-ceph-rgw-tango.rook-ceph.svc:80
  phase: Failure
```
Update
When I added the realm pull endpoint back after pulling the realm manually, the operator added the internal Kubernetes IP to the zone endpoints. This showed up on the master when running `radosgw-admin zonegroup get`.
This broke replication: running `radosgw-admin sync status` on the master returned this error:
```
2020-10-14 03:36:14.280 7f7d1c9726c0 0 data sync zone:7bc93d77 ERROR: failed to fetch datalog info
data sync source: 7bc93d77-ed1c-4c3b-965f-6e3f137af69b (tango)
failed to retrieve sync info: (5) Input/output error
```
I then ran these commands on the secondary cluster to reset the zone to the proper values, and all is good again:

```console
radosgw-admin zone modify --rgw-zone=tango --rgw-realm=wilxite --rgw-zonegroup=wilxite --access-key=<access-key> --secret=<secret-key> --endpoints https://rgw.secondarydomain.com
radosgw-admin period update --commit
```
I believe this would all be fixed if an extra field were added to the CephObjectZone resource for manually setting the secondary zone endpoint.
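As a sketch of what I mean — the `customEndpoints` field below is hypothetical, not an existing Rook API, and the pool settings a real CephObjectZone requires are omitted for brevity:

```yaml
apiVersion: ceph.rook.io/v1
kind: CephObjectZone
metadata:
  name: tango
  namespace: rook-ceph
spec:
  zoneGroup: wilxite
  # Hypothetical field: externally reachable endpoints the operator would
  # register for this zone instead of the internal ClusterIP.
  customEndpoints:
    - https://rgw.secondarydomain.com
```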
Please let me know if there's any more info you need as I would love to have this working properly and to stop having to set things up manually.
Thanks
This issue has been automatically marked as stale because it has not had recent activity. It will be closed in a week if no further activity occurs. Thank you for your contributions.
This issue has been automatically closed due to inactivity. Please re-open if this still requires investigation.
I ran into the exact same issue mentioned above. The operator is using the internal ClusterIP for k8s services as Zone and Zone Group endpoint values. This won't work unless you've made the cluster services CIDR network routable and accessible between clusters. If I manually fix the zone and zone group endpoints to externally accessible URLs then everything starts working.
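For reference, my manual fix looks roughly like this (zone and zone group names and the URL are from my setup; adjust for yours):

```console
# Point both the zone and the zone group at the externally reachable URL,
# then commit a new period so the change propagates to the other site.
radosgw-admin zone modify --rgw-zone=tango --rgw-zonegroup=wilxite \
    --endpoints=https://rgw.secondarydomain.com
radosgw-admin zonegroup modify --rgw-zonegroup=wilxite \
    --endpoints=https://rgw.secondarydomain.com
radosgw-admin period update --commit
```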
@alimaredia can you take a look?
> If I manually fix the zone and zone group endpoints to externally accessible URLs then everything starts working.
Only until you restart the operator, then the operator adds the internal endpoint again:
"name": "s3-secondary",
"endpoints": [
"https://s3.secondary.example.com",
"http://10.244.86.61:80"
],
and the sync is broken again with `failed to retrieve sync info: (5) Input/output error`.
The original bug is against Rook v1.2. Can we at least verify that this is present on the currently supported Rook versions, v1.6 or above?
Running v1.7.1 here and can confirm this is still an issue for us.
We are now on v1.7.1 as well and yep same issue
Thanks @bt-lemery and @sazzle2611. We've been discussing how to best approach the fix for this issue.
There is an example of this failure in this CI run:
rgw-multisite-testing-realm-pull-error.zip
[update] I'll leave this here, but we don't believe this is showing the same issue.
Discussed this a bit with @alimaredia. If there's a need to expose zone endpoints, I'd recommend pointing the zone endpoints at the object store's load balancer and exposing that instead.
Then the object store is free to add/remove RGWs without having to modify the multisite configuration each time.
Running on v1.7.5 and also having the same issue. It would be nice if we could edit the service, or modify it in the CephObjectStore spec to use `type: LoadBalancer`.
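Something like this hypothetical Service could do it; the selector labels are assumptions based on Rook's usual naming for the `tango` store, and the target port must match whatever port the object store's gateway is configured with:

```yaml
# Hypothetical Service exposing the RGW pods via a cloud load balancer.
apiVersion: v1
kind: Service
metadata:
  name: rook-ceph-rgw-tango-external
  namespace: rook-ceph
spec:
  type: LoadBalancer
  selector:
    app: rook-ceph-rgw          # assumed label
    rook_object_store: tango    # assumed label
  ports:
    - name: http
      port: 80
      targetPort: 80            # assumed gateway port
```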