
Scale out noobaa-core StatefulSet gracefully to a fully-functional state

ron1 opened this issue · 6 comments

Environment: OCP 3.11, Rook Ceph 1.1.6 block provisioner, NooBaa 2.0.8-SNAPSHOT including PR https://github.com/noobaa/noobaa-operator/pull/146

Pre-req / steps to reproduce:

  1. Configure two pv-pool BackingStores with one volume each, wrapped in a Mirror BucketClass.
  2. Define a StorageClass for this BucketClass and create an OBC.
  3. Use rclone to copy 1000 files into the OBC bucket.
  4. Scale out the noobaa-core StatefulSet from 1 pod to 2 pods and wait for the 2nd pod to become Ready.
  5. Use rclone to attempt to delete the files from the bucket (see the command sketch below).
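For reference, steps 3-5 roughly map to the commands below (a minimal sketch; the noobaa namespace, the nb rclone remote, and the $BUCKET variable are illustrative assumptions, with nb configured against the OBC's S3 endpoint and credentials):

    # Copy 1000 test files into the OBC bucket
    # ("nb" is an assumed rclone remote, $BUCKET is the OBC bucket name)
    rclone copy ./testfiles nb:$BUCKET/test3-dir

    # Scale the core StatefulSet from 1 to 2 pods and wait for the rollout
    kubectl -n noobaa scale statefulset noobaa-core --replicas=2
    kubectl -n noobaa rollout status statefulset noobaa-core

    # Attempt the delete that triggers the AccessDenied errors
    rclone delete nb:$BUCKET/test3-dir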

Expected Result: Files are successfully deleted from the bucket.

Actual Result: Many AccessDenied errors like the one below are generated.

Note: As a follow-up, scale the noobaa-core StatefulSet back to 1 pod and wait for the added pod to terminate. After that, rclone deletes the files from the bucket successfully.

Nov-16 8:48:49.892 [Endpoint/86] [ERROR] core.endpoint.s3.s3_rest:: S3 ERROR <?xml version="1.0" encoding="UTF-8"?><Error><Code>AccessDenied</Code><Message>Access Denied</Message><Resource>/cks-dev1-694b62ed-9bb5-44a7-9168-7e57613b7ffb?delimiter=%2F&amp;max-keys=1000&amp;prefix=test3-dir%2Fconf%2FCatalina%2F</Resource><RequestId>k31bxext-26iskt-etz</RequestId></Error> GET /cks-dev1-694b62ed-9bb5-44a7-9168-7e57613b7ffb?delimiter=%2F&max-keys=1000&prefix=test3-dir%2Fconf%2FCatalina%2F {"user-agent":"rclone/v1.50.1","authorization":"AWS4-HMAC-SHA256 Credential=yAMzs0pm5zn2bFXuMOe3/20191116/us-east-1/s3/aws4_request, SignedHeaders=host;x-amz-content-sha256;x-amz-date, Signature=037cf537dc33546bf7bc501ddb217f837a26f18e12b6135a00e6a71c88c50cce","x-amz-content-sha256":"e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855","x-amz-date":"20191116T084849Z","accept-encoding":"gzip","host":"internal-s3-noobaa-1937458161.us-east-1.elb.amazonaws.com","x-forwarded-host":"internal-s3-noobaa-1937458161.us-east-1.elb.amazonaws.com","x-forwarded-port":"443","x-forwarded-proto":"https","forwarded":"for=10.113.245.89;host=internal-s3-noobaa-1937458161.us-east-1.elb.amazonaws.com;proto=https;proto-version=","x-forwarded-for":"10.113.245.89"} Error: account not found
    at RpcRequest._set_response (/root/node_modules/noobaa-core/src/rpc/rpc_request.js:163:26)
    at RPC._on_response (/root/node_modules/noobaa-core/src/rpc/rpc.js:412:32)
    at RPC._on_message (/root/node_modules/noobaa-core/src/rpc/rpc.js:748:22)
    at RpcWsConnection.conn.on.msg (/root/node_modules/noobaa-core/src/rpc/rpc.js:583:40)
    at RpcWsConnection.emit (events.js:198:13)
    at RpcWsConnection.EventEmitter.emit (domain.js:448:20)
    at WebSocket.ws.on (/root/node_modules/noobaa-core/src/rpc/rpc_ws.js:44:53)
    at WebSocket.emit (events.js:198:13)
    at WebSocket.EventEmitter.emit (domain.js:448:20)
    at Receiver.receiverOnMessage (/root/node_modules/noobaa-core/node_modules/ws/lib/websocket.js:800:20)
    at Receiver.emit (events.js:198:13)
    at Receiver.EventEmitter.emit (domain.js:448:20)
    at Receiver.dataMessage (/root/node_modules/noobaa-core/node_modules/ws/lib/receiver.js:413:14)
    at Receiver.getData (/root/node_modules/noobaa-core/node_modules/ws/lib/receiver.js:352:17)
    at Receiver.startLoop (/root/node_modules/noobaa-core/node_modules/ws/lib/receiver.js:138:22)
    at Receiver._write (/root/node_modules/noobaa-core/node_modules/ws/lib/receiver.js:74:10)

ron1 · Nov 17 '19

Hi @ron1

We don't scale out by increasing the core replicas; doing so results in two different systems serving under the same route.

The first level of scale-out is designed to work by increasing the number of endpoints. However, operating it currently is not so straightforward: every pv-pool pod also serves as an S3 endpoint.
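For example, one way to see which pods are currently serving S3 (assuming the default noobaa namespace and the s3 Service created by the operator):

    # Pod IPs behind the S3 Service; with pv-pool pods acting as endpoints,
    # their IPs appear here alongside the core pod
    kubectl -n noobaa get endpoints s3 -o wide
    kubectl -n noobaa get pods -o wide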

This is why we are working on #58 and also on splitting the endpoints into their own deployment.

I will try to put together a doc explaining the current scale-out status and instructions.

In the meantime, avoid changing the replicas of the noobaa-core StatefulSet at all costs, and use only the pv-pool replicas to add more storage and endpoints.
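A hedged sketch of that, assuming the pv-pool is an existing BackingStore CR (here called pool-a) whose spec carries a pvPool.numVolumes field; the name and field path are assumptions and may differ by version:

    # Add a volume/pod to the pv-pool backing store instead of touching
    # the noobaa-core StatefulSet (field path is an assumption)
    kubectl -n noobaa patch backingstore pool-a --type merge \
      -p '{"spec":{"pvPool":{"numVolumes":2}}}'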

In addition, you can scale up by adding more resources to the noobaa-core pod, which lets it use more CPU/memory for serving clients.
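A sketch of that, assuming the NooBaa CR is named noobaa and exposes a coreResources field for the core pod's requests/limits (the field name is an assumption and may differ by operator version):

    # Give the core pod more CPU/memory via the NooBaa CR
    # (spec.coreResources is assumed and may differ by version)
    kubectl -n noobaa patch noobaa noobaa --type merge -p '
      {"spec":{"coreResources":{
        "requests":{"cpu":"2","memory":"8Gi"},
        "limits":{"cpu":"4","memory":"16Gi"}}}}'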

I know this might not be what you expected, but we can sync on the design plans and get your feedback as we develop.

Apologies for the confusion we caused.

guymguym · Nov 17 '19

@guymguym Thanks for the clarification. Do you have ideas about configuring MongoDB with NooBaa v2 such that it is not a single point of failure in the deployment? Also, do you have plans beyond https://github.com/noobaa/noobaa-operator/pull/58 to provide an HA and scale-out MongoDB instance with support for replicasets, sharding, etc.? As you know, this type of MongoDB deployment is not trivial to manage. For this functionality, will you integrate something like the FOSS Percona MongoDB Kubernetes Operator into NooBaa or maybe direct folks with enterprise-grade requirements to deploy with the MongoDB Enterprise Kubernetes Operator?

ron1 · Nov 17 '19

Hey @ron1

For now, the high-availability model for the noobaa-core DB is based on two things:

  1. A PV that is highly available and can recover from failures - in our case Ceph RBD (see the sketch after this list).
  2. The Kubernetes scheduler, which recovers from node/pod failures by rescheduling the pod on a healthy node.
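A minimal sketch of item 1, assuming a Ceph RBD StorageClass named rook-ceph-block and a dbStorageClass field on the NooBaa CR (both names are assumptions, and the DB storage class is normally chosen at install time):

    # Back the DB PV with Ceph RBD so the data survives node failures
    # (dbStorageClass and the StorageClass name are assumptions)
    kubectl -n noobaa patch noobaa noobaa --type merge \
      -p '{"spec":{"dbStorageClass":"rook-ceph-block"}}'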

Currently we have two roadmap items regarding DB scaling:

  1. Single-cluster - use sharding for scaling out within a Kubernetes cluster.
  2. Multi-cluster - use an active-active architecture (based on both replicaset and locality-sharding) as suggested here: https://www.mongodb.com/blog/post/active-active-application-architectures-with-mongodb.

If we can leverage other operators to achieve those, that would be nice, but I suspect we might need more control.

Let me know if this covers the use cases you have planned, or whether we should consider other solutions.

Thanks

guymguym · Nov 18 '19

Hi @guymguym

I believe the HA semantics offered by the Kubernetes scheduler vary depending on the condition. In the case where the node hosting the noobaa-core pod fails, for example, I think it could be multiple minutes before the pod is successfully rescheduled on another node. While this should be an infrequent occurrence, a multi-minute read/write outage is nevertheless problematic for us.

I am also concerned about reduced availability resulting from simple noobaa-core pod restarts due to node OS upgrades, node OCP upgrades, NooBaa Core version upgrades, etc. How would you compare the HA features offered by MongoDB ReplicaSets to those offered by the Kubernetes scheduler with regard to StatefulSet-managed, scheduled, and unscheduled noobaa-core pod restarts?

I think Ceph RGW sets a very high standard for an HA object store. For OCS-like deployments that put NooBaa in front of RGW, I would think it would be important for customers to understand the HA implications of such an architecture. Would it be possible to benchmark noobaa-core failover performance using Ceph RBD for the db-volume? Maybe this would confirm that my failover concerns are only relevant for edge cases.
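As a rough way to measure that failover window in a test environment (a sketch, assuming the default noobaa namespace and the pod name noobaa-core-0):

    # Delete the core pod and time how long until a replacement is Ready;
    # for a true node-failure test, cordon/drain the node instead
    kubectl -n noobaa delete pod noobaa-core-0 --wait=false
    time kubectl -n noobaa wait --for=condition=Ready pod/noobaa-core-0 --timeout=15m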

Since the VM-based NooBaa docs state "NooBaa Core uses MongoDB but keep in mind there is no HA in the free distribution...The Enterprise Edition will offer a pay-per-use pricing model and provide a super resilient and available NooBaa Core engine", and since NooBaa is now open-sourced, I just assumed the Kubernetes NooBaa Core image simply inherited the VM-based HA "Enterprise" feature set, which I presume leverages MongoDB Replica Sets. I also assumed that the team had just not gotten around to introducing a coreReplicas property to the NooBaa CRD to manage the noobaa-core StatefulSet replica count.

Finally, when sharding is introduced in the future for scaling a single cluster and supporting multiple clusters, am I correct that the replicasets in this architecture will inherently address my failover concerns?

ron1 · Nov 18 '19

Hi @ron1, you are correct that the current offering is not HA in the way we are used to. Indeed, before we moved from VM-based deployments to the k8s-based ones, we had a more traditional HA solution. It is on our roadmap for future versions. As plans currently stand, we have the DR use case coming in the version after the next.

nimrod-becker · Nov 18 '19

Hi @nimrod-becker

Thanks for getting back to me.

ron1 · Nov 18 '19