Prisma Horizontal Scaling

Open mcmar opened this issue 7 years ago • 57 comments

Describe the bug: When I try to scale Prisma horizontally by adding a second server, the first server logs:

Obtaining exclusive agent lock...
Obtaining exclusive agent lock... Successful.

then the second server logs just this and crashes:

Obtaining exclusive agent lock...

The reason appears to be this line of code: https://github.com/prismagraphql/prisma/blob/d5c97fe8f1c1ee223ec1392ebdf16f2545b2f763/server/servers/deploy/src/main/scala/com/prisma/deploy/migration/migrator/DeploymentSchedulerActor.scala#L52

It seems that Prisma is explicitly ensuring that there's only ever 1 cluster/server (prisma terminology changes) that can run against a DB.

To Reproduce Steps to reproduce the behavior:

  1. Initialize Prisma server from Prisma AWS Fargate template (https://github.com/prismagraphql/prisma-templates/tree/master/aws)
  2. Go to ECS cluster (name will be what you used during CF stack creation)
  3. Click on Prisma service
  4. Click Update in top-right corner
  5. Increase Number of tasks from 1 to 2
  6. Click Next, Next, Next, and Done

Expected behavior: 2 Prisma instances would run against 1 DB

Screenshots: None

Versions (please complete the following information):

  • OS: Irrelevant, it's in AWS
  • prisma CLI: prisma/1.11.1 (darwin-x64) node-v8.11.1
  • Prisma Server: 1.11.0 (per Cloudformation template)

Additional context: Already reported in the Slack channel. Was told to create a bug here. @divyenduz

mcmar avatar Aug 01 '18 16:08 mcmar

+1, encountering this problem as well

terijyu avatar Aug 01 '18 21:08 terijyu

Hey @mcmar ,

thanks for bringing this up. This is an area where we are lacking documentation. The Prisma server can be started with the management API either enabled or disabled. If the management API is enabled, the server will try to acquire the agent lock on startup. This ensures that only one Prisma server at a time writes to the management tables. Your second server also has the management API enabled, hence you are seeing this log message.

The management API can be enabled or disabled in the Prisma server config, e.g.:

port: 60000
managementApiSecret: my-secret
rabbitUri: amqp://my-rabbitmq-server
enableManagementApi: true|false
databases:
  default:
    connector: mysql
    ...
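
To make the difference between the two server roles concrete, here is a minimal sketch that renders the config above once per role. It only reuses the keys shown in the example; the helper name `render_config` and its parameters are hypothetical, and the database settings stay elided as in the example.

```python
# Sketch only: builds a config string per server role from the keys shown above.
# `render_config` and its parameters are hypothetical helpers, not part of Prisma.

def render_config(enable_management_api: bool,
                  rabbit_uri: str = "amqp://my-rabbitmq-server",
                  secret: str = "my-secret",
                  port: int = 60000) -> str:
    """Return a YAML-formatted server config for one role."""
    flag = "true" if enable_management_api else "false"
    lines = [
        f"port: {port}",
        f"managementApiSecret: {secret}",
        f"rabbitUri: {rabbit_uri}",
        # Only one server should set this to true, since that server
        # takes the exclusive agent lock on startup.
        f"enableManagementApi: {flag}",
        "databases:",
        "  default:",
        "    connector: mysql",
        # Further database settings are elided here, as in the example above.
    ]
    return "\n".join(lines) + "\n"


primary_config = render_config(enable_management_api=True)
secondary_config = render_config(enable_management_api=False)
```

With this sketch the two configs differ in a single line, which makes it easy to check that exactly one server has the management API enabled.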

In Prisma Cloud we are running Prisma like this for horizontal scalability:

  • Run exactly one Prisma server with the Management API enabled. Additionally run multiple Prisma servers with the Management API disabled. Internally we call those server types primary and secondary.
  • Your load balancer in front of the servers must be set up like this:
    • All requests to /management must be routed to the primary server.
    • All other requests may be routed to any of the servers (primary + secondary).
  • In addition to this you will also need a RabbitMQ server, which we use for PubSub. We need it to publish change events about the data to all servers so that they can notify connected subscriptions over WebSocket (see the config entry rabbitUri above). If you don't want to run a RabbitMQ server on your own, I recommend CloudAMQP as a hosting provider.
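
The routing rules above can be sketched roughly as follows. This is an illustration of the rule, not a load balancer config: the `Router` class, the round-robin choice for non-management traffic, and the backend names are all assumptions made for the example.

```python
from itertools import cycle
from typing import Sequence


class Router:
    """Hypothetical router that applies the two rules from the list above."""

    def __init__(self, primary: str, secondaries: Sequence[str]) -> None:
        self.primary = primary
        # Non-management traffic may go to any server, primary included.
        # Round-robin is an assumption; any balancing strategy would do.
        self._pool = cycle([primary, *secondaries])

    def pick(self, path: str) -> str:
        """Return the backend that should serve the given request path."""
        if path == "/management" or path.startswith("/management/"):
            # Management requests must always reach the primary.
            return self.primary
        return next(self._pool)


router = Router("primary", ["secondary-1", "secondary-2"])
management_backend = router.pick("/management")
```

Whatever load balancer is used, the important part is the first branch: a /management request that lands on a secondary has no management API to answer it.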

Does that help?

mavilein avatar Aug 02 '18 09:08 mavilein

in addition to this you will also need a RabbitMQ server, which we use for PubSub

Uh, is that a requirement or is it only necessary in case you use subscriptions? @mavilein Could you clarify that please?

emmenko avatar Aug 02 '18 15:08 emmenko

@emmenko : Right now it is required. We need RabbitMQ for these reasons:

  1. as PubSub to power subscriptions over Websockets
  2. to store ServerSideSubscriptions/Webhooks in a queue. This allows us to eventually deliver Webhooks even if the server goes down due to a crash or reboot.
  3. to propagate information about schema changes to all servers. We need this as each server holds a cache of the schema of each service. Not having a cache would mean adding significant latency to query execution.

I guess you are fine with point 1. Point 2 is a tradeoff you need to decide for yourself in your use case. Point 3 is currently a blocker.

If we can find a solution for point 3 we could repackage the Prisma server without the RabbitMQ dependency. We could enable this through a separate Docker image or configuration flag.
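
To illustrate why point 3 needs some kind of broadcast, here is a minimal sketch of schema cache invalidation over a publish/subscribe channel. It is not Prisma's implementation: `InMemoryBus` stands in for the message broker, a plain dict stands in for the database, and the topic name `schema-changed` is made up for the example.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryBus:
    """Stand-in for the broker: delivers each message to every subscriber."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> None:
        for handler in list(self._subscribers[topic]):
            handler(message)


class Server:
    """Hypothetical server that caches one schema per service."""

    def __init__(self, bus: InMemoryBus, schema_store: Dict[str, str]) -> None:
        self._store = schema_store  # shared dict standing in for the database
        self._cache: Dict[str, str] = {}
        bus.subscribe("schema-changed", self._invalidate)

    def _invalidate(self, service: str) -> None:
        # Drop the cached schema so the next read goes back to the store.
        self._cache.pop(service, None)

    def schema(self, service: str) -> str:
        if service not in self._cache:
            self._cache[service] = self._store[service]
        return self._cache[service]


store = {"blog": "schema-v1"}
bus = InMemoryBus()
server_a, server_b = Server(bus, store), Server(bus, store)
server_b.schema("blog")                  # server_b now caches schema-v1
store["blog"] = "schema-v2"              # a deploy changes the schema
stale = server_b.schema("blog")          # still the cached value
bus.publish("schema-changed", "blog")    # every server drops its cache entry
fresh = server_b.schema("blog")          # re-read from the store
```

Without the publish step, `server_b` would keep serving the old schema until it restarts, which is the situation the broker is there to prevent.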

mavilein avatar Aug 02 '18 16:08 mavilein

I see. So if I want multiple replicas I also need to have a rabbitmq cluster on the side.

Do you plan to make the pubsub system configurable or is rabbitmq the only option? For example, I’m running my services on GCP and it would be easier to use google pubsub.

Thanks anyway for the explanation! 🙏

emmenko avatar Aug 02 '18 17:08 emmenko

@mavilein Is Prisma using AMQP 0.9.1 or 1.0? 1.0 will work with Apache ActiveMQ and Amazon MQ, which would make my life much easier. 0.9.1 would require me to spin up my own RabbitMQ service with its own ELB.

mcmar avatar Aug 02 '18 19:08 mcmar

@emmenko : RabbitMQ is currently the only option, but we have encapsulated our pubsub code into a neat interface. We could provide additional implementations for e.g. google pubsub. I just added a Feature request for this.

@mcmar : We are using the RabbitMQ Java client, so I think this is 0.9.1 then.

@mcmar @emmenko : Would you be happier if we supported Apache Kafka? We are considering adding it for another feature anyway.

mavilein avatar Aug 03 '18 08:08 mavilein

Thanks. For now I'm trying the stable/rabbitmq Helm chart. I think it should be fine. In our specific case, we run our services on K8s on Google Cloud, so for us an integration with Google PubSub would be perfect so that we don't have to manage that on our own 😉

emmenko avatar Aug 03 '18 08:08 emmenko

@mavilein I'm trying to deploy prisma with 1 primary and 2 secondary. The primary has the managementApi enabled, the secondaries do not.

However, after starting, the primary and one of the secondaries keep crashing with the error

Fatal error during deployment worker initialization: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://single-server/user/$b#-844349262]] after [300000 ms]. Sender[null] sent message of type "com.prisma.deploy.migration.migrator.DeploymentProtocol$Initialize$".

I noticed that all 3 prisma containers are trying to "obtain the agent lock". From what you wrote before I thought that only the primary is supposed to get the lock, or should all of them do it? In that case, any idea why I am still getting errors? 🤔

I'm using prisma:1.13.4.

emmenko avatar Aug 03 '18 08:08 emmenko

Now the primary and one of the secondaries are running but the 2nd secondary keeps crashing (no errors in the logs, only 👇)

Obtaining exclusive agent lock...
Initializing workers...
Successfully started 1 workers.
Server running on :4466
Version is up to date.

Am I doing something wrong? 🤔

emmenko avatar Aug 03 '18 09:08 emmenko

Tried also with 1 primary and 1 secondary. After a couple of min, the secondary crashed with the same error.

emmenko avatar Aug 03 '18 09:08 emmenko

I have a feeling that the management API is still enabled in both. I checked and I'm passing enableManagementApi: true to the primary and enableManagementApi: false to the secondary.

emmenko avatar Aug 03 '18 10:08 emmenko

In case it helps, here are the logs for the pods (for the timings)

$ kubectl get pods -w | grep prisma
prisma-primary-c6f64d69d-ckkbn          2/2       Running   0          3m
prisma-secondary-75b857b766-xx8jz       2/2       Running   0          3m
prisma-secondary-75b857b766-xx8jz   1/2       Error     0         6m
prisma-secondary-75b857b766-xx8jz   1/2       Running   1         6m
prisma-secondary-75b857b766-xx8jz   2/2       Running   1         8m

And here the logs for the primary

Obtaining exclusive agent lock...
Obtaining exclusive agent lock... Successful.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
Deployment worker initialization complete.
Initializing workers...
Successfully started 1 workers.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
Server running on :4466
Version is up to date.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.

and here for the secondary

Obtaining exclusive agent lock...
Initializing workers...
Successfully started 1 workers.
Server running on :4466
Version is up to date.
Fatal error during deployment worker initialization: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://single-server/user/$b#-1603503801]] after [300000 ms]. Sender[null] sent message of type "com.prisma.deploy.migration.migrator.DeploymentProtocol$Initialize$".

emmenko avatar Aug 03 '18 10:08 emmenko

I noticed that when I access http://localhost:4466/management on both containers I get the GraphQL playground. Is this supposed to work even if enableManagementApi is set to false?

emmenko avatar Aug 03 '18 10:08 emmenko

@emmenko : Oh my bad. I forgot to say that you need to use the prisma-prod image. Only this one contains the necessary ifs. We should really improve this experience. Thanks for continuing to dig 👍

mavilein avatar Aug 03 '18 10:08 mavilein

I forgot to say that you need to use prisma-prod image

Ooooh, thanks! 😇

I'll try that right away.

Btw, I'm happy to contribute to the documentation feedback with the experience I had so far. Let me know in case you need that 😉

emmenko avatar Aug 03 '18 10:08 emmenko

Works! 🙌

$ kubectl get pods -w | grep prisma
prisma-rabbitmq-0         1/1       Running   0          4h
prisma-rabbitmq-1         1/1       Running   0          4h
prisma-rabbitmq-2         1/1       Running   0          4h
prisma-primary-56d5699664-sn2sj         2/2       Running   0          35m
prisma-secondary-cbb5c87b8-c8qdn        2/2       Running   0          27m
prisma-secondary-cbb5c87b8-z8zgh        2/2       Running   0          23m

emmenko avatar Aug 03 '18 11:08 emmenko

@emmenko : Nice. 🎉 I have opened an issue to unify our 2 Docker images as they do not seem necessary to me and just cause confusion. Happy to come back to you to get your feedback on the docs when we have a first version ready! 🙏

mavilein avatar Aug 03 '18 11:08 mavilein

@mavilein I'm attempting to implement the pattern you described in which you route /management to prisma-primary and all other routes * to prisma-primary and prisma-secondary. I'm unable to implement that pattern in AWS using Application Load Balancers because ECS services can only register themselves in one target group. I'm working off of the fargate.yml template in the prisma-templates repo. How does prisma host their own servers in AWS? Do you use ECS? Do you have 2 separate prisma services? Do you use Application Load Balancers as opposed to Classic Load Balancers? I can't find a way to implement it using ECS with ALBs.

Here's the issue in ECS: https://github.com/aws/amazon-ecs-agent/issues/1351#issuecomment-412377706

mcmar avatar Aug 12 '18 23:08 mcmar

Hey @emmenko if you have a fairly generic kubernetes template for prisma with support for horizontal scaling, would you mind posting it or submitting a PR against https://github.com/prismagraphql/prisma-templates ? The current docs only show the single-server setup. I'm currently working on adding a second Cloudformation template for horizontal scaling. It'd be good to get something in there for Kubernetes too. I thought it'd be cool if we could all contribute back what we're learning and grow the OSS community around Prisma.

mcmar avatar Aug 14 '18 22:08 mcmar

@mcmar hey, sure thing! I’ll start working on that in the next days 👌

emmenko avatar Aug 15 '18 08:08 emmenko

Handy thread. Thank you for the information. I'd like to cast my vote for Google PubSub also as we currently have a cloud function consuming Prisma subscriptions over HTTP and passing them into GCP PubSub. This would reduce the latency.

develomark avatar Aug 21 '18 16:08 develomark

@emmenko @mavilein Hi, I'm in the same situation and a bit confused. I'm stuck at the prisma-prod image step: when I try to run the container with this image, I get an error saying I'm missing a SQL_INTERNAL_PASSWORD env var. I'm using Cloud SQL Postgres and I can't find where I missed something or what this var is.

lethot avatar Aug 23 '18 13:08 lethot

Where are you running the containers? Kubernetes?

emmenko avatar Aug 23 '18 13:08 emmenko

@emmenko kubernetes engine yes

lethot avatar Aug 23 '18 14:08 lethot

How do you pass the PRISMA_CONFIG?

emmenko avatar Aug 23 '18 14:08 emmenko

here is my config

- name: PRISMA_CONFIG
  value: |
    port: 4466
    rabbitUri: amqp://...
    managementApiEnabled: false
    databases:
      default:
        connector: postgres
        host: 127.0.0.1
        port: 5432
        user: "$(PG_USERNAME)"
        password: "$(PG_PASSWORD)"
        migrations: true
        connectionLimit: 4

lethot avatar Aug 23 '18 14:08 lethot

With the 'prismagraphql/prisma:1.14' image the container starts fine, but I have the exclusive agent lock problem and the container restarts every 5 min. With the 'prismagraphql/prisma-prod' image the container doesn't even start and fails with the missing SQL_INTERNAL_PASSWORD var error.

By the way, thanks for your help.

lethot avatar Aug 23 '18 14:08 lethot

Hmm the config looks good. I'm using those images and for me things work

images:
  prisma:
    repository: prismagraphql/prisma-prod
    tag: 1.14
    pullPolicy: IfNotPresent
  cloudsql:
    repository: gcr.io/cloudsql-docker/gce-proxy
    tag: 1.11
    pullPolicy: IfNotPresent

emmenko avatar Aug 23 '18 14:08 emmenko

Btw: I have one deployment for the "normal" prisma replicas which are connected to the LB, plus a deployment for the "management" prisma (1 replica only) that is not served by the LB (it's only used by port-forwarding).

Hopefully I manage to share my chart in the next weeks, in case it helps others ;)

emmenko avatar Aug 23 '18 14:08 emmenko