prisma1
Prisma Horizontal Scaling
Describe the bug When I try to scale Prisma horizontally by adding a second server, the first server logs:
Obtaining exclusive agent lock...
Obtaining exclusive agent lock... Successful.
then the second server logs just this and crashes:
Obtaining exclusive agent lock...
The reason appears to be this line of code: https://github.com/prismagraphql/prisma/blob/d5c97fe8f1c1ee223ec1392ebdf16f2545b2f763/server/servers/deploy/src/main/scala/com/prisma/deploy/migration/migrator/DeploymentSchedulerActor.scala#L52
It seems that Prisma is explicitly ensuring that there's only ever 1 cluster/server (prisma terminology changes) that can run against a DB.
To Reproduce Steps to reproduce the behavior:
- Initialize Prisma server from Prisma AWS Fargate template (https://github.com/prismagraphql/prisma-templates/tree/master/aws)
- Go to ECS cluster (name will be what you used during CF stack creation)
- Click on the Prisma service
- Click Update in top-right corner
- Increase Number of tasks from 1 to 2
- Click Next, Next, Next, and Done
Expected behavior 2 Prisma instances would run against 1 DB
Screenshots None
Versions (please complete the following information):
- OS: Irrelevant, it's in AWS
- prisma CLI: prisma/1.11.1 (darwin-x64) node-v8.11.1
- Prisma Server: 1.11.0 (per Cloudformation template)
Additional context Already reported in the slack channel. Was told to create a bug here. @divyenduz
+1, encountering this problem as well
Hey @mcmar ,
thanks for bringing this up. This is an area where we are lacking documentation. The Prisma server can be started with the management API either enabled or disabled. If the management API is enabled, the server will try to acquire the agent lock on startup. This ensures that only one Prisma server at a time writes to the management tables. Your second server also has the management API enabled, hence you are seeing this log message.
The management API can be enabled or disabled in the Prisma server config, e.g.:
port: 60000
managementApiSecret: my-secret
rabbitUri: amqp://my-rabbitmq-server
enableManagementApi: true|false
databases:
  default:
    connector: mysql
    ...
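The startup behaviour described above can be illustrated with a minimal sketch. It is not Prisma's implementation: `AgentLock` and `start_server` are hypothetical names, and an in-process `threading.Lock` stands in for whatever database-level lock the real server uses. The real server appears to wait for the lock, whereas this sketch gives up immediately so that the outcome is easy to see.

```python
import threading


class AgentLock:
    """In-process stand-in (assumption) for an exclusive, database-level lock."""

    def __init__(self) -> None:
        self._lock = threading.Lock()

    def try_acquire(self) -> bool:
        # Non-blocking: returns False if another holder already has the lock.
        return self._lock.acquire(blocking=False)


def start_server(name: str, enable_management_api: bool, agent_lock: AgentLock) -> str:
    """Hypothetical startup routine: only servers with the management API take the lock."""
    if not enable_management_api:
        return f"{name}: started without management API"
    if agent_lock.try_acquire():
        return f"{name}: obtained agent lock, management API enabled"
    return f"{name}: agent lock is held by another server"


lock = AgentLock()
print(start_server("server-1", True, lock))
print(start_server("server-2", True, lock))   # second management server cannot get the lock
print(start_server("server-3", False, lock))  # never touches the lock
```

Under these assumptions, a second server with the management API enabled never obtains the lock, which matches the symptom in the original report.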
In Prisma Cloud we are running Prisma like this for horizontal scalability:
- Run exactly one Prisma server with the Management API enabled. Additionally run multiple Prisma servers with the Management API disabled. Internally we call those server types primary and secondary.
- Your load balancer in front of the servers must be set up like this:
  - All requests to /management must be routed to the primary server.
  - All other requests may be routed to any of the servers (primary + secondary).
- In addition to this you will also need a RabbitMQ server, which we use for PubSub. We need it to publish change events about the data to all servers so that they can notify connected subscriptions over Websocket (see the config entry rabbitUri above). If you don't want to run a RabbitMQ server on your own, I recommend CloudAMQP as a hoster.
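The routing rule above can be sketched in a few lines. This is only an illustration of the rule, not a load balancer configuration: the backend addresses and the `pick_backend` function are hypothetical, and round-robin is just one possible way to spread the non-management traffic.

```python
import itertools

# Hypothetical backend addresses, for illustration only.
PRIMARY = "prisma-primary:4466"
SECONDARIES = ["prisma-secondary-1:4466", "prisma-secondary-2:4466"]

# Non-management traffic may go to any server, primary included (round-robin here).
_all_servers = itertools.cycle([PRIMARY] + SECONDARIES)


def pick_backend(path: str) -> str:
    """Return the backend that should serve a request for the given URL path."""
    if path == "/management" or path.startswith("/management/"):
        # Only the primary has the management API enabled.
        return PRIMARY
    return next(_all_servers)


print(pick_backend("/management"))
print(pick_backend("/my-service/dev"))
```

In a real deployment the same rule would be expressed as a path-based rule in whatever load balancer or ingress is in use.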
Does that help?
in addition to this you will also need a RabbitMQ server, which we use for PubSub
Uh, is that a requirement or is it only necessary in case you use subscriptions? @mavilein Could you clarify that please?
@emmenko : Right now it is required. We need RabbitMQ for these reasons:
- as PubSub to power subscriptions over Websockets
- to store ServerSideSubscriptions/Webhooks in a queue. This allows us to eventually deliver Webhooks even if the server goes down due to a crash or reboot.
- to propagate information about schema changes to all servers. We need this because each server holds a cache of the schema of each service. Not having a cache would mean adding significant latency to query execution.
I guess you are fine with point 1. Point 2 is a tradeoff you need to decide for yourself in your use case. Point 3 is currently a blocker.
If we can find a solution for point 3 we could repackage the Prisma server without the RabbitMQ dependency. We could enable this through a separate Docker image or configuration flag.
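To make point 3 concrete, here is a minimal sketch of schema-cache invalidation over pub/sub. It is not Prisma's code: `InMemoryPubSub` is a synchronous stand-in for RabbitMQ, the `schema-changed` topic name is made up, and a plain dict plays the role of the database that holds the schemas.

```python
from collections import defaultdict
from typing import Callable, Dict, List


class InMemoryPubSub:
    """Synchronous stand-in (assumption) for a message broker such as RabbitMQ."""

    def __init__(self) -> None:
        self._subscribers: Dict[str, List[Callable[[str], None]]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Callable[[str], None]) -> None:
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: str) -> None:
        for handler in list(self._subscribers[topic]):
            handler(message)


class Server:
    """Hypothetical server that caches schemas and drops them on change events."""

    def __init__(self, name: str, pubsub: InMemoryPubSub, schema_store: Dict[str, str]) -> None:
        self.name = name
        self._schema_store = schema_store  # shared dict standing in for the database
        self._cache: Dict[str, str] = {}
        pubsub.subscribe("schema-changed", self._invalidate)

    def _invalidate(self, service: str) -> None:
        self._cache.pop(service, None)

    def schema_for(self, service: str) -> str:
        if service not in self._cache:
            # Cache miss: load from the (simulated) database.
            self._cache[service] = self._schema_store[service]
        return self._cache[service]


store = {"blog": "schema v1"}
bus = InMemoryPubSub()
primary = Server("primary", bus, store)
secondary = Server("secondary", bus, store)

secondary.schema_for("blog")           # secondary now caches "schema v1"
store["blog"] = "schema v2"            # a deploy changes the schema
bus.publish("schema-changed", "blog")  # every server drops its cached copy
print(secondary.schema_for("blog"))
```

Without the publish step the secondary would keep serving the cached schema, which is why some form of cross-server notification is needed as long as each server caches schemas.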
I see. So if I want multiple replicas I also need to have a rabbitmq cluster on the side.
Do you plan to make the pubsub system configurable or is rabbitmq the only option? For example, I’m running my services on GCP and it would be easier to use google pubsub.
Thanks anyway for the explanation! 🙏
@mavilein Is Prisma using AMQP 0.9.1 or 1.0? 1.0 will work with Apache ActiveMQ and Amazon MQ, which would make my life much easier. 0.9.1 would require me to spin up my own RabbitMQ service with its own ELB.
@emmenko : RabbitMQ is currently the only option, but we have encapsulated our pubsub code into a neat interface. We could provide additional implementations for e.g. google pubsub. I just added a Feature request for this.
@mcmar : We are using the RabbitMQ Java client, so I think this is 0.9.1 then.
@mcmar @emmenko : Would you be happier if we supported Apache Kafka? We are considering adding it for another feature anyway.
Thanks. For now I'm trying using the stable/rabbitmq helm chart. I think it should be fine.
In our specific case, we run our services on K8s on Google Cloud, so for us an integration with Google PubSub would be perfect so that we don't have to manage that on our own 😉
@mavilein I'm trying to deploy prisma with 1 primary and 2 secondary. The primary has the managementApi enabled, the secondaries do not.
However, after starting, the primary and one of the secondaries keep crashing with the error
Fatal error during deployment worker initialization: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://single-server/user/$b#-844349262]] after [300000 ms]. Sender[null] sent message of type "com.prisma.deploy.migration.migrator.DeploymentProtocol$Initialize$".
I noticed that all 3 prisma containers are trying to "obtain the agent lock". From what you wrote before I thought that only the primary is supposed to get the lock, or should all of them do it? In that case, any idea why I am still getting errors? 🤔
I'm using prisma:1.13.4.
Now the primary and one of the secondary are running but the 2nd secondary keeps crashing (no errors in the logs, only 👇)
Obtaining exclusive agent lock...
Initializing workers...
Successfully started 1 workers.
Server running on :4466
Version is up to date.
Am I doing something wrong? 🤔
Tried also with 1 primary and 1 secondary. After a couple of min, the secondary crashed with the same error.
I have a feeling that the management API is still enabled in both. I checked and I'm passing enableManagementApi: true to the primary and enableManagementApi: false to the secondary.
In case it helps, here are the logs for the pods (for the timings)
$ kubectl get pods -w | grep prisma
prisma-primary-c6f64d69d-ckkbn 2/2 Running 0 3m
prisma-secondary-75b857b766-xx8jz 2/2 Running 0 3m
prisma-secondary-75b857b766-xx8jz 1/2 Error 0 6m
prisma-secondary-75b857b766-xx8jz 1/2 Running 1 6m
prisma-secondary-75b857b766-xx8jz 2/2 Running 1 8m
And here the logs for the primary
Obtaining exclusive agent lock...
Obtaining exclusive agent lock... Successful.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
Deployment worker initialization complete.
Initializing workers...
Successfully started 1 workers.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
Server running on :4466
Version is up to date.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
[Metrics] No Prisma Cloud secret is set. Metrics collection is disabled.
and here for the secondary
Obtaining exclusive agent lock...
Initializing workers...
Successfully started 1 workers.
Server running on :4466
Version is up to date.
Fatal error during deployment worker initialization: akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka://single-server/user/$b#-1603503801]] after [300000 ms]. Sender[null] sent message of type "com.prisma.deploy.migration.migrator.DeploymentProtocol$Initialize$".
I noticed that when I access http://localhost:4466/management on both containers, I get the GraphQL playground. Is this supposed to work even if enableManagementApi is set to false?
@emmenko : Oh, my bad. I forgot to say that you need to use the prisma-prod image. Only this one contains the necessary ifs. We should really improve this experience. Thanks for continuing to dig 👍
I forgot to say that you need to use prisma-prod image
Ooooh, thanks! 😇
I'll try that right away.
Btw, I'm happy to contribute to the documentation feedback with the experience I had so far. Let me know in case you need that 😉
Works! 🙌
$ kubectl get pods -w | grep prisma
prisma-rabbitmq-0 1/1 Running 0 4h
prisma-rabbitmq-1 1/1 Running 0 4h
prisma-rabbitmq-2 1/1 Running 0 4h
prisma-primary-56d5699664-sn2sj 2/2 Running 0 35m
prisma-secondary-cbb5c87b8-c8qdn 2/2 Running 0 27m
prisma-secondary-cbb5c87b8-z8zgh 2/2 Running 0 23m
@emmenko : Nice. 🎉 I have opened an issue to unify our 2 Docker images as they do not seem necessary to me and just cause confusion. Happy to come back to you to get your feedback on the docs when we have a first version ready! 🙏
@mavilein I'm attempting to implement the pattern you described in which you route /management to prisma-primary and all other routes * to prisma-primary and prisma-secondary.
I'm unable to implement that pattern in AWS using Application Load Balancers because ECS services can only register themselves in one target group.
I'm working off of the fargate.yml template in the prisma-templates repo.
How does prisma host their own servers in AWS? Do you use ECS? Do you have 2 separate prisma services? Do you use Application Load Balancers as opposed to Classic Load Balancers? I can't find a way to implement it using ECS with ALBs.
Here's the issue in ECS: https://github.com/aws/amazon-ecs-agent/issues/1351#issuecomment-412377706
Hey @emmenko if you have a fairly generic kubernetes template for prisma with support for horizontal scaling, would you mind posting it or submitting a PR against https://github.com/prismagraphql/prisma-templates ? The current docs only show the single-server setup. I'm currently working on adding a second Cloudformation template for horizontal scaling. It'd be good to get something in there for Kubernetes too. I thought it'd be cool if we could all contribute back what we're learning and grow the OSS community around Prisma.
@mcmar hey, sure thing! I’ll start working on that in the next days 👌
Handy thread. Thank you for the information. I'd like to cast my vote for Google PubSub also as we currently have a cloud function consuming Prisma subscriptions over HTTP and passing them into GCP PubSub. This would reduce the latency.
@emmenko @mavilein Hi, I'm in the same situation and a bit confused. I'm stuck at the prisma-prod image step: when I try to run the container with this image, I get an error saying that I'm missing an SQL_INTERNAL_PASSWORD env var. I'm using Cloud SQL Postgres, and I can't find what I missed or what this var is.
Where are you running the containers? Kubernetes?
@emmenko kubernetes engine yes
How do you pass the PRISMA_CONFIG?
here is my config
- name: PRISMA_CONFIG
  value: |
    port: 4466
    rabbitUri: amqp://...
    managementApiEnabled: false
    databases:
      default:
        connector: postgres
        host: 127.0.0.1
        port: 5432
        user: "$(PG_USERNAME)"
        password: "$(PG_PASSWORD)"
        migrations: true
        connectionLimit: 4
With the 'prismagraphql/prisma:1.14' image the container starts fine, but I have the exclusive agent lock problem and the container restarts every 5 min. With the 'prismagraphql/prisma-prod' image the container doesn't even start and fails with the missing SQL_INTERNAL_PASSWORD var error.
By the way, thanks for your help.
Hmm the config looks good. I'm using those images and for me things work
images:
  prisma:
    repository: prismagraphql/prisma-prod
    tag: 1.14
    pullPolicy: IfNotPresent
  cloudsql:
    repository: gcr.io/cloudsql-docker/gce-proxy
    tag: 1.11
    pullPolicy: IfNotPresent
Btw: I have one deployment for the "normal" prisma replicas which are connected to the LB, plus a deployment for the "management" prisma (1 replica only) that is not served by the LB (it's only used by port-forwarding).
Hopefully I manage to share my chart in the next weeks, in case it helps others ;)