opsdroid Adding scalability features in Opsdroid

Open bparbhu opened this issue 2 years ago • 13 comments

Hi Opsdroid team,

I just wanted to say I'm a big fan of the opsdroid framework. I love the design choices behind it and the modular aesthetic that runs through it from beginning to end. I'm planning on using Opsdroid to build a chatbot that can serve thousands of clients at a time and help reduce the number of tickets they raise.

The one thing I'm concerned about is scalability, both in terms of websocket connections and the number of opsdroid chats that can be going on simultaneously.

I know that we can build custom connectors in Opsdroid, and that could be one route to take, but what other parts of Opsdroid would need to change in order to build chatbots at scale?

Something I was thinking about incorporating was support for scaled websockets like SocketCluster or SocketIO and maybe building a modular layer that can rest on top of that but would support orchestration and middleware aspects of their scaled websocket solution.

The other thing I would want to incorporate is an optional, modular addition to Opsdroid that works with frameworks like Terraform, so that we can declare cloud infrastructure within our Opsdroid deployment and have that serve as a blueprint for which instances to spin up and down as we scale dynamically.

I would welcome the opportunity to help out with this. I'm also on a deadline for work that would benefit from this and it would help me sell the use of Opsdroid at work.

I'm a big fan and I just want to see this expand and help people realize that this is the proper way to deploy chatbots.

Thanks again, and I'm very appreciative of all your work on this,

-Brian

Sep 26 '21 04:09 bparbhu

Great to hear from you @bparbhu! Glad you like opsdroid!

We discussed something similar to this in #299 and ultimately decided that it was something we didn't have time to explore back in 2018.

The goal here I guess is to have multiple instances of opsdroid running in some kind of auto scaling group.

Ultimately I think websockets are the wrong technology choice here for trying to scale things. While some of the connectors and debugging modes use websockets, many use a REST API model where opsdroid is exposed to the internet and listens for incoming messages that are sent via a webhook from the chat service. In terms of scalability I think webhooks are the right model.

The other challenge is around the state and timers held within opsdroid as it is running. State can be solved reasonably easily by using an external database and opsdroid's memory from within your skills. If a user chats with opsdroid and each message goes to a different instance then memory can be used to manage state.
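To illustrate the point about state, here is a minimal sketch in plain Python, not opsdroid code. It assumes a shared external store that every instance can reach; `SharedStore`, `Instance`, and `demo` are hypothetical names made up for this example, and the round-robin loop only stands in for a load balancer.

```python
import asyncio


class SharedStore:
    """In-memory stand-in for an external database shared by all instances."""

    def __init__(self):
        self._data = {}

    async def get(self, key, default=None):
        return self._data.get(key, default)

    async def put(self, key, value):
        self._data[key] = value


class Instance:
    """Hypothetical bot instance that keeps no conversation state of its own."""

    def __init__(self, name, store):
        self.name = name
        self.store = store

    async def handle(self, user, text):
        # Read the per-user counter from the shared store, not from local memory.
        key = f"count:{user}"
        previous = await self.store.get(key, 0)
        count = previous + 1
        await self.store.put(key, count)
        return f"{self.name}: message {count} from {user}"


async def demo():
    store = SharedStore()
    instances = [Instance("bot-a", store), Instance("bot-b", store)]
    replies = []
    for i, text in enumerate(["hi", "help", "thanks"]):
        # Alternate between instances, as a load balancer might.
        replies.append(await instances[i % 2].handle("brian", text))
    return replies


if __name__ == "__main__":
    for line in asyncio.run(demo()):
        print(line)
```

The counter keeps increasing even though consecutive messages land on different instances. With a real database the read-then-write step would probably need an atomic update to be safe under concurrent requests.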

Cron timers could be the biggest issue because if any of your skills use cron then every instance of opsdroid will run the skill resulting in duplicated actions. To resolve this we would need to use some kind of distributed state store which can handle leadership election, and only the leader would run cron skills.
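As a rough sketch of what "only the leader runs cron skills" could look like, the example below uses a time-limited lease. It is plain Python with an in-memory stand-in for the distributed store; `LeaseStore` and `run_cron_tick` are hypothetical names, and nothing here is an existing opsdroid feature.

```python
class LeaseStore:
    """In-memory stand-in for a distributed store that can hold a leader lease."""

    def __init__(self):
        self._holder = None
        self._expires = 0.0

    def acquire(self, candidate, ttl, now):
        """Grant or renew the lease if it is free, expired, or held by candidate."""
        free = self._holder is None or now >= self._expires
        if free or self._holder == candidate:
            self._holder = candidate
            self._expires = now + ttl
            return True
        return False


def run_cron_tick(instance, store, now, ttl=30.0):
    """Run cron skills only on the instance that currently holds the lease."""
    if store.acquire(instance, ttl, now):
        return f"{instance} ran cron skills"
    return f"{instance} skipped (not leader)"


if __name__ == "__main__":
    store = LeaseStore()
    print(run_cron_tick("bot-a", store, now=0.0))   # bot-a takes the lease
    print(run_cron_tick("bot-b", store, now=10.0))  # lease still held by bot-a
    print(run_cron_tick("bot-b", store, now=45.0))  # lease expired, bot-b takes over
```

In a real deployment the `acquire` step would have to be a single atomic operation in the store (a compare-and-swap or similar), otherwise two instances could both believe they are the leader.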

Sep 27 '21 10:09 jacobtomlinson

Hi @jacobtomlinson,

Thanks again for the feedback and direction on this, I appreciate it greatly! I just want to further spec out what these ideas could be and the questions at large that could be here.

So ultimately the connector you would need is a REST API solution that is either Push or Pull. I think my own use case would require a Push config, but I can see that it might be beneficial to have both kinds of solutions and run them when you need to. Also, this REST API would need to consume webhook-based incoming messages from any chat service.

So given that, what would a set of design choices be for either a Push REST API or a Pull REST API that is mindful of privacy and security, and is extensible so that people can add customizations around an instance of Opsdroid? We could write a design doc of best practices that says "for anyone thinking about making a REST API, this is what you would need at a minimum", but ultimately the control of the API design would lie with the user.

Also, what kind of fixtures need to be present for regularly testing that REST API, ensuring that there's no leakage and that users' information is encrypted from the start, as well as testing the different aspects of the API?

Would we be saying to users "these are the standard set of tests you should run with any REST API", or "these are more specialized tests that pertain to chat applications, they should be run regularly, and here they are"?

The next piece would then be working out how we add modular ways to specify infrastructure for the pieces of Opsdroid that need it, and how that infrastructure specification gets executed.

Some of the examples you mentioned were a distributed state store for cron-based skills and clusters for dynamically dealing with state and memory for messages. The linked infrastructure could be either a cloud-based service or a proprietary on-site solution for those who need to keep things close to the chest. So giving someone the option to point to both local and cloud-based infrastructure is key to making sure people don't get locked into a particular service they don't like.

It would be good for me to understand what kind of infrastructure is needed at each point of the event flow to help scale out each piece and then based on how each piece benefits from that infrastructure, what are possible services that can be utilized to address each concern and what variety of services should we support?

The more the better, obviously, but for an initial scalability test, what would be the easiest to integrate, or what would be the most widely used solution for each infrastructure case?

What solutions do you see playing well with these kinds of infrastructure choices?

There was mention of using etcd or Consul in that previous clustering discussion, but do these solutions provide orchestration support like Consul, or would there need to be some layer that a user must write each time they deal with clustering? If so, do we make orchestration of resources another layer that sits on top of any infrastructure piece where it's relevant, or just link to services that incorporate orchestration so that users don't have to deal with that?

I think starting a design doc would probably be the first step for this.

Let me know what you think,

-Brian

Sep 27 '21 16:09 bparbhu

> Cron timers could be the biggest issue because if any of your skills use cron then every instance of opsdroid will run the skill resulting in duplicated actions. To resolve this we would need to use some kind of distributed state store which can handle leadership election, and only the leader would run cron skills.

I think the answer to this problem is to externalize the timer. In a horizontally scalable opsdroid install, the timer message/event should be triggered outside of opsdroid. So a skill that uses a cron trigger would either raise an error saying that only webhooks or external triggers can be used, or it would rely on an implementation that effectively creates a special webhook instead of a timer, with an external service tasked with hitting that webhook at the appropriate times.

Taking that one step further, I can imagine a heartbeat webhook. An external service would hit that webhook on a schedule, say 1x per minute (which can be as fine or coarse as desired). That webhook request would be load balanced as would any other webhook request. When that heartbeat is received anything that was scheduled to occur in the preceding minute gets run by the instance that happens to be handling that heartbeat webhook.
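The heartbeat idea could be sketched roughly as follows. This is plain Python and purely illustrative: `due_jobs` and `handle_heartbeat` are hypothetical names, the schedule is a simple mapping of job name to run time, and the one-minute interval matches the example above.

```python
from datetime import datetime, timedelta


def due_jobs(schedule, heartbeat_at, interval=timedelta(minutes=1)):
    """Return names of jobs scheduled in (heartbeat_at - interval, heartbeat_at]."""
    window_start = heartbeat_at - interval
    return sorted(
        name
        for name, run_at in schedule.items()
        if window_start < run_at <= heartbeat_at
    )


def handle_heartbeat(schedule, heartbeat_at):
    """Hypothetical webhook handler: the receiving instance runs whatever is due."""
    return [f"ran {name}" for name in due_jobs(schedule, heartbeat_at)]


if __name__ == "__main__":
    schedule = {
        "daily-report": datetime(2021, 9, 28, 9, 0, 0),
        "cleanup": datetime(2021, 9, 28, 9, 0, 30),
        "standup": datetime(2021, 9, 28, 9, 5, 0),
    }
    print(handle_heartbeat(schedule, datetime(2021, 9, 28, 9, 0, 0)))
    print(handle_heartbeat(schedule, datetime(2021, 9, 28, 9, 1, 0)))
```

The window is half-open so that a job falling exactly on a heartbeat boundary is picked up by one heartbeat only. This sketch assumes heartbeats arrive on time; a late or dropped heartbeat would skip jobs unless the last processed time were stored somewhere shared.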

Sep 28 '21 05:09 cognifloyd

> So ultimately the connector you would need to have is an REST API solution that needs to be either Push or Pull.

I'm not sure I understand where you are going with this.

If you look at existing connectors like Facebook, GitHub, Slack, etc., they all work this way. Opsdroid has a REST API and those connectors listen for incoming messages. If you run multiple instances of opsdroid behind a load balancer this would work just fine, with the exception of the cron timings which I've already mentioned.

You seem to be suggesting building another connector, which chat service would this be for?

Sep 28 '21 09:09 jacobtomlinson

Ah ok, I didn't know that there was a REST API for Opsdroid, my apologies. I thought we would have to build the REST API ourselves, so that's what I had in mind. Very cool, I'll check it out. Thanks again @jacobtomlinson and @cognifloyd, much appreciated!

Sep 28 '21 14:09 bparbhu

Something else I've been thinking about is what a good database solution would be for persisting chats when running opsdroid at scale. I've been considering MongoDB, but I was also looking at AWS DocumentDB, which mimics a MongoDB database. Here's a link to their page: https://aws.amazon.com/documentdb/ Do you think that would still be compatible with Opsdroid's MongoDB database support?

Sep 30 '21 15:09 bparbhu

Yeah we already support MongoDB, so if it follows the same interface it should be fine.

Oct 05 '21 10:10 jacobtomlinson

Awesome! I'll let you know if that works as well.

Oct 06 '21 04:10 bparbhu

Also, I was looking at the Docker Swarm installation compose file. Does anyone know if that same compose file can be used with an AWS service like Fargate or the AWS container service?

Oct 13 '21 21:10 bparbhu

Or if anyone knows of a good service that helps you with Docker Swarm clusters, I'd be happy to go with them.

Oct 13 '21 21:10 bparbhu

Discussing services for running Docker Swarm is getting a little off-topic for this issue about scalability in opsdroid.

Maybe head over to our chat for this kind of thing instead.

Oct 14 '21 12:10 jacobtomlinson

Ah, will do, sorry about that @jacobtomlinson.

Oct 14 '21 14:10 bparbhu

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

Apr 16 '22 08:04 stale[bot]
