Find a way to have multiple instances of DBOS
Our application is designed to connect to multiple databases simultaneously: it pairs single-tenant databases with a multi-tenant program. To achieve this, connections to the databases may be initiated when a request is received, then saved in a global pool for reuse.
The design of the DBOS SDK prevents this use case, as the connection is managed by a singleton.
I don't yet have a suggestion for a better API (I need to think about it more). The ability to customize connection management, instead of changing the current API, would cover that advanced use case.
Hello @yacinehmito! Thank you for your interest.
This case is actually quite common, and the solution is to use DBOS steps for queries to such databases. You can create the connection in a global variable, in a DBOS configured instance, or, in some cases, inside the step itself.
DBOS transactions are a special case of steps that manage queries to one designated Postgres "user database". In some cases this is the "main DB of the app"; in other cases it's not used at all. You can have a DBOS app that only uses workflows and steps.
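A rough sketch of that pattern, with the pool stubbed out and all names hypothetical; in a real app the pool would be a `pg.Pool`, and `queryTenantDb` would be registered as a DBOS step:

```typescript
// Sketch of the pattern described above: connections live in a global
// registry (not in workflow state), and a step body looks one up by
// tenant id. The pool is a stub; getPool and queryTenantDb are
// illustrative names, not DBOS APIs.

interface Pool {
  connectionString: string;
  query(sql: string): Promise<string>;
}

function makePool(connectionString: string): Pool {
  return {
    connectionString,
    // Stub: a real pool would run the SQL against Postgres.
    query: async (sql: string) => `ran "${sql}" on ${connectionString}`,
  };
}

// Global pool registry, created lazily and reused across requests.
const pools = new Map<string, Pool>();

function getPool(tenant: string): Pool {
  let pool = pools.get(tenant);
  if (!pool) {
    pool = makePool(`postgres://db-${tenant}`);
    pools.set(tenant, pool);
  }
  return pool;
}

// This is the body you would register as a DBOS step. Only `tenant`
// and `sql` are step inputs; the connection itself never enters
// workflow state.
async function queryTenantDb(tenant: string, sql: string): Promise<string> {
  return getPool(tenant).query(sql);
}
```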
Does this help?
Yes it does, thank you. Is there an example or piece of documentation that would showcase this? If not, I'll try myself and contribute with some.
Yeah, it looks like we could use more examples in this area.
The closest thing we have is a few discord conversations, some in Python. For example, see https://discord.com/channels/1156433345631232100/1166779411920597002/1362096559176814743
Would love to work with you on an example. Let us know what you need.
Thank you very much. I had a look at the conversation; it doesn't seem to fit my case.
My issue is more about having a system database per tenant. Inputs of steps are recorded in the database, and our data isolation requirements are very strict. Every dependency in our backend is injected as part of a tenant-specific context, whereas DBOS really wants a single handle.
I had a look through the codebase. I saw that DBOSExecutor is also a singleton. It's not clear to me why that is.
So, with TSv3 we have the new datasource architecture, which allows database connection code to be plugged into the executor. As long as you can tell from a transaction call which DB connection/pool is supposed to be used, I think it would work to have a DS that fronts multiple DB pools. Maybe we can work together to try that?
> I think it would work to have a DS that fronts multiple DB pools
What is DS in that context?
> What is DS in that context?
Datasource... see https://docs.dbos.dev/typescript/tutorials/transaction-tutorial for the docs. We are currently in the process of documenting how datasources are coded: https://github.com/dbos-inc/dbos-docs/blob/chuck/plugins/docs/typescript/reference/plugins.md
The references above describe how you get a transaction using transaction config options. The transaction options could be extended to say which DB is to be used, and everything else should work.
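Purely as a hypothetical sketch of what "a DS that fronts multiple DB pools" could look like, where the transaction config carries a database key and the datasource resolves it to a pool (none of these names are real DBOS APIs, and the pool is stubbed):

```typescript
// Hypothetical: transaction config options extended with a `database`
// key, and a datasource that keeps one pool per key. Nothing here is
// the real DBOS datasource interface; it only illustrates the idea.

interface DbPool { url: string }

interface TxConfig {
  name: string;
  database: string; // hypothetical extension: which DB this transaction uses
}

class MultiPoolDataSource {
  private pools = new Map<string, DbPool>();

  constructor(private urls: Record<string, string>) {}

  // Resolve the pool named by the transaction's config options,
  // creating it lazily on first use.
  poolFor(config: TxConfig): DbPool {
    let pool = this.pools.get(config.database);
    if (!pool) {
      const url = this.urls[config.database];
      if (!url) throw new Error(`unknown database: ${config.database}`);
      pool = { url };
      this.pools.set(config.database, pool);
    }
    return pool;
  }
}
```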
Do you have any details about how you open the DB connections? What info describes which database to connect to and how you derive the connection string / credentials?
I am a bit confused by the mention of datasources. They seem intended to provide better guarantees when integrating with, well, a data source, whereas my request is about having multiple system databases for DBOS itself.
As for how connections are managed on our end:
- When a request is received by the server, the subdomain indicates which organization (i.e. tenant) is being served.
- Depending on the organization, we'll connect to a particular database.
- The connection is taken from a global connection pool. Each organization has its own pool of connections. That's just an optimization though.
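In sketch form (names and URL shapes here are illustrative, not our real code):

```typescript
// Sketch of the routing described above: the request's subdomain
// names the tenant, and each tenant gets its own lazily created pool.

function tenantFromHost(host: string): string {
  // "acme.example.com" -> "acme"
  return host.split(".")[0];
}

interface TenantPool { connectionString: string }

// One pool per organization; the pooling itself is just an optimization.
const tenantPools = new Map<string, TenantPool>();

function poolFor(host: string): TenantPool {
  const tenant = tenantFromHost(host);
  let pool = tenantPools.get(tenant);
  if (!pool) {
    pool = { connectionString: `postgres://pg.internal/${tenant}` };
    tenantPools.set(tenant, pool);
  }
  return pool;
}
```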
What I would expect is the ability to launch DBOS multiple times against different databases in the same process. We would have as many executors as we have databases; alternatively, we could distribute this across different processes. Flexibility in when and how DBOS connects to its system database is what I am getting at.
I don't really expect the current API to change; it's fine for many use cases. What I am suggesting is weakening the assumption, across the codebase, that there is a single instance, so that multiple DBOS instances could be launched. That probably makes decorators unusable under such a setup, but that's fine for a lower-level API in my opinion.
OK, so the original "ability to customize connection management" description led me to believe that you just need multiple application databases, so let me try to provide an example of this database setup. This would still hit the bullet points above with regard to where the application's data is stored.
Stepping back, "multitenancy" is always something that can be implemented at different levels, with different isolation guarantees, implementation difficulties, and operating costs.
- Multiple tenants in the same tables, with identifiers in rows; hardware completely shared and all the implementation complexity falling on the software. This offers the most sharing and the lowest operating cost, but also the lowest isolation. Note that this design is often coupled with the operational ability to split high-value / high-demand customers out into separate deployments... just because one deployment could theoretically hold everything doesn't mean it is done that way.
- Multiple databases, shared everything else. Important if the DB is the bottleneck or needs to be separated off to other servers, but otherwise very similar to 1 above. In this case it's easy to change which servers handle which DBs for special tenants.
- Separate deployments (containers, VMs, etc.) on the same machine / hardware pool. Simplifies app development somewhat and leaves the configuration/operation tier to handle the tenant tracking.
- Separate physical hardware. App knows nothing about it, most isolated, but also the least sharing and generally has higher operational costs as a result.
(And combinations thereof... of course if you do plan 4 then someone will get the idea to virtualize your hardware and use shared storage etc. and you end up back at 3, or if you try 1, 2, or 3 the ops people will make a separate deployment for key customers anyway.)
I was thinking you were headed toward 2 but please clarify.
Following up on the thought of multiple DBOSs in the same VM: I'm not sure what value that offers on the spectrum above. You can run multiple Node.js processes, one per DBOS, and the additional isolation that offers has some benefits; Node.js is really for a single core and a single memory space, so there's a built-in scalability/isolation bottleneck that suggests each DBOS have its own process. If you don't need more isolation than a single Node.js process offers, then you may as well share everything except the app DB.
Thank you for this complete write up.
For our product, Fabriq, we're doing or plan on doing all 4 levels of multi-tenancy, as you've outlined, depending on the customer requirements. We prefer when things are as much shared as possible, and selectively isolate as requested.
We have some customers that really don't want any of their data to end up in a shared logical database, but they don't care about sharing compute. That would be level 2 in your list. (To be even more complete: for legacy reasons, we have to provide dedicated infrastructure, so level 3, even for customers that only need level 2. That costs us quite a bit of money. For the last couple of years, as a cost-saving measure, we have been designing a new backend so that we can easily do level 2. Including DBOS in that is a head-scratcher, as I explain below.)
The problem with implementing level 2 under the current design of DBOS is that, as I understand it (I may be wrong), the inputs of steps are persisted in the system database. Those inputs can be arbitrary data, including data that is sensitive for our customers. In that case, we'd want to ensure that those customers have their own system database (as a logical database, not necessarily a different Postgres cluster).
> You can run multiple Node.js processes, one per DBOS, and the additional isolation that offers has some benefits; Node.js is really for a single core and a single memory space, so there's a built-in scalability/isolation bottleneck that suggests each DBOS have its own process. If you don't need more isolation than a single Node.js process offers, then you may as well share everything except the app DB.
In our view, the DBOS system database is as much "application" as the rest of what we persist, given the nature of the data stored there. If it were just technical bookkeeping, we wouldn't mind at all.
Having one process per customer is definitely doable on our end, but it upends our current architecture, where, through dependency injection, we have completely decoupled our workloads from the backing storage (which would include both the application database and the DBOS system database, as well as others). Multi-tenancy level 2 is handled entirely on the application side; with one process per customer, that concern would leak into the infrastructure side, which is exactly what we wanted to avoid when we designed our new backend.
I wish to clarify that I am not really contesting DBOS' current design, nor am I trying to champion my use case so that you support it. I am more wondering whether having the executor be a singleton is an inherent design limitation, or whether it is something that could be relaxed for some low-level API. If so, I'd be more than happy to contribute.
From what I understand, having it be a singleton is necessary to support decorators, as those are directly imported and used at the top level when tagging functions and methods. I feel like the function/callback API (I don't know what to call it) doesn't need a singleton: one would just instantiate a DBOS instance at some point and pass it around. What do you think?
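To make the idea concrete, here is a purely hypothetical sketch; nothing in it is a real DBOS API, and the class is only a stand-in for what an instance-based, callback-style API might feel like:

```typescript
// Hypothetical instance-based API: instead of a static DBOS singleton,
// each tenant constructs its own instance pointed at its own system
// database, and registers functions on it via callbacks rather than
// decorators. All names here are invented for illustration.

interface DbosConfig { systemDatabaseUrl: string }

class DbosInstance {
  private workflows = new Map<string, (input: unknown) => unknown>();

  constructor(readonly config: DbosConfig) {}

  registerWorkflow<I, O>(name: string, fn: (input: I) => O): void {
    this.workflows.set(name, fn as (input: unknown) => unknown);
  }

  // A real implementation would checkpoint to this instance's own
  // system database; this stub just runs the function.
  invoke<I, O>(name: string, input: I): O {
    const fn = this.workflows.get(name);
    if (!fn) throw new Error(`unknown workflow: ${name}`);
    return fn(input) as O;
  }
}

// One instance per tenant, each with its own system database.
const acme = new DbosInstance({ systemDatabaseUrl: "postgres://sys-acme" });
const beta = new DbosInstance({ systemDatabaseUrl: "postgres://sys-beta" });
acme.registerWorkflow("greet", (name: string) => `hello ${name}`);
```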
Just trying to understand the requirements here myself. That helps a lot.
The system DB doesn't have step inputs, but it does have step and other outputs, as well as workflow inputs, so it could be considered sensitive by the customer... fair enough to want each tenant to have a separate system DB in addition to separate app DBs.
It's currently not just the decorators (and other registrations) that treat DBOS as a singleton, but all the DBOS stuff... a random selection would include getEvent, getResult, sleep, retrieveWorkflow, etc. The DBOS API could be extended to allow instance access rather than static access, I suppose, or there could be a DBOS instance in your context, but I would have to think about the best way to do it. There is also a lot of auxiliary infrastructure that treats this as a singleton: everything that does tracing / admin. And for some things, like the scheduler, Kafka consumers, HTTP, and other endpoints: do you get one per instance, is there one main instance that handles them, or are they just not applicable in the multi-instance scenario?
Technically, allowing multiple executor instances isn't really a big challenge; the main consideration was just developer API simplicity... Just thinking out loud about what it means, not arguing one way or the other at this point.