Feature request: Drop the replication connection when there are no active shapes and there isn't a big backlog of WAL to process
One annoying thing about Electric right now is that it keeps a constant connection to the database. For scale-to-zero services like Neon and Prisma Postgres, this means they can never scale down, which makes them a lot more expensive for low-traffic projects.
What we can do is drop the connection when there are no shape subscriptions, periodically check the WAL size, and wake up the publication to process data when it grows too large (as well as when there are active shapes).
This would allow the DB to spin down, keeping it cheap to run.
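For illustration, here's a minimal sketch (TypeScript with the `pg` client) of what the periodic WAL-size check could look like, using Postgres's `pg_replication_slots` view. The slot name and threshold are placeholders, not Electric's actual defaults:

```ts
import { Client } from "pg";

// Placeholder threshold and slot name; Electric's real defaults may differ.
const WAL_BACKLOG_THRESHOLD_BYTES = 64 * 1024 * 1024; // 64 MiB
const SLOT_NAME = "electric_slot"; // hypothetical slot name

async function walBacklogExceeded(databaseUrl: string): Promise<boolean> {
  const client = new Client({ connectionString: databaseUrl });
  await client.connect();
  try {
    // How much WAL the slot is holding back, in bytes.
    const { rows } = await client.query(
      `SELECT pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn) AS retained_bytes
         FROM pg_replication_slots
        WHERE slot_name = $1`,
      [SLOT_NAME]
    );
    const retained = rows[0] ? Number(rows[0].retained_bytes) : 0;
    return retained > WAL_BACKLOG_THRESHOLD_BYTES;
  } finally {
    await client.end();
  }
}
```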
But wouldn't checking the WAL size periodically cause the database server to wake up anyway?
But wouldn't checking the WAL size periodically cause the database server to wake up anyway?
I'm assuming that if this check is infrequent enough this could still work (e.g. once per day or every few hours or something like that, not sure how the "scale to zero" works for these various providers), but there won't be a persistent active connection at least?
@msfstef From the original description:
periodically check the WAL size and wake up the publication to process data when it grows too large
This is highly dependent on the write rate in the user database. Once an hour might not be enough in some cases, so setting some default periodicity might be inadequate for some users while fine for others.
Most managed Postgres services provide some sort of visibility into the replication lag, i.e. how much WAL is not getting discarded because it is claimed by a replication slot. Ideally, we would hook into that and wake Electric up whenever a threshold of WAL buildup is reached. But in practice I don't see that happening.
A more practical approach would be to maintain statistics on the incoming flow rate of transactions in our replication client and set a dynamic "wake up alert" based on that before closing the replication connection to let the database server scale down. But we'd still keep Electric's own compute active: without a more direct integration with the runtime platform, the Electric process can't shut down as well until it's woken up by an incoming HTTP request or an alarm.
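To make the "dynamic wake up alert" idea concrete, here's a rough sketch of the kind of estimate it could use. All names and numbers are illustrative, not Electric's actual configuration:

```ts
// Estimate when the WAL backlog will hit a threshold, based on the write
// rate observed while the replication connection was still open.
interface WriteStats {
  observedWalBytesPerSecond: number; // e.g. an EWMA kept by the replication client
}

function scheduleWakeUpMs(
  stats: WriteStats,
  thresholdBytes = 64 * 1024 * 1024, // wake before ~64 MiB of backlog
  minSleepMs = 5 * 60_000,           // never wake more often than every 5 minutes
  maxSleepMs = 6 * 60 * 60_000       // check at least every 6 hours regardless
): number {
  if (stats.observedWalBytesPerSecond <= 0) return maxSleepMs;
  const secondsUntilThreshold = thresholdBytes / stats.observedWalBytesPerSecond;
  return Math.min(maxSleepMs, Math.max(minSleepMs, secondsUntilThreshold * 1000));
}
```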
not sure how the "scale to zero" works for these various providers
Neon scales down after 5 minutes of inactivity. Prisma Postgres is the only other scale-to-zero service I know of — they don't seem to say anywhere how long they wait but presumably it's pretty quick too.
So checking once every two hours would allow it to be inactive ~95% of the time, assuming no other use (each check keeps the instance awake for roughly its 5-minute idle window, i.e. about 5 of every 120 minutes).
We could also probably just check their API to see if the instance is awake — that wouldn't wake it up. And if it's down, we just let it keep sleeping.
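As a rough illustration of that idea (the URL and response shape below are assumptions about Neon's management API, so double-check against their docs before relying on it):

```ts
// Sketch: ask the provider's control-plane API whether the compute is awake,
// without touching Postgres itself. The endpoint path and the shape of the
// response are assumptions about Neon's API, not verified here.
async function neonEndpointIsActive(
  apiKey: string,
  projectId: string
): Promise<boolean> {
  const res = await fetch(
    `https://console.neon.tech/api/v2/projects/${projectId}/endpoints`,
    { headers: { Authorization: `Bearer ${apiKey}` } }
  );
  if (!res.ok) throw new Error(`Neon API returned ${res.status}`);
  const body = (await res.json()) as {
    endpoints: Array<{ id: string; current_state: string }>;
  };
  // If any compute endpoint reports itself as active, the instance is awake.
  return body.endpoints.some((e) => e.current_state === "active");
}
```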
Most managed Postgres services provide some sort of visibility into the replication lag, i.e. how much WAL is not getting discarded because it is claimed by a replication slot. Ideally, we would hook into that and wake Electric up whenever a threshold of WAL buildup is reached. But in practice I don't see that happening.
Yeah Neon doesn't seem to have an API for getting this metric (atm anyways).
The first iteration of this is implemented in https://github.com/electric-sql/electric/pull/3126.
The next step is to figure out a better strategy for waking up the DB connection. The dumb periodic check works in dev, but in a production setting it will lead to replication lag for bursty workloads if Electric "sleeps" during the time lots of new transactions affecting existing shapes are committed in Postgres.
The "waking up" must be triggered by new data getting inserted into the DB but we don't have any side channel to learn about that other then waking up the database to check the WAL size. Here are some options I can think of:
- during normal periods Electric keeps track of write statistics so that when it scales down it can estimate the best point in time to wake up in the future and check for new transactions
- expose a new endpoint in the HTTP API to cause Electric to "wake up". Developers could use this to hook provider-specific WAL metrics up to a webhook that hits Electric's endpoint before the WAL buildup exceeds a developer-controlled threshold (see the sketch after this list)
- when a new shape request arrives (one for which we have no shape log), immediately wake up and materialize the shape
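A sketch of the webhook idea from the second option: a provider-side alert (or a cron job) calls a wake-up endpoint on Electric when the WAL backlog crosses a threshold. The `/v1/admin/wake` path is hypothetical; no such route exists in Electric today:

```ts
// Hypothetical wake-up call to Electric. The endpoint path is a placeholder.
async function wakeElectric(electricBaseUrl: string): Promise<void> {
  const res = await fetch(`${electricBaseUrl}/v1/admin/wake`, { method: "POST" });
  if (!res.ok) {
    throw new Error(`wake-up request failed with status ${res.status}`);
  }
}

// Example: wire it to the WAL check sketched earlier in the thread.
// if (await walBacklogExceeded(process.env.DATABASE_URL!)) {
//   await wakeElectric("https://electric.example.com");
// }
```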
I like waking up on the new shape request. Though I think we should keep the connection open as long as the source is getting any shape requests. E.g. imagine a scenario where someone has an app open but is only periodically doing writes e.g. every 10 minutes. It'd be very odd if the replication was torn down during those pauses as then sync would suddenly "break" next time they did a mutation.
The feature has value for personal/dev projects for the case where the main user goes away and the application stops being used. Without scale to zero, Electric keeps spending compute resources by virtue of keeping a connection open to Postgres.
My opinion is that the feature should be as transparent as possible. We should always show fresh data, independently of whether Electric is connected to Postgres at that moment, at the cost of latency.
It would be reasonable to wake up on live requests. I get that this will end up consuming compute time if someone leaves the application running or if there is no new data in the publication, but I would be very annoyed if Electric is showing me "up-to-date" when it is serving stale data. Having a heuristic to determine whether there is new data in the publication, potentially with a one-off request to the database, is an interesting improvement to avoid establishing a replication connection.
I understand the perspective that during development we might actually not be inserting any data into the database and we could be saving compute during all that time, but I think it's a reasonable expectation that Electric is connected to the database when handling any request, and this corner case will be a lot less likely than a sporadic request being intentional.
Fine, this is the easiest strategy to implement. I never thought this was the crux of the feature request because we have almost all of the pieces already in place: all shape requests go through a stack readiness check and in my PR I've already made StatusMonitor aware of the scaled down mode of operation, so it's a matter of restarting the supervision tree whenever a request checks the state of the stack and holding for a bit while the database connections come online.
As for the part where it's annoying to get an "up-to-date" response while Electric is not actively replicating from Postgres, we're going to deal with it anyway as we make Electric's read and write capabilities more independent. During a deployment, for example, if the new Electric version starts up and is able to start serving shape requests, it can start doing so but its replication connection might not be fully initialized by then.
I was thinking of addressing this by adding another field to the health endpoint's response, or rather deprecating the current status and replacing it with separate api_status and replication_status. Additionally, the HTTP response to a shape request can include an additional header in cases where the response was served from the shape log with no active replication from Postgres at the time.
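To make that proposal concrete, here's a sketch of what the split health response could look like. The field names come from the comment above; the exact value sets and the header name are assumptions, not Electric's current API:

```ts
// Proposed split of the health endpoint's single "status" field.
interface HealthResponse {
  api_status: "active" | "starting" | "stopping";
  replication_status: "active" | "scaled_down" | "starting";
}

async function checkHealth(electricBaseUrl: string): Promise<HealthResponse> {
  const res = await fetch(`${electricBaseUrl}/v1/health`);
  return (await res.json()) as HealthResponse;
}

// A shape response served from the log while replication is scaled down
// could carry a hint header, e.g. (name is hypothetical):
//   electric-replication: inactive
```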
As for the part where it's annoying to get an "up-to-date" response while Electric is not actively replicating from Postgres, we're going to deal with it anyway as we make Electric's read and write capabilities more independent.
It's different if this only happens for a small period during hand-over where it can be perceived as latency. If we can't make that gap small enough, I think it would make sense to add some more info to the control message.
Adding the information to the health status would make it more complete, if it's already easy to extract the info. Thinking of it, it would be great to show the replication/connection status on the user dashboard. Don't need to do it now.
Hi all, I was following along and thought I could contribute the perspective of a potential user.
I would love to use Electric in our project but the price is holding me back right now. I also don't want our local dev env, our Vercel Preview environments, and prod to diverge in functionality.
In my team, each developer has their own Neon DB cloud instance that is used for development. We are also using Vercel preview environments, where each environment gets its own Neon DB branch (priced the same as a separate DB). The smallest compute on Neon DB is 0.25 vCPUs, and keeping one db/branch constantly running costs about $25 per month. Also, the dev and preview environments generally don't cause bursty write workloads.
I am not too worried about prod compute cost but I am worried about spending per developer and per preview environment.
I am wondering if this feature could be scoped as a developer setting with a clear description of the limitations (e.g., regarding WAL buildup)?
Re
very annoyed if Electric is showing me "up-to-date" when it is serving stale data.
I have to say I am not following this. If there is an active shape subscription the replication connection should be active... and I would be (happily) paying for it.
Side note, this is the Neon DB code that controls scaling to zero. I wanted to see if it checks the mere presence of a replication slot or if it checks that the slot is active. Looks like this part should be working.
```rust
// Don't suspend compute if there is an active logical replication subscription
//
// `where pid is not null` – to filter out read only computes and subscription on branches
const LOGICAL_SUBSCRIPTIONS_QUERY: &str =
    "select count(*) from pg_stat_subscription where pid is not null;";
```
https://github.com/neondatabase/neon/blob/main/compute_tools/src/monitor.rs#L295
Hey @felix-quotez. Thanks for sharing your perspective!
I have a POC implemented in https://github.com/electric-sql/electric/pull/3126 and I have tested it with Neon to make sure it lets Neon shut down its compute when there are no shape requests from clients for the configured duration. And as soon as a new shape request comes in, Electric opens new database connections to restore the logical replication from PG.
Still need to address a few points of contention with the Electric team. I think it'll be ready next week.
@felix-quotez Just letting you know that this feature is implemented in main. You can start using it by pulling `electricsql/electric:canary` from Docker Hub.
The documentation preview is available here.