[Proposal] Improve replication status by providing minimum stats
sqld provides a way to replicate from a primary to multiple readers, but today there is no easy way to check the status of the replicas, for example whether they are lagging behind (read more about this later under future work).
Other databases such as Postgres implement this kind of observability and expose detailed replica status from the point of view of the master; see pg_stat_replication [1].
Proposal
Implement new stats entries in the stats file that provide the status of all of the replicas that are or were connected to the DB, exposing the following information (a possible shape of such an entry is sketched after the list):
- replica_id: the unique id of the replica. This would require changing the current proto to add the id of the replica. It should also be considered how the replica id could be made persistent on the replica side.
- replica_ip: the replica IP address.
- replica_port: the replica port.
- last_frame_no_sent: the latest frame no sent to the replica.
- last_frame_no_flushed: the latest frame no that was effectively flushed; ideally last_frame_no_sent would have the same value. Today, unless I'm missing something, there is no way to know whether a replica has effectively applied a specific frame. This would most likely require changing the protocol to a bidirectional stream, and also the hello protocol, so the replica can announce the latest frame no flushed.
- frame_no_flushed_lag: the total lag the replica is incurring, considering the latest flushed frame no announced.
- last_time_seen: timestamp of the last time the replica was seen.
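As a rough illustration of what such an entry could look like if the stats keep being serialized with serde, here is a minimal sketch; the field names follow the list above, but the struct itself is hypothetical and not part of sqld today:

```rust
use serde::{Deserialize, Serialize};

/// Hypothetical per-replica entry in the stats file (illustrative only).
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ReplicaStats {
    /// Unique id of the replica, ideally persistent across restarts.
    pub replica_id: String,
    /// Address the replica connected from.
    pub replica_ip: String,
    pub replica_port: u16,
    /// Latest frame no the primary sent to this replica.
    pub last_frame_no_sent: u64,
    /// Latest frame no the replica reported as flushed, if the protocol is extended.
    pub last_frame_no_flushed: Option<u64>,
    /// current_frame_no(primary) - last_frame_no_flushed(replica).
    pub frame_no_flushed_lag: Option<u64>,
    /// Unix timestamp (seconds) of the last time the replica was seen.
    pub last_time_seen: u64,
}
```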
Some open questions
If the proposal is accepted in full, it raises some open questions about changes that might be considered breaking, or at least non-trivial:
How can we store the replica_id on the replica side without losing it through restarts?
I would be tempted to generate this field automatically once and store it in an internal, read-only DB. Users would identify replicas through replica_ip and replica_port, but what would identify a replica internally and unambiguously would be the replica_id.
This would also require some changes at the proto level, at least to provide the replica_id.
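Purely as an illustration of the "generate once and persist" idea, a replica could create the id on first start and keep it next to its data directory; the file name, location and use of a UUID are assumptions, not existing sqld behaviour:

```rust
use std::{fs, io, path::Path};
use uuid::Uuid;

/// Load the replica id from `<db_path>/replica_id`, generating and persisting
/// one on first start so it survives restarts (illustrative sketch).
fn load_or_create_replica_id(db_path: &Path) -> io::Result<String> {
    let id_file = db_path.join("replica_id");
    if id_file.exists() {
        return fs::read_to_string(&id_file).map(|s| s.trim().to_string());
    }
    let id = Uuid::new_v4().to_string();
    fs::write(&id_file, &id)?;
    Ok(id)
}
```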
Identifying properly the latest frame flushed at the replica side
This would require some changes at the proto level, sending back as a response an ack with the latest flushed frame no, like
rpc Snapshot(stream LogOffset) returns (stream Frame) {}
Notice the usage of the stream as a request.
This would increase the total number of gRPC messages used for replication, so we might want to avoid having one ack per frame and instead implement some kind of threshold, committing a frame no only after some time has passed or after some number of frames has been processed.
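A minimal sketch of such a threshold on the replica side, assuming the replica tracks the frame numbers it has applied; the type and its fields are illustrative, not an existing sqld API:

```rust
use std::time::{Duration, Instant};

/// Decides when the replica should ack the frames it has applied,
/// batching acks by frame count and by elapsed time (illustrative sketch).
struct AckThrottle {
    last_acked_frame_no: u64,
    last_ack_at: Instant,
    max_unacked_frames: u64,
    max_ack_interval: Duration,
}

impl AckThrottle {
    /// Returns the frame no to ack, if an ack should be emitted now.
    fn should_ack(&mut self, applied_frame_no: u64) -> Option<u64> {
        let unacked = applied_frame_no.saturating_sub(self.last_acked_frame_no);
        if unacked >= self.max_unacked_frames
            || self.last_ack_at.elapsed() >= self.max_ack_interval
        {
            self.last_acked_frame_no = applied_frame_no;
            self.last_ack_at = Instant::now();
            return Some(applied_frame_no);
        }
        None
    }
}
```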
Future changes on top of this one
- Moving the stats from a JSON file to a libsql database, or to a system table of the existing DB, would provide a way to gather the statistics of a sqld service remotely using traditional tools like HTTP or a PG driver. This interface would look more similar to what Postgres provides today. This change would also unlock the following one.
- By having the state of the replicas stored in a centralised place, and as a simple DB table, we could later implement monotonic reads for replicas, once the stats table is itself replicated. Replicas could return results to the user considering only the minimum frame no flushed by all of the replicas (see the sketch after this list).
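To make the second point concrete, here is a sketch of what a replicated replica-stats table and the query a replica could run might look like; the table name, columns and use of rusqlite are assumptions made for illustration:

```rust
use rusqlite::{Connection, Result};

/// Illustrative schema for a replicated replica-stats table, plus the query a
/// replica could use to find the minimum frame no flushed across the cluster.
fn min_flushed_frame_no(conn: &Connection) -> Result<Option<u64>> {
    conn.execute_batch(
        "CREATE TABLE IF NOT EXISTS _sqld_replica_stats (
             replica_id            TEXT PRIMARY KEY,
             last_frame_no_flushed INTEGER,
             last_time_seen        INTEGER
         );",
    )?;
    conn.query_row(
        "SELECT MIN(last_frame_no_flushed) FROM _sqld_replica_stats",
        [],
        |row| row.get(0),
    )
}
```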
Thanks for the proposal @pfreixes! @MarinPostma could you please comment?
The idea sounds good, I'd even go as far as starting with storing this data in an internal sqld-specific table. That would make all the info available for everyone right away. We should just make sure that our first internal table for sqld also has a column that indicates its version, so that we keep it backward compatible.
Ok, I'll try to start working on this one; it will take me a while since my experience with Rust is very basic. We can also go into details like the versioning once I have a minimal draft and a better picture of all the required changes.
Hello @pfreixes, thanks for your proposal! I'm not opposed to collecting and displaying this information to the user, however, there are a couple of points that need to be discussed:
Persistence of the replication state
I'm not convinced that this information needs to be persisted. I think it can be useful to have information on the running cluster. If a node is a member of the cluster, then it'll try to connect to the primary. The corollary to that is: if a node doesn't connect to the primary, then it's not part of the cluster, and therefore should be ignored.
How can we store the replica_id on the replica side without losing it through restarts?
I don't think that this is necessary for a model where we don't persist replication data anyways. However, it could be useful to be able to identify a replica, by giving it a name. This name can be provided at startup, and proper configuration of the replicas can make it stable.
Identifying properly the latest frame flushed at the replica side
First of all, frames are applied in "batches" corresponding to a transaction. In the current implementation, the replica polls the frame stream, and applies the frames it received as soon as it detects a transaction boundary. If an error occurs, the stream is shut down, and if the error is recoverable, the replica will re-issue a call to frames. A good estimation of the flushed index is to consider any frame sent to be applied. Backpressure should prevent this number from being too far off, and any recoverable error that occurs while applying will cause a new call to frames that the primary can use to adjust its estimation.
In this model, the stats give you a relative idea of the health of the system, but I think it's a good tradeoff, for reasons I will explain later.
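A rough sketch of that estimation on the primary side, assuming the primary records the last frame no it streamed and corrects the estimate whenever the replica re-issues a call to frames with an offset; the names here are illustrative, not existing sqld code:

```rust
/// Primary-side estimate of how far a replica has gotten (illustrative sketch).
#[derive(Default)]
struct ReplicaProgress {
    /// Last frame no streamed to the replica, used as an optimistic
    /// estimate of the flushed index.
    estimated_flushed_frame_no: u64,
}

impl ReplicaProgress {
    /// Called for every frame sent on the stream.
    fn on_frame_sent(&mut self, frame_no: u64) {
        self.estimated_flushed_frame_no = self.estimated_flushed_frame_no.max(frame_no);
    }

    /// Called when the replica re-issues a call to `frames` after a
    /// recoverable error: the offset it asks for corrects the estimate.
    fn on_frames_call(&mut self, next_offset: u64) {
        self.estimated_flushed_frame_no = next_offset.saturating_sub(1);
    }
}
```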
Storing the meta in a libsqld table
If we decide that this information should be persisted, should it be replicated as well?
Using replication state for replication purposes, and monotonic reads
From the perspective of a connection (process), sqld already guarantees monotonic reads (and even stronger real-time guarantees). What you are suggesting is linearizability, which is a constraint on the system as a whole. I believe this is a far too strong guarantee to provide, and it comes with many technical challenges, to name a few:
- we have no control over up to what index is read: once the frames are applied, they become visible. There is no obvious way around that.
- a system that waits for all replicas to reach the same replication index is not guaranteed to make progress. Consensus algorithms, such as raft, enforce that the write is replicated to at least 1/2 of the participants before it can be applied, but raft is a single-reader model. You can relax the single reader with a quorum read, for example, but it's still very costly and scales very poorly with the number of participants.
Hi @MarinPostma, thanks for the feedback; let me try to add my thoughts to your comments.
I'm not opposed to collecting and displaying this information to the user
I believe we can try to scope the changes to what is in line with the sqld vision in the mid and long term; let's try to find out what the first step in that direction could be.
I'm not convinced that this information needs to be persisted. I think it can be useful to have information on the running cluster. If a node is a member of the cluster, then it'll try to connect to the primary. The corollary to that is: if a node doesn't connect to the primary, then it's not part of the cluster, and therefore should be ignored.
The problem with that strategy is that any observability tooling you could build on top of this could produce misleading information during connectivity issues: the moment a replica is no longer part of the stats output, it is simply not considered. For example, if a replica gets disconnected - for any reason - and you have an alarm that is triggered when the lag between the replica and the master goes beyond X, the alarm wouldn't be triggered during the disconnection.
The good part of having state is that it provides the real picture of the state of your replicas. As a downside, if you want to remove replicas from the cluster you might need to remove them from the state "manually" as well.
There is a third solution, which is to also emit the proper stats from the replica side and implement the stitching of the primary and the replica in a different component - the user would implement that logic on their side. But this comes with other kinds of problems, since most observability tools today are not designed for this kind of stitching (although it is still doable). Just to make the process understandable, measuring the lag between the replica and the master would look like lag = frame_no_primary - frame_no_replica. As I said, I do not like this solution, or at least in the past it gave me a lot of headaches.
If you believe that sqld should avoid having state for the replica status then I would not go for it, but I'm wondering whether the replica stats would add any value at all if they could be misleading.
I don't think that this is necessary for a model where we don't persist replication data anyways. However, it could be useful to be able to identify a replica, by giving it a name. This name can be provided at startup, and proper configuration of the replicas can make it stable.
Using a name provided by the client, generated automatically each time it starts or set by hand by the user, would make things easier. I guess the idea of having a fixed replica_id that is resilient to restarts - its value would not change - came from the requirement of having replica state that could be trusted. If we end up not implementing any state, and disconnections are treated as if a new replica joined the cluster, we can use whatever id/name we want.
IMO it's better to settle this question once there is a final call on the state.
First of all, frames are applied in "batches" corresponding to a transaction. In the current implementation, the replica polls the frame stream, and applies the frames it received as soon as it detects a transaction boundary. If an error occurs, the stream is shut down, and if the error is recoverable, the replica will re-issue a call to frames. A good estimation of the flushed index is to consider any frame sent to be applied. Backpressure should prevent this number from being too far off, and any recoverable error that occurs while applying will cause a new call to frames that the primary can use to adjust its estimation.
Totally agree that this is a good estimation. I guess the proposal to also have the latest flushed frame (which, I assume, would match the transaction boundary) was motivated by the future option of implementing solutions that might require stronger consistency models. More about this later.
The new last_frame_no_flushed field could be removed from the proposal entirely for now and added later in a different proposal about supporting other consistency models.
If we decide that this information should be persisted, should it be replicated as well?
Good question, we might have three options here:
- No, they are filtered out from the replication process (not sure how easy this would be)
- Yes, they are replicated by design since they are just tables. We should check what would happen with other tables that the replicas might also use for storing statistics; we do not want the replication process overwriting those values.
- Yes, but only the replication state, for implementing stronger consistency models. More about this later.
Again, a first step here could be either not having state at all or just filtering these tables out for now. Further changes, like replicating the replication state, could go into their own proposal.
From the perspective of a connection (process), sqld already guarantees monotonic reads (and even stronger real-time guarantees). What you are suggesting is linearizability, which is a constraint on the system as a whole. I believe this is a far too strong guarantee to provide, and it comes with many technical challenges, to name a few: we have no control over up to what index is read: once the frames are applied, they become visible. There is no obvious way around that. A system that waits for all replicas to reach the same replication index is not guaranteed to make progress. Consensus algorithms, such as raft, enforce that the write is replicated to at least 1/2 of the participants before it can be applied, but raft is a single-reader model. You can relax the single reader with a quorum read, for example, but it's still very costly and scales very poorly with the number of participants.
"Problem" with current master replica(s) architecture is that - correct me if Im wrong here- traffic that goes to different replicas can experience weird results, since replicas are not guarantee to be in sync. This problem becomes really hard to solve for example when you have the replica DBs distributed across all K8S nodes, so external traffic does not have stitching and can go at any moment to any pod (K8S node). Leveraging on quorum algorithms also does not work if you want to keep the traffic at node level, so you can make only one single read.
a system that waits for all replicas to reach the same replication index is not guaranteed to make progress.
Yes, that's the main issue, but it can be tolerated in some use cases where it is preferable to return stale but consistent data rather than the latest write, for example configuration changes or feature flags.
Again, this is more of a future ask than a requirement today, but it's true that to implement this consistency model the replicas would need to know the minimum offset flushed by all of the replicas.
First steps
I have the feeling that a good first step would be to extend the current statistics object, as originally proposed, with the state of the replicas, but without adding any information related to the latest frame flushed.
But I'm still wondering how useful this will really be, since it will not provide an accurate view of the replicas: it will not be resilient to replica disconnections, nor to restarts of the primary. It will also require the system to converge to the final desired state (master up and all replicas connected), which is not guaranteed either.
NOTE: the statistics could still be returned from a specific endpoint of the HTTP server, like http://sqld_host:port/_stats, which could simply return the JSON file.
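Purely as an illustration, serving the existing JSON file on such an endpoint could look like the sketch below, assuming an axum-based HTTP server; the route, file name and handler are made up for the example:

```rust
use axum::{http::header, response::IntoResponse, routing::get, Router};

/// Hypothetical handler that returns the raw stats JSON file as-is.
async fn stats_json() -> impl IntoResponse {
    let body = tokio::fs::read_to_string("stats.json")
        .await
        .unwrap_or_else(|_| "{}".to_string());
    ([(header::CONTENT_TYPE, "application/json")], body)
}

/// Route wiring for the sketch: GET /_stats serves the stats file.
fn stats_router() -> Router {
    Router::new().route("/_stats", get(stats_json))
}
```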
Thoughts?
The stats object currently contains per-instance stats (# rows read/written) and is served on /v1/stats. Given that the cluster information is not an instance property, it would be best not to serve it alongside other instance-specific stats. The master is actually the only instance that is collecting that information, so it should probably only be accessible from there, at least in the beginning.
An instance ID can be created as you suggested, and it can be stored in the replication metadata of the primary here
This id can be sent to the primary by extending the RPC handshake and calls here: https://github.com/libsql/sqld/blob/main/sqld/proto/replication_log.proto
From there, the primary can gather information on the replicas that have been connected to it since it was started.
The master can serve this information on some route, like /v1/cluster_health. All replicas would return a "not a primary" error on this endpoint.
This provides basic information on the cluster and gives a good idea of its state at some point in time. From there, we can decide what to improve.
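For illustration only, the extra handshake data a replica could send and the shape of a /v1/cluster_health payload might look like the following; these types do not exist in sqld today (they would correspond to new fields in replication_log.proto and a new JSON response), and all names are made up:

```rust
use serde::Serialize;

/// Hypothetical extra data a replica could attach to the replication
/// handshake (would map to new fields in replication_log.proto).
#[derive(Debug)]
pub struct ReplicaHello {
    pub replica_id: String,
}

/// Hypothetical JSON payload the primary could serve on /v1/cluster_health.
#[derive(Serialize)]
pub struct ClusterHealth {
    pub primary_frame_no: u64,
    pub replicas: Vec<ReplicaHealth>,
}

#[derive(Serialize)]
pub struct ReplicaHealth {
    pub replica_id: String,
    pub last_frame_no_sent: u64,
    pub last_time_seen: u64,
}
```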
Ok, it's a good first step.