
En/db resolver

Umang01-hash opened this issue 4 months ago • 16 comments

Pull Request Template

Description:

What is DBResolver?

  • Adds a DBResolver module to GoFr, which provides automatic read/write splitting for SQL databases.

  • Read queries (e.g., SELECT) are routed to read replicas, and write queries (e.g., INSERT, UPDATE) are routed to the primary database.

  • Seamlessly wraps the existing SQL datasource: does not require any application code changes for existing queries.

  • Developers interact with c.SQL exactly as before; all routing and failover are fully transparent.

Motivation & Benefits

  • Scalable horizontal read performance with multiple replicas
  • Reduced load on primary database
  • Fault-tolerant: automatic fallback to primary if all replicas fail
  • Clean metrics and tracing support for operational visibility

Example Usage:

package main

import (
	"gofr.dev/pkg/gofr"
	"gofr.dev/pkg/gofr/datasource/dbresolver"
)

type Customer struct {
	ID   int    `db:"id"`
	Name string `db:"name"`
}

func main() {
	a := gofr.New()

	// Initialize the DB resolver
	err := dbresolver.InitDBResolver(a, dbresolver.Config{
		Strategy:      dbresolver.StrategyRoundRobin,        // round-robin or random replica selection
		ReadFallback:  true,                                  // allow reads on primary if all replicas are down
		MaxFailures:   3,                                     // allowed failures before marking a replica as down
		TimeoutSec:    30,                                    // how long (in seconds) a replica stays marked as down
		PrimaryRoutes: []string{"/admin", "/api/payments/*"}, // routes that should always go to the primary
	})
	if err != nil {
		a.Logger().Errorf("failed to initialize db resolver: %v", err)
	}

	// Read endpoint - goes to replica
	a.GET("/customers", func(c *gofr.Context) (interface{}, error) {
		var customers []Customer

		c.SQL.Select(c, &customers, "SELECT id, name FROM customers")

		return customers, nil
	})

	// Write endpoint - goes to primary
	a.POST("/customers", func(c *gofr.Context) (interface{}, error) {
		var customer Customer

		if err := c.Bind(&customer); err != nil {
			return nil, err
		}

		_, err := c.SQL.Exec("INSERT INTO customers (name) VALUES (?)", customer.Name)

		return customer, err
	})

	// Admin endpoint - forced to primary
	a.GET("/admin/customers", func(c *gofr.Context) (interface{}, error) {
		var customers []Customer

		c.SQL.Select(c, &customers, "SELECT id, name FROM customers")

		return customers, nil
	})

	a.Run()
}

Configuration Example:

DB_HOST=localhost
DB_USER=root
DB_PASSWORD=rootpassword
DB_NAME=testdb
DB_PORT=3306
DB_DIALECT=mysql
DB_MAX_IDLE_CONNECTION=2
DB_MAX_OPEN_CONNECTION=0

# Replica hosts (comma-separated, e.g., on ports 3307 and 3308)
DB_REPLICA_HOSTS=localhost:3307,localhost:3308

Testing Strategy:

  • Primary and Replicas launched with docker-compose (primary:3306, replicas:3307/3308)

  • Replication automated by setup scripts; seed data from SQL dump and app endpoints

  • Load Testing: Performed with Apache JMeter simulating high-concurrency API requests (both reads and writes)

Results:

  • Zero error rate; throughput and latency showed no regression compared to the baseline.

  • Read/write split is fully performant, and scaling is achieved without application change.

Checklist:

  • [x] I have formatted my code using goimports and golangci-lint.
  • [x] All new code is covered by unit tests.
  • [x] This PR does not decrease the overall code coverage.
  • [x] I have reviewed the code comments and documentation for clarity.

Thank you for your contribution!

Umang01-hash avatar Aug 07 '25 08:08 Umang01-hash

@Umang01-hash Can you add more details on the sequence of R (Read) and W (Write) cases under which read and write replicas would be selected?

gizmo-rt avatar Aug 11 '25 07:08 gizmo-rt

> @Umang01-hash Can you add more details on the sequence of R (Read) and W (Write) cases under which read and write replicas would be selected?

Hey @gizmo-rt, we first determine whether a query is a read or a write. If it is a read query, we check whether a healthy replica is available. If one is, we select it and send the read query to it; if not, we send the query to the primary database. Methods like QueryContext and Select carry this logic, while methods like Exec, Prepare, etc. go directly to the primary. Transaction methods (Begin, BeginTx) are always routed to the primary.

If all replicas are unhealthy at the time of a read query and fallback is enabled, the read falls back to the primary. If fallback is disabled, the read fails with an error.

If a replica experiences multiple failures, its circuit breaker opens and the replica is temporarily skipped for queries. The timeout period is 30 seconds by default, and we allow 5 failures per replica before the circuit breaker opens.

For an example sequence like R, R, W, R:

1st R: routed to a replica

2nd R: routed to the next replica (depends on whether the chosen strategy is random or round-robin)

1st W: routed to the primary

3rd R: routed to the next replica
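
For anyone reading along, here is a minimal, self-contained sketch of that routing decision (not the actual PR code; the replica, resolver and route names are illustrative): classify the statement, then walk the replicas in round-robin order, skipping any whose circuit breaker is open, and fall back to the primary only if allowed.

package main

import (
	"errors"
	"fmt"
	"strings"
	"sync/atomic"
)

// replica is an illustrative stand-in for one read replica plus its
// circuit-breaker state.
type replica struct {
	name string
	open atomic.Bool // true while the circuit breaker is open (replica is skipped)
}

type resolver struct {
	replicas []*replica
	next     atomic.Uint64
	fallback bool
}

// isReadQuery treats SELECT/WITH statements as reads; everything else
// (INSERT, UPDATE, DDL, ...) goes to the primary.
func isReadQuery(query string) bool {
	q := strings.ToUpper(strings.TrimSpace(query))
	return strings.HasPrefix(q, "SELECT") || strings.HasPrefix(q, "WITH")
}

// route returns the target a query would be sent to.
func (r *resolver) route(query string) (string, error) {
	if !isReadQuery(query) {
		return "primary", nil
	}
	// Round-robin over replicas, skipping any with an open circuit breaker.
	for i := 0; i < len(r.replicas); i++ {
		idx := int(r.next.Add(1)) % len(r.replicas)
		if !r.replicas[idx].open.Load() {
			return r.replicas[idx].name, nil
		}
	}
	if r.fallback {
		return "primary", nil // all replicas down, ReadFallback enabled
	}
	return "", errors.New("no healthy replica available and fallback is disabled")
}

func main() {
	r := &resolver{replicas: []*replica{{name: "replica-1"}, {name: "replica-2"}}, fallback: true}
	for _, q := range []string{"SELECT 1", "SELECT 2", "INSERT INTO t VALUES (1)", "SELECT 3"} {
		target, _ := r.route(q)
		fmt.Println(q, "->", target) // the R, R, W, R sequence described above
	}
}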

Umang01-hash avatar Aug 12 '25 05:08 Umang01-hash

> […] If a replica experiences multiple failures, its circuit breaker opens and the replica is temporarily skipped for queries. The timeout period is 30 seconds by default, and we allow 5 failures per replica before the circuit breaker opens. […]

So if we have 10 replicas, then in the worst case, with 5 retries and a 30 s timeout, the user has to wait 1500 s, i.e. 25 minutes, before getting a response?? Curious, since observed request timeouts are generally in the range of 30 s to 5 minutes.

gizmo-rt avatar Aug 12 '25 07:08 gizmo-rt

> So if we have 10 replicas, then in the worst case, with 5 retries and a 30 s timeout, the user has to wait 1500 s, i.e. 25 minutes, before getting a response?

No @gizmo-rt! It doesn't work that way.

When a read request comes in:

  1. We check our list of replicas and only consider the ones with closed circuit breakers
  2. If a replica fails, we immediately fall back to the primary (if fallback is enabled) or return an error

The 30-second timeout is how long a circuit breaker stays OPEN after failures, not how long each request waits. Once a circuit breaker opens, we stop sending traffic to that replica for 30 seconds. There is no chained retry across all replicas per request.

The system actively avoids replicas with open circuit breakers during routing, so those replicas are simply skipped.

A typical request flow with failing replicas:

Request comes in
→ check replica 1: circuit open (skip)
→ check replica 2: circuit open (skip)
→ try replica 3: fails immediately
→ fall back to primary: succeeds
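
To make the "no chained waiting" point concrete, here is a small self-contained sketch of a per-replica circuit breaker in this style (the breaker, allow and record names are illustrative, not the PR's API): once the failure threshold is hit, the replica is simply skipped for the cooldown window; nothing in the request path ever sleeps for 30 seconds.

package main

import (
	"fmt"
	"sync"
	"time"
)

// breaker is an illustrative per-replica circuit breaker: after maxFailures
// consecutive errors it opens for the cooldown window, during which the
// resolver skips this replica when routing reads. It is only a
// "do not use yet" flag; no request waits out the cooldown.
type breaker struct {
	mu          sync.Mutex
	failures    int
	openedAt    time.Time
	maxFailures int
	cooldown    time.Duration
}

// allow reports whether the replica may be used right now.
func (b *breaker) allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.failures < b.maxFailures {
		return true
	}
	// Open: skip the replica until the cooldown has elapsed.
	if time.Since(b.openedAt) < b.cooldown {
		return false
	}
	// Cooldown over: give the replica another chance (half-open).
	b.failures = 0
	return true
}

// record updates the failure count after a query attempt.
func (b *breaker) record(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.failures = 0
		return
	}
	b.failures++
	if b.failures == b.maxFailures {
		b.openedAt = time.Now()
	}
}

func main() {
	b := &breaker{maxFailures: 5, cooldown: 30 * time.Second}
	for i := 0; i < 6; i++ {
		b.record(fmt.Errorf("connection refused"))
	}
	// false: the replica is skipped and requests immediately go elsewhere.
	fmt.Println("replica allowed after 6 failures:", b.allow())
}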

Umang01-hash avatar Aug 13 '25 09:08 Umang01-hash

@Umang01-hash Hope we are routing the reads within a transaction to the primary? I couldn't find the corresponding code for it.

akshat-kumar-singhal avatar Aug 26 '25 06:08 akshat-kumar-singhal

> Hope we are routing the reads within a transaction to the primary? I couldn't find the corresponding code for it.

Yes @akshat-kumar-singhal, reads within transactions are always routed to the primary database because the transaction itself is created on the primary connection. Since Begin() returns a transaction object from the primary database, all subsequent operations on this transaction (including reads) automatically use the primary connection.

// Begin always routes to primary (transactions).
func (r *Resolver) Begin() (*gofrSQL.Tx, error) {
    r.stats.totalQueries.Add(1)
    return r.primary.Begin()
}
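
As a usage illustration (a hypothetical handler, assuming the usual Exec/QueryRow/Commit methods on GoFr's transaction wrapper), both the write and the follow-up read below hit the primary, since the transaction is created on the primary connection:

	a.POST("/customers/txn", func(c *gofr.Context) (interface{}, error) {
		tx, err := c.SQL.Begin()
		if err != nil {
			return nil, err
		}
		defer tx.Rollback() // safe no-op once Commit has succeeded

		if _, err := tx.Exec("INSERT INTO customers (name) VALUES (?)", "Alice"); err != nil {
			return nil, err
		}

		var id int
		// Read-your-own-write: this SELECT runs on the same primary connection.
		if err := tx.QueryRow("SELECT id FROM customers WHERE name = ?", "Alice").Scan(&id); err != nil {
			return nil, err
		}

		return id, tx.Commit()
	})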

Umang01-hash avatar Aug 26 '25 07:08 Umang01-hash

This PR implements SQL-level routing (parsing queries) to split reads vs writes. Another approach is to use the HTTP method (e.g. GET=read, POST/PUT=write) to choose the DB. Given cases where a handler might do SELECT then UPDATE, or a GET-route that writes, which strategy do people prefer? Any thoughts on pros/cons?

@aryanmehrotra @ccoVeille @akshat-kumar-singhal any suggestions on it??

Umang01-hash avatar Sep 18 '25 15:09 Umang01-hash

> This PR implements SQL-level routing (parsing queries) to split reads vs writes. Another approach is to use the HTTP method (e.g. GET=read, POST/PUT=write) to choose the DB. Given cases where a handler might do SELECT then UPDATE, or a GET route that writes, which strategy do people prefer? Any thoughts on pros/cons?

@Umang01-hash Any issues you see with the current implementation of identifying it based on the query? Another way to think of this implementation is by drawing a parallel to cache. Given that we'd want to limit the visibility of cache/read DB within the store layer, the store layer should be able to decide whether to use primary or replica/cache based on the operation (SELECT/UPDATE etc) and transaction.

akshat-kumar-singhal avatar Sep 22 '25 05:09 akshat-kumar-singhal

> @Umang01-hash Any issues you see with the current implementation of identifying it based on the query? […] the store layer should be able to decide whether to use primary or replica/cache based on the operation (SELECT/UPDATE etc) and transaction.

One issue which might occur is if there is a POST request where the user first inserts a value and then fetches it. Based on the query-pattern approach we currently follow, there is a good chance that the fetch will not find the resource (it may hit a replica before replication catches up), so the correct approach seems to be routing based on the request method type rather than the query type.

aryanmehrotra avatar Oct 24 '25 08:10 aryanmehrotra

> One issue which might occur is if there is a POST request where the user first inserts a value and then fetches it. Based on the query-pattern approach we currently follow, there is a good chance that the fetch will not find the resource, so the correct approach seems to be routing based on the request method type rather than the query type.

@aryanmehrotra While selecting the primary/replica based on the request method type feels safer, we may still have issues due to replication lag. Also, in case of a poor implementation by the application dev (for example, DB writes in a GET call), we can have other issues. We may want to consider supporting a config (a list of endpoints, maybe) which would operate on the replica instead of the primary. This would provide more control to developers, and heavier read operations could be safely routed to the replica.

akshat-kumar-singhal avatar Oct 24 '25 08:10 akshat-kumar-singhal

> We may want to consider supporting a config (a list of endpoints, maybe) which would operate on the replica instead of the primary. This would provide more control to developers, and heavier read operations could be safely routed to the replica.

We should document poor practices like DB writes in GET calls, since features like status code handling are based on request type and overriding them breaks best practices.

For routing queries, we could either:

  • support a config with endpoints that always use replicas for heavy reads, or
  • let users specify which queries run on which DB.

This covers more use cases, but ideally we want a cleaner approach where devs don’t have to manage this manually. If not, either of the options works.

aryanmehrotra avatar Oct 24 '25 09:10 aryanmehrotra

Hey! Here are my suggestions for the above problems/comments:

  1. Transaction Scoping: When developers use Begin() → Commit() (database transactions), all queries inside that block would automatically go to the primary database. This would solve the POST → INSERT → SELECT pattern because both operations happen on the same database.

  2. Read-After-Write Consistency: We could track when any write happens and, for the next few seconds, automatically route all read queries to the primary instead of replicas. This would handle cases where you INSERT data and immediately try to SELECT it outside of transactions (a minimal illustrative sketch follows this list).

  3. Why SQL-Based Is Still Better: HTTP methods (GET/POST) don't tell us what the actual SQL query does. A GET endpoint might cache data (write), or a POST might only validate data (read). Our SQL-based approach looks at the actual database operation (SELECT vs INSERT) rather than guessing from the HTTP method.
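
A minimal illustrative sketch of suggestion 2 (the writeTracker type and the 2-second window are made up for the example, not an implemented API): remember when the last write happened and pin reads to the primary while they fall inside that window.

package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// writeTracker is an illustrative global "last write" clock. After any write,
// reads within the window are pinned to the primary so freshly written rows
// are visible despite replication lag.
type writeTracker struct {
	lastWriteNano atomic.Int64
	window        time.Duration
}

func (t *writeTracker) noteWrite() { t.lastWriteNano.Store(time.Now().UnixNano()) }

// usePrimaryForReads reports whether the most recent write is still inside
// the consistency window.
func (t *writeTracker) usePrimaryForReads() bool {
	last := time.Unix(0, t.lastWriteNano.Load())
	return time.Since(last) < t.window
}

func main() {
	t := &writeTracker{window: 2 * time.Second}

	fmt.Println("before any write ->", t.usePrimaryForReads()) // false: replicas are fine
	t.noteWrite()
	fmt.Println("right after write ->", t.usePrimaryForReads()) // true: pin reads to primary
}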

Question to all: Does the above approach sound reasonable? Would these two additions solve the consistency concerns while keeping the implementation simple? Happy to explore HTTP method-based routing if you think it's a better path forward.

@aryanmehrotra @ccoVeille @akshat-kumar-singhal @gizmo-rt - thoughts?

Umang01-hash avatar Oct 28 '25 07:10 Umang01-hash

> Hey! Here are my suggestions for the above problems/comments:
>
> 1. Transaction Scoping: When developers use Begin() → Commit() (database transactions), all queries inside that block would automatically go to the primary database. This would solve the POST → INSERT → SELECT pattern because both operations happen on the same database.

I highly doubt someone would do POST → INSERT → SELECT inside a transaction; mostly we do inserts/updates in transactions, and SELECTs happen separately.

Also, if we think about the three-layer architecture, there are different methods for creating an entry and getting that entry; right now we can't do that in a single transaction in GoFr.

> 2. Read-After-Write Consistency: We could track when any write happens and, for the next few seconds, automatically route all read queries to the primary instead of replicas. This would handle cases where you INSERT data and immediately try to SELECT it outside of transactions.

It would be very expensive. Imagine writes or updates happening from a cron job every 15 minutes; they would mostly be hitting a particular table, so we would also need to be table-aware in that case. Otherwise, even after having the read/write split, most queries would end up going to the primary anyway.

> 3. Why SQL-Based Is Still Better: HTTP methods (GET/POST) don't tell us what the actual SQL query does. A GET endpoint might cache data (write), or a POST might only validate data (read). Our SQL-based approach looks at the actual database operation (SELECT vs INSERT) rather than guessing from the HTTP method.

What do best practices say about how GET and POST endpoints should function, and what do we already follow in GoFr?

aryanmehrotra avatar Oct 28 '25 07:10 aryanmehrotra

@aryanmehrotra Thanks for your suggestions. Your points seem valid - transaction scoping wouldn't work in three-layer architecture, and global read-after-write tracking would be too expensive.

Request-Scoped Routing: Instead of query-level or global tracking, we can route all queries within a single HTTP request based on the HTTP method. So if someone hits POST /users, ALL database queries (both INSERT and SELECT) within that request handler automatically go to primary. GET requests continue using replicas.

This solves the POST → INSERT → SELECT consistency issue and aligns with REST best practices - GET should be safe (no side effects) and POST/PUT are for state changes. GoFr already follows these patterns in its routing and status code handling. Thoughts on this approach?
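
A minimal sketch of the mechanism (plain net/http middleware for illustration; the context key and helper names are hypothetical, not a GoFr API): the middleware stamps whether the request method is read-only into the context, and the resolver consults that flag instead of parsing SQL.

package main

import (
	"context"
	"fmt"
	"net/http"
)

type ctxKey struct{}

// useHTTPMethod is an illustrative middleware that records whether the
// incoming request uses a "safe" method; the resolver then routes every
// query in this request to a replica (GET/HEAD) or the primary (others).
func useHTTPMethod(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		readOnly := r.Method == http.MethodGet || r.Method == http.MethodHead
		ctx := context.WithValue(r.Context(), ctxKey{}, readOnly)
		next.ServeHTTP(w, r.WithContext(ctx))
	})
}

// routeForRequest is what the resolver would consult instead of parsing SQL:
// all queries inside a POST /users handler land on the primary, so the
// INSERT-then-SELECT pattern stays consistent within one request.
func routeForRequest(ctx context.Context) string {
	if readOnly, ok := ctx.Value(ctxKey{}).(bool); ok && readOnly {
		return "replica"
	}
	return "primary"
}

func main() {
	// Simulate what the resolver would see for a GET vs a POST request.
	getCtx := context.WithValue(context.Background(), ctxKey{}, true)
	postCtx := context.WithValue(context.Background(), ctxKey{}, false)
	fmt.Println("GET  ->", routeForRequest(getCtx))  // replica
	fmt.Println("POST ->", routeForRequest(postCtx)) // primary
}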

An extension of this, as suggested by @akshat-kumar-singhal, is route-level configuration: we can also provide optional endpoint-level control where developers can configure specific routes to always use the primary:

# Optional: Force specific endpoints to always use primary
DB_PRIMARY_ROUTES=/users/search,/reports/heavy,/admin/*

Umang01-hash avatar Oct 28 '25 07:10 Umang01-hash

Routes will need to have the method as well. We would be better off accepting this configuration via code instead of complicating the ENV file. What we should ensure is the ability to switch seamlessly between a primary-replica setup and primary-only (local).


akshat-kumar-singhal avatar Oct 28 '25 09:10 akshat-kumar-singhal

> An extension of this, as suggested by @akshat-kumar-singhal, is route-level configuration: we can also provide optional endpoint-level control where developers can configure specific routes to always use the primary:
>
> # Optional: Force specific endpoints to always use primary
> DB_PRIMARY_ROUTES=/users/search,/reports/heavy,/admin/*

Yes, this seems better. Also, an env variable shouldn't be used for this; we should give the option from the code itself. This can come as an enhancement based on user feedback once we roll out the feature based on routes alone.
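
To make the code-level option concrete, a rough sketch of what it might look like (ReplicaRoutes and the Route type with a Method field are purely hypothetical, extending the existing PrimaryRoutes idea so each route carries the HTTP method as well; with no replicas configured the resolver would behave as primary-only, so local setups keep working unchanged):

	err := dbresolver.InitDBResolver(a, dbresolver.Config{
		Strategy:     dbresolver.StrategyRoundRobin,
		ReadFallback: true,

		// Hypothetical: method + path pairs that should always read from a replica,
		// e.g. heavy read endpoints.
		ReplicaRoutes: []dbresolver.Route{
			{Method: "GET", Path: "/users/search"},
			{Method: "GET", Path: "/reports/heavy"},
		},

		// Hypothetical: routes that must always use the primary, even for reads.
		PrimaryRoutes: []dbresolver.Route{
			{Method: "GET", Path: "/admin/*"},
		},
	})
	if err != nil {
		a.Logger().Errorf("failed to initialize db resolver: %v", err)
	}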

aryanmehrotra avatar Oct 28 '25 11:10 aryanmehrotra