Makara 0.4.0 upgrade resulted in most requests going to master

Today we tried upgrading Makara to 0.4.0. For some reason, in our case, it resulted in what appeared to be a lot more queries going to master than before. I am providing two graphs of particular interest:

The first one shows the CPU usage of the databases: one master and two replicas. The two vertical lines are deploys: the first turning Makara 0.4.0 on, and the second reverting back to 0.3.10.

[Graph: makara-0_4_0]

The second is the cache hit rate for our memcached server (which we use as a Rails cache, and which would have been used for context storage with Makara < 0.4.0). You can see that once we switched, the hit rate went up considerably, likely indicating that memcached is now used only for caching, not caching + Makara context.

[Graph: makara-memcached]

I looked through the code trying to understand how Makara decides whether to stick a connection, and found it a bit confusing. If I choose force_master! with sticky disabled, would it send the query to the master or not? See issue #205 for additional context. In our case, however, stickiness was enabled. Here is the config file:

production:
  adapter: postgresql_makara
  encoding: utf8
  host: localhost
  username: XXXX
  password: XXXX
  reconnect: true
  pool: 25
  port: 5432
  prepared_statements: false
  makara:
    blacklist_duration: 30
    master_ttl: 5
    master_strategy: round_robin
    sticky: true
    connection_error_matchers:
      - !ruby/regexp '/pg::error: : select/'
      - !ruby/regexp '/no more connections allowed/'
      - !ruby/regexp '/result has been cleared/'
      - !ruby/regexp '/no connection to the server/'
      - !ruby/regexp '/the database system is (starting up|shutting down)/'
      - !ruby/regexp '/reset has failed/'
      - !ruby/regexp '/connection not open/i'
    connections:
      - name: master
        role: master
        database: master
      - name: replica1
        role: slave
        database: replica1
        weight: 1
      - name: replica2
        role: slave
        database: replica2
        weight: 1

So I wanted to document our case, because it nearly took our site down as the master quickly got overloaded. Perhaps stickiness was broken in our case to begin with, and then perhaps it started working? Or maybe we always run under an ActiveRecord transaction within a request context? Not sure.

We can try again with a much shorter master_ttl, like 1 second. I found the discussion in #162 relevant.

My thinking is that choosing whether to use stickiness should probably be separate from whether it's possible to force a master connection in a given context within the app. Perhaps I am missing something there.

Down the road I am thinking about adding a helper or two of this sort:

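# with_master is a hypothetical helper that runs the block against the master connection.
def try_master_if_blank(connection, &block)
  # Try the given connection first; fall back to master on a nil/false result.
  yield(connection) || with_master(connection) { |master| block.call(master) }
end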
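
To make the intent of that helper concrete, here is a minimal runnable sketch. It assumes toy stand-ins that are not part of Makara: FakeConnection, MASTER, REPLICA, and with_master are all hypothetical names invented for the example, and the replica is simply missing a row to simulate replication lag.

```ruby
# Toy stand-ins for illustration only; none of this is Makara's API.
FakeConnection = Struct.new(:name, :rows) do
  # Returns the stored row for the id, or nil when the row is not there (yet).
  def find(id)
    rows[id]
  end
end

MASTER  = FakeConnection.new("master",  { 1 => "alice", 2 => "bob" })
REPLICA = FakeConnection.new("replica", { 1 => "alice" }) # row 2 not replicated yet

# Hypothetical helper: run the block against the master connection.
def with_master(_connection)
  yield MASTER
end

# Try the given connection first; retry on master if the result is nil or false.
def try_master_if_blank(connection, &block)
  yield(connection) || with_master(connection) { |master| block.call(master) }
end

try_master_if_blank(REPLICA) { |conn| conn.find(1) } # served by the replica
try_master_if_blank(REPLICA) { |conn| conn.find(2) } # replica misses, master answers
try_master_if_blank(REPLICA) { |conn| conn.find(3) } # missing everywhere, stays nil
```

Note that `||` only falls back on nil or false, so an empty array or empty relation would not trigger the retry. In a Rails app, calling ActiveSupport's `presence` on the first result would cover those cases too.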

May 31 '18 04:05 kigster

I was thinking about this lately as well, but also because it'd be useful to have the opposite too: a way of forcing some queries to always go to the secondary/replica, independently of the context. Perhaps something like:

proxy.on_master(sticky: false) do  # sticky: true by default, if the config is set to stick
  # Stuff that will always go to primary
end

proxy.on_slave do
  # Stuff that will always go to read replicas
end

The one for the read replicas might be useful in cases where you don't care about staleness and have to run heavy read queries that you don't want the primary to be busy with.
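
As a rough sketch of how such block helpers could behave, here is a toy router. ToyProxy and its route method are hypothetical and only model the routing decision; this is not Makara's proxy, and on_master/on_slave are the proposed names from above, not an existing API.

```ruby
# Toy router for illustration; not Makara's proxy and not its API.
class ToyProxy
  def initialize(sticky: true)
    @sticky = sticky # mirrors the `sticky` config flag
    @forced = nil    # :master or :slave while inside a block, nil otherwise
    @stuck  = false  # whether later reads should stay on master
  end

  # Everything in the block goes to master; optionally stay stuck afterwards.
  def on_master(sticky: @sticky)
    with_forced(:master) { yield }
    @stuck = true if sticky
  end

  # Everything in the block goes to a replica, regardless of stickiness.
  def on_slave
    with_forced(:slave) { yield }
  end

  # Decide where a statement would be sent.
  def route(sql)
    return @forced if @forced
    return :master if @stuck
    sql.lstrip.match?(/\Aselect\b/i) ? :slave : :master
  end

  private

  # Set the forced role for the duration of the block, then restore it.
  def with_forced(role)
    previous, @forced = @forced, role
    yield
  ensure
    @forced = previous
  end
end

proxy = ToyProxy.new(sticky: true)
proxy.on_master(sticky: false) { proxy.route("SELECT 1") } # forced to master, no sticking
proxy.on_slave { proxy.route("SELECT 1") }                 # forced to a replica
```

The point of the design is that forcing a role inside a block and sticking to master afterwards are tracked separately, so one can be used without the other.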

Jun 10 '18 07:06 rosa

Interesting idea, though Makara always sends SELECT queries to a replica unless it’s stuck to master.
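
A simplified model of that rule, combined with the master_ttl setting from the config above, might look like this. StickyRouter is a hypothetical class written for illustration, not Makara's implementation, and the clock is passed in explicitly so the example is deterministic.

```ruby
# Simplified model of the routing rule; not Makara's implementation.
class StickyRouter
  def initialize(master_ttl:, sticky: true)
    @master_ttl  = master_ttl # seconds to stay on master after a write
    @sticky      = sticky
    @stuck_until = nil
  end

  # `now` is a timestamp in seconds supplied by the caller.
  def route(sql, now:)
    if read?(sql)
      stuck?(now) ? :master : :replica
    else
      # A write goes to master and, when sticky, pins later reads there too.
      @stuck_until = now + @master_ttl if @sticky
      :master
    end
  end

  private

  def read?(sql)
    sql.lstrip.match?(/\Aselect\b/i)
  end

  def stuck?(now)
    !@stuck_until.nil? && now < @stuck_until
  end
end

router = StickyRouter.new(master_ttl: 5)
router.route("SELECT * FROM users", now: 0)          # => :replica
router.route("UPDATE users SET name = 'x'", now: 1)  # => :master, stuck until t = 6
router.route("SELECT * FROM users", now: 3)          # => :master (still stuck)
router.route("SELECT * FROM users", now: 7)          # => :replica
```

In this model, if writes arrive more often than once per master_ttl window, nearly every read lands on master, which is one reason a shorter TTL could reduce master load.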

Jun 11 '18 16:06 kigster