Adding with_master and release_master! helpers.

Open kigster opened this issue 6 years ago • 12 comments

This is just a concept, but I think an important one to have...

kigster avatar Jun 13 '18 09:06 kigster

We have a similar need and ended up with this ActiveRecord::Base helper, which scopes sticking to master to a block. Feel free to adapt it:

ActiveRecord::Base.class_eval do
  def self.on_master
    # Remember whether this connection was already stuck to master,
    # so nested on_master calls don't release stickiness prematurely.
    previously_stuck = self.connection.send(:stuck_to_master?)
    # Force every query inside the block to the primary.
    self.connection.stick_to_master!(true) unless previously_stuck
    yield
  ensure
    # Only release the stickiness we created ourselves.
    unless previously_stuck
      connection_id = self.connection.instance_variable_get(:@id)
      Makara::Context.release(connection_id)
    end
  end
end

MyModel.on_master do 
  # do stuff
end

Curious: is there any way to achieve the same thing with sticky: false?

chrisb avatar Jul 25 '18 13:07 chrisb

That’s exactly what I wanted to achieve: to have on_master work with or without the sticky flag. Thanks for sharing your block!

kigster avatar Jul 25 '18 14:07 kigster

@kigster yeah, sadly my snippet only works when sticky is true. If you find a way to force master with sticky disabled, please share!

chrisb avatar Jul 25 '18 15:07 chrisb

Did you manage to get this done for the latest makara?

camol avatar Nov 21 '23 12:11 camol

Did you manage to get this done for the latest makara?

Believe it or not, it's a very timely albeit complex question.


Background

I got involved with Makara while I was the CTO at Wanelo.com and Brian Leonard was the VPE at TaskRabbit.

Our offices were close by, and during one of our lunches he told me about Makara. Despite the fact that TaskRabbit used MySQL and there was no PostgreSQL support yet, I instantly knew this was exactly what my team needed, and we weren't afraid to port it to PostgreSQL. This was around 2010-2012.

While several other gems claimed the ability to spread the DB load to a replica, upon further investigation we discovered that none besides Makara was written with multi-threading in mind.

Many people at the time used Unicorn and Resque – both single-threaded multi-process models (and highly memory inefficient).

We were already on Puma and Sidekiq, both multi-threaded gems.

Fast forward to literally right now: as part of yet another scaling project at my current work, we are taking advantage of the relatively new native read/write splitting support that became available in Rails 6 & 7.

Having used Makara extensively, and having played with Rails read/write splitting in the last few months, I feel I can make a meaningful comparison.

I know this is not exactly what you asked but bear with me, because the short answer to your question is 'it depends'.

Rails Read/Write Splitting

This approach has some unfortunate limitations:

  • For each primary you can use no more than one replica (for now)

  • If the replica is on the critical path and something happens to it, the app will likely go down because there is no built-in recovery mechanism.

  • In order to take advantage of the replica, the least risky method seems to be identifying and moving individual queries that are particularly heavy. Good candidates are background jobs that do not need to run urgently; being able to start a job with a slight delay (say, 2 minutes) solves the majority of issues with replication delay and eventual consistency.

After considerable effort to incorporate the new replica into serving traffic, it still handles only around 10% of our select queries — the ones we've explicitly migrated.
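
For context, here is a minimal sketch of what moving one such query to the replica looks like with the built-in Rails mechanism. connected_to(role: :reading) is the real Rails 6+ API; the job, model, and mailer names are made up, and a reading role is assumed to be configured in database.yml:

# Hypothetical background job; the report query is the kind of heavy,
# non-urgent read worth moving to the replica "one query at a time".
class WeeklyReportJob < ApplicationJob
  queue_as :low_priority

  def perform(account_id)
    # Everything inside the block uses the connection configured
    # for the :reading role (the replica).
    rows = ActiveRecord::Base.connected_to(role: :reading) do
      Order.where(account_id: account_id)
           .where(created_at: 1.week.ago..Time.current)
           .group(:status)
           .count
    end

    ReportMailer.weekly(account_id, rows).deliver_later
  end
end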

Makara

When we first started using Makara at Wanelo in 2011, our Rails site was peaking at around 250K RPM.

Advantages of Makara

  • Makara is much more established, and has been in production on sites such as Wanelo and TaskRabbit since 2011; in 2018 I introduced Makara to Homebase (joinhomebase.com) and they never looked back. Obviously, Instacart uses it.

  • I consulted with several other companies that adopted Makara and have since scaled up their traffic by orders of magnitude.

  • Not only did Makara allow companies to spread the read traffic across any number of replicas, but it offered a choice to send some arbitrary portion of the read traffic to the primary as well.

  • If your replicas aren't on identical hardware, you can assign weights to each replica, sending more traffic to the faster/larger replica and less traffic to a smaller one (see the example config after this list).

  • In addition to scalability, Makara offers fault tolerance: automatic blacklisting of disconnected replicas with automatic recovery. If one of the replicas dies, Makara transparently blacklists it for a period of time and stops sending traffic to it, while attempting to reconnect behind the scenes.

  • Makara supports "stickiness", which is a rather complicated concept and I am not entirely sold on its universal applicability.
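
For reference, a weighted multi-replica setup looks roughly like this in database.yml (recalled from the Makara README, so treat the exact keys as approximate; host names are placeholders):

production:
  adapter: postgresql_makara
  database: myapp_production
  makara:
    blacklist_duration: 30          # seconds a dead replica is kept out of rotation
    connections:
      - role: master
        host: primary.db.internal
      - role: slave
        host: replica1.db.internal
        weight: 2                   # bigger box, gets twice the read traffic
      - role: slave
        host: replica2.db.internal
        weight: 1
      - role: slave                 # list the primary here too to send it a share of reads
        host: primary.db.internal
        weight: 1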

So, what is Stickiness?

Stickiness defines a period of time during which a web request thread will "stick to the primary" for all select queries following an important DB write. Depending on the duration of stickiness, subsequent web requests from the same user may also be forced onto the primary. The goal is, for example, to make sure you aren't suddenly unable to "see" data you just saved.

It was this use case (very short stickiness and a small number of very critical queries) that prompted the force_master and retry_on_master helper proposals in this GitHub issue.

Turning on stickiness requires that you additionally and carefully choose the stickiness duration based on the traffic, immediacy requirements of the product, and other constraints.
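
In Makara, both the switch and the duration live in the makara block of database.yml; the values here are illustrative:

  makara:
    sticky: true
    master_ttl: 5   # seconds a context stays stuck to the primary after a write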

Alternatives to Stickiness

While cookies can carry stickiness between web requests, for obvious reasons background jobs have no such luxury.

But they have other neat features that more than compensate for clunky stickiness on the web.

If you use Sidekiq, you have access to several very relevant features:

  1. Sidekiq jobs can be partially or universally delayed by any number of seconds or even minutes. If, instead of including Sidekiq::Worker into each job, you create an intermediate module of your own, e.g. Background::Worker, not only can you use that module to decouple your workers from direct links to Sidekiq, but you can also override perform_async() with perform_in(1.minute, *args, **opts) — see the sketch below.
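
A minimal sketch of that intermediate module. Background::Worker, SyncProfileJob, and the 60-second delay are illustrative; Sidekiq::Worker, perform_async, and perform_in are real Sidekiq APIs:

require "sidekiq"

module Background
  module Worker
    def self.included(base)
      base.include(Sidekiq::Worker)  # keep the normal Sidekiq behaviour
      base.extend(ClassMethods)      # our enqueue override takes precedence
    end

    module ClassMethods
      DELAY_SECONDS = 60 # give replicas a moment to catch up before the job runs

      # Every "immediate" enqueue becomes a slightly delayed one.
      def perform_async(*args)
        perform_in(DELAY_SECONDS, *args)
      end
    end
  end
end

class SyncProfileJob
  include Background::Worker

  def perform(user_id)
    # by the time this runs, the replica has very likely replayed the write
  end
end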

For instance, Sidekiq workers can't use stickiness, since they do not support a cookie.

  • But implementing Makara lets you achieve horizontal scalability a lot faster, taking advantage of one or more replicas by sending them reads. By configuring the split in database.yml, one can quickly divert 50% of select queries from the primary to the replica. Compare that to the "one query at a time" method using built-in Rails.

It should come as no surprise that I personally much prefer Makara, but to be honest it's been a while since I've used it.

Stickiness and Replication Delay

These two concepts are closely related. If your replicas are able to keep the typical replication delay within fractions of a second, then stickiness may not be needed (or can be extremely short, say 300ms).

Replication delay introduces the concept of "eventual consistency" into the architecture.

TLDR;

I am currently pitching to my company to experiment with Makara. If I am successful, I'd be more than happy to submit a proper PR with those helpers.

Sorry for the awfully long essay :)

kigster avatar Nov 21 '23 15:11 kigster

Thank you for this. Really helpful. Our main intention is to sometimes read from master in order to avoid the replica lag that we occasionally experience in our APIs. It is not like we use it everywhere, but it has really helped us in many very strange cases.

camol avatar Nov 21 '23 15:11 camol

Keep in mind you can execute a fast and relatively cheap query against the replica to compute the replication delay.

If I wrote this, I'd run it periodically on a single dedicated thread in that Ruby VM. Then, for queries where timing is critical, you can ask that thread for the latest delay and make your decision accordingly.
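
Here is a rough sketch of that idea, assuming PostgreSQL and a Rails reading role pointed at the replica. The ReplicationLagMonitor class and the 300ms threshold in the usage comment are made up; pg_last_xact_replay_timestamp() is a real PostgreSQL function available on replicas:

class ReplicationLagMonitor
  # On a replica this returns the seconds since the last replayed transaction,
  # i.e. an approximation of the current replication delay.
  LAG_SQL = <<~SQL
    SELECT EXTRACT(EPOCH FROM (now() - pg_last_xact_replay_timestamp()))
  SQL

  def initialize(interval: 5)
    @interval = interval
    @lag = 0.0
    @thread = Thread.new { loop { poll; sleep @interval } }
  end

  # Cheap to call from request/job code: just reads the cached value.
  attr_reader :lag

  private

  def poll
    ActiveRecord::Base.connected_to(role: :reading) do
      @lag = ActiveRecord::Base.connection.select_value(LAG_SQL).to_f
    end
  rescue StandardError
    @lag = Float::INFINITY  # replica unreachable: treat it as badly lagged
  end
end

# Usage (LAG_MONITOR would be created once at boot):
#   role = LAG_MONITOR.lag > 0.3 ? :writing : :reading
#   ActiveRecord::Base.connected_to(role: role) { User.find(id) }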

kigster avatar Nov 21 '23 15:11 kigster

Well, in the API I have no choice: the data needs to be there right away, and sometimes it is simply not there yet and we cannot afford to wait - usually just simple Rails find() stuff.

camol avatar Nov 21 '23 17:11 camol

@camol Have you considered adding a database-level statement timeout?

Oftentimes when replicas lag, it's because someone is running a long query on the replica that, in order to finish, must push back on WAL transfer and application.
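
For reference, one way to add such a timeout (assuming PostgreSQL, and independent of Makara) is the variables key on the replica's database.yml entry, which issues SET statement_timeout whenever a connection is established; the host name is a placeholder:

replica:
  adapter: postgresql
  host: replica.db.internal
  variables:
    statement_timeout: 5000   # milliseconds; long-running replica queries get cancelled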

kigster avatar Nov 21 '23 17:11 kigster

The other trick we used was to have a sidekiq server that was ONLY connected to the master.

The rest (a lot more) read only from the replica.

If we enqueued a job that had to see the current data, we used the queue that was attached to the primary.

kigster avatar Nov 21 '23 17:11 kigster

This does not solve the problem for us. The queries are fast and simple, and we cannot afford to wait any extra time; it simply means the replica does not yet have the data, the record is not found on the slave at that moment, and we absolutely need master for this particular read. This strategy has worked perfectly well for us, we do not want to risk any issues, and we are trying to find a way to maintain the exact same behaviour.

camol avatar Nov 21 '23 20:11 camol

This concept — spreading the reads to a potentially lagging replica — exists for a very specific reason.

It's needed when your traffic goes beyond what the largest database instance, with 1TB of RAM, 256 CPU cores, and a 15-SSD disk array, can handle.

Do NOT use replicas for reads if you need 100% accuracy (like in financial or medical domains).

But absolutely DO use them when you can tolerate a slight delay by queuing your jobs, and when it's not mission critical if a user occasionally sees old data.

Great examples of apps that might use Makara are social apps, content delivery, chat, etc. Apps that are inherently asynchronous can scale 10X compared to using a single primary, by using many replicas with Makara routing the traffic.

The alternative for 100% accuracy is horizontal partitioning of the data across multiple masters. This is also the only method that works when your scaling problem is not the reads but write IO.

My 2c.

--kig

kigster avatar Nov 28 '23 10:11 kigster