
WIP: Adapter for GitHub's Trilogy MySQL client

Open casperisfine opened this issue 2 years ago • 2 comments

This lays the groundwork for an adapter; it's still missing quite a few things.

We might also need to add a few things upstream, such as Trilogy#closed? etc.

Also, Trilogy raises a lot of different base exceptions, so it's very tricky to list them all. We might need to resort to considering all IOError and SystemCallError subclasses.
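As a rough illustration of that fallback (a sketch only, not the adapter's final API; the constant and helper names below are made up), the adapter could treat any IOError or SystemCallError subclass as a connection-level failure:

CONNECTION_ERROR_CLASSES = [IOError, SystemCallError].freeze

def connection_error?(error)
  # SystemCallError is the parent of every Errno::* exception
  # (ETIMEDOUT, ECONNREFUSED, ECONNRESET, ...).
  CONNECTION_ERROR_CLASSES.any? { |klass| error.is_a?(klass) }
end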

Co-Authored-By: @adrianna-chang-shopify

casperisfine avatar Oct 03 '22 15:10 casperisfine

PR to add #closed? on the Trilogy Ruby binding: https://github.com/github/trilogy/pull/30

paarthmadan avatar Oct 05 '22 20:10 paarthmadan

Okay, tests are now passing for the Trilogy test suite 🎉 Handling the different base exceptions Trilogy raises is a bit tricky; we might need to do some work upstream before we can ship the adapter. I made some notes on the different situations we get into, and would love feedback if anyone has time.

Expected cases

Query error

  • e.g. client.query("ERROR!")
  • Produces Trilogy::Error => trilogy_query_recv: 1064 You have an error in your SQL syntax;
  • Should not open circuit
  • See test_query_errors_does_not_open_the_circuit

Read timeout

  • e.g. client.query("SELECT sleep(5)")
  • Produces ::Errno::ETIMEDOUT
  • Should open the circuit
  • See test_read_timeout_error_opens_the_circuit (roughly sketched after this list)

Connection errors due to network latency

  • Simulate with proxy latency, e.g. @proxy.downstream(:latency, latency: 2200).apply
  • Produces ::Errno::ETIMEDOUT
  • Should open the circuit
  • See test_connection_errors_opens_the_circuit

Network errors

  • Simulate with @proxy.down
  • Produces Trilogy::Error => [mysql_testing] trilogy_query_recv: TRILOGY_CLOSED_CONNECTION
  • Should open the circuit
  • See test_network_errors_are_tagged_with_the_resource_identifier

Errno::ECONNREFUSED

  • Occurs if we try to connect to the db when the network is down, e.g. @proxy.down
  • Should result in the circuit opening
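Purely as illustration, the read-timeout case above looks roughly like this as a test (connect_to_mysql!, ERROR_THRESHOLD, and Trilogy::CircuitOpenError are the helpers and constants used in the test snippets later in this thread; the exact assertions are an approximation, not the committed test):

def test_read_timeout_error_opens_the_circuit
  client = connect_to_mysql!

  # Each slow query times out and counts as a failure against the resource.
  ERROR_THRESHOLD.times do
    assert_raises(Errno::ETIMEDOUT) do
      client.query("SELECT sleep(5)")
    end
  end

  # Once the threshold is reached, the circuit opens and rejects queries fast.
  assert_raises(Trilogy::CircuitOpenError) do
    client.query("SELECT 1;")
  end
end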

Edge cases (looking for feedback)

Connection reset by peer

  • Errno::ECONNRESET: Connection reset by peer - trilogy_query_send
  • Occurs if too much time elapsed since connection was used
  • I don’t think this should cause the circuit to open, since this is a “fail fast” situation
  • See test_changes_timeout_when_half_open_and_configured:
diff --git a/test/adapters/trilogy_test.rb b/test/adapters/trilogy_test.rb
index 7b11ae0..8f56c21 100644
--- a/test/adapters/trilogy_test.rb
+++ b/test/adapters/trilogy_test.rb
@@ -356,7 +356,9 @@ class TestTrilogy < Minitest::Test
       # Trilogy has reset the connection request at this point, so need to reconnect
       # Circuit breaker is only created once with half_open_resource_timeout, so
       # no need to pass that through again
-      connect_to_mysql!
+      # connect_to_mysql!
+      # If we try to use the client here, we'll see connection reset by peer error
+      client.query("SELECT 1;")
     end

IOError

  • Occurs when attempting to use a client where the connection to the server has explicitly been closed (via #close)
  • Don't think it should result in the circuit opening; it's also a "fast failure" error
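For reference, this is the situation described (a trivial repro sketch; the connection options are placeholders):

client = Trilogy.new(host: "127.0.0.1", port: 3306, username: "root")
client.close
client.query("SELECT 1;") # raises IOError: the connection was explicitly closed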

Trilogy::Error

  • [mysql_testing] trilogy_query_send: TRILOGY_CLOSED_CONNECTION
  • I'm also seeing this error pop up after ::Errno::ETIMEDOUT exceptions -- seems like Trilogy closes the connection
  • The tricky thing is that I don't think this type of error should result in the circuit opening, should it? But I'm not sure how to differentiate between this TRILOGY_CLOSED_CONNECTION exception and the one raised above when the network was down. I haven't dug too deeply into the Trilogy source yet, but maybe we'll need two different error types to differentiate these two situations
  • See test_read_timeout_error_opens_the_circuit:
diff --git a/test/adapters/trilogy_test.rb b/test/adapters/trilogy_test.rb
index 7b11ae0..2a2eae7 100644
--- a/test/adapters/trilogy_test.rb
+++ b/test/adapters/trilogy_test.rb
@@ -88,7 +88,8 @@ class TestTrilogy < Minitest::Test
     # After Trilogy::CircuitOpenError check regular queries are working fine.
     result = Timecop.travel(ERROR_TIMEOUT + 1) do
       # Reconnect because trilogy closed the connection
-      client = connect_to_mysql!
+      # If we don't reconnect here we'll see Trilogy::Error
+      # client = connect_to_mysql!
       client.query("SELECT 1 + 1;")
     end

Also wondering about the difference between this and the "reset by peer" exception 🤔

One other interesting scenario I encountered: There's a strange interaction with the circuit being half-opened, a half_open_resource_timeout being set, and us needing to re-establish the connection with MySQL. This test demonstrates the issue:

def test_changes_timeout_when_half_open_and_configured
    client = connect_to_mysql!(half_open_resource_timeout: 0.5)

    # Timeout is 2 seconds, this causes circuit to open
    @proxy.downstream(:latency, latency: 3000).apply do
      ERROR_THRESHOLD.times do
        assert_raises(Errno::ETIMEDOUT) do
          client.query("SELECT 1 + 1;")
        end
      end
    end

    assert_raises(Trilogy::CircuitOpenError) do
      client.query("SELECT 1 + 1;")
    end

    client = Timecop.travel(ERROR_TIMEOUT + 1) do
      connect_to_mysql! # <=============== THIS LINE HERE
    end

    @proxy.downstream(:latency, latency: 750).apply do
      assert_raises(Errno::ETIMEDOUT) do
        client.query("SELECT 1 + 1;")
      end
    end

    Timecop.travel(ERROR_TIMEOUT + 1) do
      # Need to reconnect again because Trilogy reset conn request
      client = connect_to_mysql!
      client.query("SELECT 1 + 1;")

      # Timeout has reset to the normal 2 seconds now that circuit is closed
      @proxy.downstream(:latency, latency: 1500).apply do
        client.query("SELECT 1 + 1;")
      end
    end

    assert_equal(2, client.read_timeout)
    assert_equal(2, client.write_timeout)
  end

If you look at the line I've marked, we need to reconnect to MySQL because Trilogy has closed the connection. The circuit is half open, so on calling #initialize and attempting to connect to the db, we go through with_resource_timeout, which uses self.read_timeout to fetch the current read timeout for the client. The problem is that we're doing this on the already-closed connection (remember, we're in the process of reconnecting), so this produces an IOError.

I think eventually we'll want to check if the connection is #closed? and yield the block anyway, so that we can always connect via Trilogy.new without the previously-closed connection causing problems. To get around this in the meantime, I've opted to rescue IOError, check if the error is related to the connection being closed, and if it is, yield the block anyway -- either we're in the process of reconnecting (Trilogy.new), so we'll disregard the IOError and reconnect properly, or we're acquiring some other resource, and yielding the block will likely re-raise:

    def with_resource_timeout(temp_timeout)
      prev_read_timeout = read_timeout
      self.read_timeout = temp_timeout
      yield
    rescue IOError => error
      # Only swallow the "connection closed" IOError described above;
      # any other IOError should still propagate.
      raise unless error.message.match?(/connection closed/)

      # Either we're mid-reconnect (Trilogy.new) and the closed connection can
      # be ignored, or yielding again will simply re-raise.
      yield
    ensure
      self.read_timeout = prev_read_timeout unless prev_read_timeout.nil?
    end

Does this make sense?

In general, Trilogy seems to be eager about closing unused connections (unlike Mysql2 AFAICT), so we might need to be more defensive in the adapter code.

Would love your thoughts @paarthmadan @casperisfine

adrianna-chang-shopify avatar Oct 07 '22 18:10 adrianna-chang-shopify

Connection reset by peer

This tells us the connection was closed on the server side. I'm not certain whether we should consider it a network error. @miry any opinion?

IOError

The case you found indeed shouldn't open the circuit; my worry is that this error could also be raised in other contexts, but that's hard to tell.

Trilogy::Error

This is Trilogy's "kitchen sink" error; as of today we have no choice but to match on the error message to distinguish the cases. We should make a PR upstream to break this error down into logical subclasses, which can be done with perfect backward compatibility.
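To illustrate why that upstream refactor can be backward compatible (the class names below are hypothetical, not Trilogy's actual hierarchy): new subclasses would inherit from the existing base error, so every current rescue Trilogy::Error call site keeps working, while new code can rescue the narrower class.

# Hypothetical sketch only -- not the real Trilogy class hierarchy.
module Trilogy
  class Error < StandardError; end              # today's "kitchen sink" error
  class ClosedConnectionError < Error; end      # e.g. TRILOGY_CLOSED_CONNECTION
  class QueryError < Error; end                 # e.g. server-side SQL errors
end

begin
  raise Trilogy::ClosedConnectionError, "trilogy_query_send: TRILOGY_CLOSED_CONNECTION"
rescue Trilogy::Error => e
  # Existing `rescue Trilogy::Error` call sites still catch the new subclasses,
  # which is what makes the refactor backward compatible.
  puts e.class
end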

I'm also seeing this error pop up after ::Errno::ETIMEDOUT exceptions -- seems like Trilogy closes the connection

Yes, in most protocols, when you time out you can't be certain what the server actually read, so it's safer to start from a clean connection.

But I'm not sure how to differentiate between this TRILOGY_CLOSED_CONNECTION exception and the one raised above when the network was down...

Server side vs client side. Connection reset by peer means we received a TCP RST from the server, notifying us that the server closed the connection, so we can't write into it anymore.

TRILOGY_CLOSED_CONNECTION means the client closed the connection, e.g. File.open("/tmp/test.txt", "w+").tap(&:close).write("Test")

It's a bit inconsistent for Trilogy to raise both IOError and Trilogy::Error TRILOGY_CLOSED_CONNECTION on a closed connection client side though...

In general, Trilogy seems to be eager about closing unused connections (unlike Mysql2 AFAICT), so we might need to be more defensive in the adapter code.

This might in large part be because Mysql2 has a reconnect option. I don't know if Trilogy has the same option or if we need to drop the whole client and create a new one. That difference might require some changes.
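To make the contrast concrete (the connection options shown are illustrative, not a verified configuration): Mysql2 can be told to reconnect transparently, whereas with Trilogy the only way to get a fresh connection is to build a new client, which is what the adapter has to work around.

require "mysql2"
require "trilogy"

# Mysql2 exposes a reconnect option on the client.
mysql2 = Mysql2::Client.new(host: "127.0.0.1", username: "root", reconnect: true)

# Trilogy (at least at the time of writing) has no equivalent: once the
# connection is gone, you throw the client away and create a new one.
trilogy = Trilogy.new(host: "127.0.0.1", port: 3306, username: "root")
trilogy.close
trilogy = Trilogy.new(host: "127.0.0.1", port: 3306, username: "root")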

casperisfine avatar Oct 10 '22 08:10 casperisfine

I don’t think this should cause the circuit to open, since this is a “fail fast” situation

This tells us the connection was closed on the server side. I'm not certain wether we should consider it as a network error. @miry any opinion?

There are various reasons for TCP/IP sending a reset, most of them likely unrelated to an overloaded DB or a network disturbance. Counting it would open the circuit breaker too often. I would ignore Errno::ECONNRESET in Semian.

Something to note: ECONNRESET can also show up when using Toxiproxy together with the Docker proxy. The Docker proxy always keeps the port open, so if Toxiproxy enables the "down" toxic, the client can get ECONNRESET.

miry avatar Oct 10 '22 11:10 miry

Something to note: ECONNRESET can also show up when using Toxiproxy together with the Docker proxy. The Docker proxy always keeps the port open, so if Toxiproxy enables the "down" toxic, the client can get ECONNRESET.

Yes, that's a very annoying thing when doing Toxiproxy testing with a Docker-based CI or dev env.

casperisfine avatar Oct 10 '22 12:10 casperisfine

Thanks for the feedback, both of you! I'll make some adjustments, but it does seem like we need to refactor the error classes upstream before proceeding with this.

Something to note: ECONNRESET can also show up when using Toxiproxy together with the Docker proxy. The Docker proxy always keeps the port open, so if Toxiproxy enables the "down" toxic, the client can get ECONNRESET.

Ah, interesting! Am I correct in interpreting this to mean that the ECONNRESET we're seeing in the tests (i.e. what was described in #test_changes_timeout_when_half_open_and_configured) does not correspond to behaviour we'd see in production, and is purely a side effect of testing in a Docker-based environment?

adrianna-chang-shopify avatar Oct 18 '22 20:10 adrianna-chang-shopify

Okay, I think this proof of concept is mostly in a finished state. A couple of things to note:

  • The fact that Trilogy raises Trilogy::Error TRILOGY_CLOSED_CONNECTION when the network is down is problematic. Trilogy is encountering some error (due to the network being down), closing the connection in response, and that's the error we see -- we need to dig into this upstream. The original error should be surfaced.
  • As previously discussed, using Docker for testing means that we'll see potentially misleading ECONNRESET errors, because the Docker proxy is always listening in order to do port forwarding. Jean mentioned it might be worth refactoring the test suite to have Toxiproxy run in the same container as the tests, so that our tests simulate prod errors more closely. (cc @casperisfine -- I know you mentioned having done something similar in redis-client, is there a PR you can link? I went digging in https://github.com/redis-rb/redis-client but couldn't find anything)
  • We can't do half_open_resource_timeout if we build Semian into the client layer, because Trilogy forces us to create a new client every time we need to connect to the db (there's no way to reconnect). Consequently, if we're instantiating a new client because we lost the connection to the old one, and the circuit is in a half-open state, there's no way for us to grab the read_timeout off the previous (now closed) client. For now, it's probably okay to disregard this feature, but this is something that might push us to build Semian into the adapter layer instead.

Next steps:

  • We should experiment with writing Semian into the adapter layer.
    • Pros:
      • Don't have to deal with messy Trilogy errors
      • Don't have to worry about half open timeout stuff
      • Gives us more confidence in testing
      • Could possibly share code between Trilogy and Mysql2 adapter implementations
    • Downsides:
      • Means re-implementing Semian in SFR
  • We should follow up with GitHub to see if they have plans to move the error class work forward. If they aren't planning to get to it any time soon, we'll likely want to patch the Trilogy adapter instead to avoid the headache of the error classes.

adrianna-chang-shopify avatar Oct 20 '22 19:10 adrianna-chang-shopify

  • (cc @casperisfine -- I know you mentioned having done something similar in redis-client, is there a PR you can link?

Not a specific PR because it kinda evolved this way after a while. But quickly:

  • I have a tiny script to automatically download toxiproxy: https://github.com/redis-rb/redis-client/blob/4c8e05acfb3477c1651138a4924616e79e6116f2/bin/install-toxiproxy, it's a static binary, so super easy.
  • Then I have some test support code to spawn toxiproxy, load the config and shut it down at the end of the test suite: https://github.com/redis-rb/redis-client/blob/4c8e05acfb3477c1651138a4924616e79e6116f2/test/support/servers.rb#L154-L202
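The gist of that support code is roughly the following (the binary path, options, and lifecycle hooks below are assumptions for illustration; see the linked redis-client files for the real implementation):

require "minitest/autorun"

# Assumed location of the downloaded static binary (see the install script above).
TOXIPROXY_BIN = File.expand_path("../bin/toxiproxy-server", __dir__)

# Start toxiproxy before the suite runs and shut it down when the run finishes.
toxiproxy_pid = Process.spawn(TOXIPROXY_BIN, out: File::NULL, err: File::NULL)
Minitest.after_run do
  Process.kill("TERM", toxiproxy_pid)
  Process.wait(toxiproxy_pid)
end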

casperisfine avatar Oct 20 '22 20:10 casperisfine

Hi @adrianna-chang-shopify

I won't be available for a while, so here are just a few things not to forget after you finish all the experiments and release the new adapter.

Test

I split the adapters into separate CI tasks. To load only the required gems per adapter, there is a folder with per-adapter gemfiles.

I expect to see trilogy.gemfile and trilogy.gemfile.lock.
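A minimal sketch of what such a gemfile could contain (the exact contents should mirror the existing adapter gemfiles in the repo, which I'm not reproducing here):

# gemfiles/trilogy.gemfile -- illustrative only; follow the existing gemfiles' pattern.
source "https://rubygems.org"

gemspec path: ".."

gem "trilogy"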

CI/CD

After that, you can tune https://github.com/Shopify/semian/blob/master/.github/workflows/test.yml#L112:

        gemfile:
          - grpc
          - mysql2
          - net_http
          - rails
          - redis_4
          - redis_5
          - redis_client
          - trilogy
        include:
          - gemfile: grpc
            adapter: grpc
          - gemfile: mysql2
            adapter: mysql2
          - gemfile: net_http
            adapter: net_http
          - gemfile: rails
            adapter: rails
          - gemfile: redis_4
            adapter: redis
          - gemfile: redis_5
            adapter: redis
          - gemfile: redis_client
            adapter: redis_client
          - gemfile: trilogy
            adapter: trilogy

Documentation

The README and CHANGELOG need to be updated.

miry avatar Oct 21 '22 16:10 miry

@casperisfine Should we close this PR?

miry avatar Apr 24 '23 10:04 miry