semian
WIP: Adapter for GitHub's Trilogy MySQL client
This lays the groundwork for an adapter; it's still missing quite a few things. We might also need to add a few things upstream, such as `Trilogy#closed?`, etc.
Also, Trilogy raises a lot of different base exceptions, so it's very tricky to list them all. We might need to resort to considering all `IOError` and `SystemCallError` subclasses.
Co-Authored-By: @adrianna-chang-shopify
PR to add `#closed?` on the Trilogy Ruby binding: https://github.com/github/trilogy/pull/30
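To make the fallback idea concrete, here is a loose sketch of what matching on the base classes could look like. The module and method names are hypothetical, not the adapter's actual API; the real adapter would plug into Semian's error-marking hooks instead:

```ruby
module Semian
  module TrilogyErrorClassifier
    NETWORK_ERROR_CLASSES = [IOError, SystemCallError].freeze

    # ETIMEDOUT, ECONNREFUSED, ECONNRESET, EPIPE, ... are all SystemCallError
    # subclasses, so this single check covers the whole family.
    def self.resource_error?(error)
      NETWORK_ERROR_CLASSES.any? { |klass| error.is_a?(klass) }
    end
  end
end
```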
Okay, tests are now passing for the Trilogy test suite 🎉 Handling the different base exceptions Trilogy raises is a bit tricky; we might need to do some work upstream before we can ship the adapter. I made some notes on the different situations we get into; would love feedback if anyone has time.
**Expected cases**

**Query error**
- e.g. `client.query("ERROR!")`
- Produces `Trilogy::Error => trilogy_query_recv: 1064 You have an error in your SQL syntax;`
- Should not open the circuit
- See `test_query_errors_does_not_open_the_circuit`
**Read timeout**
- e.g. `client.query("SELECT sleep(5)")`
- Produces `::Errno::ETIMEDOUT`
- Should open the circuit
- See `test_read_timeout_error_opens_the_circuit`
**Connection errors due to network latency**
- Simulate with proxy latency, e.g. `@proxy.downstream(:latency, latency: 2200).apply`
- Produces `::Errno::ETIMEDOUT`
- Should open the circuit
- See `test_connection_errors_opens_the_circuit`
**Network errors**
- Simulate with `@proxy.down`
- Produces `Trilogy::Error => [mysql_testing] trilogy_query_recv: TRILOGY_CLOSED_CONNECTION`
- Should open the circuit
- See `test_network_errors_are_tagged_with_the_resource_identifier`
**Errno::ECONNREFUSED**
- Occurs if we try to connect to the db when the network is down, e.g. `@proxy.down`
- Should result in the circuit opening (the common test pattern for these circuit-opening cases is sketched just below)
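The circuit-opening cases above all follow roughly the same test shape. Here is a simplified sketch -- not the exact code from this PR; `connect_to_mysql!`, `ERROR_THRESHOLD`, and the latency value mirror the conventions used in the test snippets further down:

```ruby
def test_read_timeout_error_opens_the_circuit
  client = connect_to_mysql!

  # Push the resource past the error threshold with a slow downstream.
  @proxy.downstream(:latency, latency: 2200).apply do
    ERROR_THRESHOLD.times do
      assert_raises(Errno::ETIMEDOUT) { client.query("SELECT 1 + 1;") }
    end
  end

  # With the circuit open, the next query fails fast without hitting MySQL.
  assert_raises(Trilogy::CircuitOpenError) do
    client.query("SELECT 1 + 1;")
  end
end
```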
**Edge cases (looking for feedback)**

**Connection reset by peer**
- `Errno::ECONNRESET: Connection reset by peer - trilogy_query_send`
- Occurs if too much time has elapsed since the connection was last used
- I don't think this should cause the circuit to open, since this is a "fail fast" situation
- See `test_changes_timeout_when_half_open_and_configured`:
```diff
diff --git a/test/adapters/trilogy_test.rb b/test/adapters/trilogy_test.rb
index 7b11ae0..8f56c21 100644
--- a/test/adapters/trilogy_test.rb
+++ b/test/adapters/trilogy_test.rb
@@ -356,7 +356,9 @@ class TestTrilogy < Minitest::Test
       # Trilogy has reset the connection request at this point, so need to reconnect
       # Circuit breaker is only created once with half_open_resource_timeout, so
       # no need to pass that through again
-      connect_to_mysql!
+      # connect_to_mysql!
+      # If we try to use the client here, we'll see connection reset by peer error
+      client.query("SELECT 1;")
     end
```
**IOError**
- Occurs when attempting to use a client where the connection to the server has explicitly been closed (via `#close`)
- Don't think it should result in the circuit opening; also a "fast failure" error
**Trilogy::Error**
- `[mysql_testing] trilogy_query_send: TRILOGY_CLOSED_CONNECTION`
- I'm also seeing this error pop up after `::Errno::ETIMEDOUT` exceptions -- it seems like Trilogy closes the connection
- The tricky thing is that I don't think this type of error should result in the circuit opening, should it? But I'm not sure how to differentiate between this `TRILOGY_CLOSED_CONNECTION` exception and the one raised above when the network was down. I haven't dug too deeply into the Trilogy source yet, but maybe we'll need two different types of errors to differentiate these two situations.
- See `test_read_timeout_error_opens_the_circuit`:
```diff
diff --git a/test/adapters/trilogy_test.rb b/test/adapters/trilogy_test.rb
index 7b11ae0..2a2eae7 100644
--- a/test/adapters/trilogy_test.rb
+++ b/test/adapters/trilogy_test.rb
@@ -88,7 +88,8 @@ class TestTrilogy < Minitest::Test
     # After Trilogy::CircuitOpenError check regular queries are working fine.
     result = Timecop.travel(ERROR_TIMEOUT + 1) do
       # Reconnect because trilogy closed the connection
-      client = connect_to_mysql!
+      # If we don't reconnect here we'll see Trilogy::Error
+      # client = connect_to_mysql!
       client.query("SELECT 1 + 1;")
     end
```
Also wondering about the difference between this and the "reset by peer" exception 🤔
One other interesting scenario I encountered: there's a strange interaction between the circuit being half-open, a `half_open_resource_timeout` being set, and us needing to re-establish the connection with MySQL. This test demonstrates the issue:
```ruby
def test_changes_timeout_when_half_open_and_configured
  client = connect_to_mysql!(half_open_resource_timeout: 0.5)

  # Timeout is 2 seconds, this causes circuit to open
  @proxy.downstream(:latency, latency: 3000).apply do
    ERROR_THRESHOLD.times do
      assert_raises(Errno::ETIMEDOUT) do
        client.query("SELECT 1 + 1;")
      end
    end
  end

  assert_raises(Trilogy::CircuitOpenError) do
    client.query("SELECT 1 + 1;")
  end

  client = Timecop.travel(ERROR_TIMEOUT + 1) do
    connect_to_mysql! # <=============== THIS LINE HERE
  end

  @proxy.downstream(:latency, latency: 750).apply do
    assert_raises(Errno::ETIMEDOUT) do
      client.query("SELECT 1 + 1;")
    end
  end

  Timecop.travel(ERROR_TIMEOUT + 1) do
    # Need to reconnect again because Trilogy reset conn request
    client = connect_to_mysql!
    client.query("SELECT 1 + 1;")

    # Timeout has reset to the normal 2 seconds now that circuit is closed
    @proxy.downstream(:latency, latency: 1500).apply do
      client.query("SELECT 1 + 1;")
    end
  end

  assert_equal(2, client.read_timeout)
  assert_equal(2, client.write_timeout)
end
```
If you look at the line I've marked, we need to reconnect to MySQL because Trilogy has closed the connection.
The circuit is half open, so on calling `#initialize` and attempting to connect to the db, we go through `with_resource_timeout`, which uses `self.read_timeout` to fetch the current read timeout for the client. The problem is that we're doing this on the already-closed connection (remember, we're in the process of reconnecting), so this produces an `IOError`.

I think eventually we'll want to check whether the connection is `#closed?` and yield the block anyway, so that we can always connect via `Trilogy.new` without the previously-closed connection causing problems. To get around this in the meantime, I've opted to rescue `IOError`, check if the error is related to the connection being closed, and if it is, yield the block anyway -- either we're in the process of reconnecting (`Trilogy.new`), so we'll disregard the `IOError` and reconnect properly, or we're acquiring some other resource, in which case yielding the block will likely re-raise:
```ruby
def with_resource_timeout(temp_timeout)
  prev_read_timeout = read_timeout
  self.read_timeout = temp_timeout
  yield
rescue IOError => error
  raise unless error.message.match?(/connection closed/)

  yield
ensure
  self.read_timeout = prev_read_timeout unless prev_read_timeout.nil?
end
```
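For comparison, here is a rough sketch of the eventual `#closed?`-based approach mentioned above, assuming the predicate from the linked upstream PR lands (hypothetical, not part of this change):

```ruby
def with_resource_timeout(temp_timeout)
  # Hypothetical: relies on Trilogy#closed? existing upstream.
  # If the connection is already gone (e.g. we're mid-reconnect), there is
  # no timeout to swap out, so just run the block.
  return yield if closed?

  prev_read_timeout = read_timeout
  self.read_timeout = temp_timeout
  yield
ensure
  self.read_timeout = prev_read_timeout unless prev_read_timeout.nil?
end
```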
Does this make sense?
In general, Trilogy seems to be eager about closing unused connections (unlike Mysql2 AFAICT), so we might need to be more defensive in the adapter code.
Would love your thoughts @paarthmadan @casperisfine
> Connection reset by peer

This tells us the connection was closed on the server side. I'm not certain whether we should consider it a network error. @miry, any opinion?
> IOError

The case you found indeed shouldn't open the circuit. My worry is that this error could also be raised in other contexts, but that's hard to tell.
> Trilogy::Error

This is Trilogy's "kitchen sink" error; as of today we have no choice but to match the error message to distinguish them. We should make a PR upstream to break this error down into logical subclasses, which can be done with perfect backward compatibility.
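As a rough illustration of that backward-compatible breakdown (class names here are made up, not an agreed-upon upstream API):

```ruby
# Hypothetical subclasses: anything that currently rescues Trilogy::Error
# keeps working, because the new classes inherit from it.
class Trilogy
  class ProtocolError < Error; end           # e.g. query-level failures
  class ConnectionClosedError < Error; end   # e.g. TRILOGY_CLOSED_CONNECTION
end

begin
  client.query("SELECT 1 + 1;")
rescue Trilogy::ConnectionClosedError
  # connection-level failure: a candidate for opening the circuit
rescue Trilogy::Error
  # everything else is still rescuable exactly as before
end
```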
> I'm also seeing this error pop up after ::Errno::ETIMEDOUT exceptions -- seems like Trilogy closes the connection

Yes, in most protocols, when you time out you can't be certain what the server actually read or not, so it's safer to start from a clean connection.
> But I'm not sure how to differentiate between this TRILOGY_CLOSED_CONNECTION exception and the one raised above when the network was down..

Server side vs client side. `Connection reset by peer` means we received a TCP `FIN` from the server, notifying us that the server closed the connection, so we can't write into it anymore.

`TRILOGY_CLOSED_CONNECTION` means the client closed the connection, e.g. `File.open("/tmp/test.txt", "w+").tap(&:close).write("Test")`.

It's a bit inconsistent for Trilogy to raise both `IOError` and `Trilogy::Error TRILOGY_CLOSED_CONNECTION` on a connection closed client side, though...
> In general, Trilogy seems to be eager about closing unused connections (unlike Mysql2 AFAICT), so we might need to be more defensive in the adapter code.

This might in large part be because Mysql2 has a `reconnect` option. I don't know if Trilogy has the same option or if we need to drop the whole client and create a new one. That difference might require some changes.
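For context, the difference being described looks roughly like this (connection options are illustrative and not verified against either client):

```ruby
# Mysql2 can be told to transparently reconnect when the connection drops.
mysql2 = Mysql2::Client.new(host: "127.0.0.1", username: "root", reconnect: true)

# With Trilogy, the working assumption in this thread is that recovery means
# discarding the client and building a new one.
client = Trilogy.new(host: "127.0.0.1", port: 3306)
begin
  client.query("SELECT 1 + 1;")
rescue Trilogy::Error, IOError, SystemCallError
  client = Trilogy.new(host: "127.0.0.1", port: 3306) # fresh connection
end
```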
> I don't think this should cause the circuit to open, since this is a "fail fast" situation

> This tells us the connection was closed on the server side. I'm not certain whether we should consider it a network error. @miry, any opinion?

There are various reasons for TCP/IP sending a reset, most of them likely not related to an overloaded DB or a network disturbance. It would open the circuit breaker too often. I would ignore `Errno::ECONNRESET` in Semian.
Something to note: you can also get `ECONNRESET` when using Toxiproxy together with the Docker proxy. The Docker proxy always keeps the published port open, so if Toxiproxy enables the "down" toxic, it can return `ECONNRESET`.
> Something to note: you can also get ECONNRESET when using Toxiproxy together with the Docker proxy. The Docker proxy always keeps the published port open, so if Toxiproxy enables the "down" toxic, it can return ECONNRESET.

Yes, that's a very annoying thing when doing Toxiproxy testing with a Docker-based CI or dev env.
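To make that concrete, this is roughly how it shows up in a Docker-based environment (the assertion is illustrative; which error you see depends on whether Docker's userland proxy sits in front of the port):

```ruby
# With Docker still listening on the published port, the "down" toxic may
# surface as ECONNRESET instead of the ECONNREFUSED you'd expect in production.
@proxy.down do
  assert_raises(Errno::ECONNREFUSED, Errno::ECONNRESET) do
    connect_to_mysql!
  end
end
```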
Thanks for the feedback, both of you! I'll make some adjustments, but it does seem like we need to refactor the error classes upstream before proceeding with this.
> Something to note: you can also get ECONNRESET when using Toxiproxy together with the Docker proxy. The Docker proxy always keeps the published port open, so if Toxiproxy enables the "down" toxic, it can return ECONNRESET.

Ah, interesting! Am I correct in interpreting this to mean that the `ECONNRESET` we're seeing in the tests (i.e. what was described in `#test_changes_timeout_when_half_open_and_configured`) does not correspond to behaviour we'd see in production, and is purely a side effect of testing with a Docker-based environment?
Okay, I think this proof of concept is mostly in a finished state. A couple of things to note:

- The fact that Trilogy raises `Trilogy::Error TRILOGY_CLOSED_CONNECTION` when the network is down is problematic. Trilogy is encountering some error (due to the network being down), closing the connection in response, and that's the error we see -- we need to dig into this upstream. The original error should be surfaced.
- As previously discussed, using Docker for testing means that we'll see potentially misleading `ECONNRESET` errors due to the fact that Docker is always listening in order to do port forwarding. Jean mentioned it might be worth refactoring the test suite to have Toxiproxy run in the same container as the tests, so that our tests simulate prod errors more closely. (cc @casperisfine -- I know you mentioned having done something similar in redis-client, is there a PR you can link? I went digging in https://github.com/redis-rb/redis-client but couldn't find anything.)
- We can't do `half_open_resource_timeout` if we build Semian into the client layer, because Trilogy forces us to create a new client every time we need to connect to the db (there's no way to reconnect). Consequently, if we're instantiating a new client because we lost the connection to the old one, and the circuit is in a half-open state, there's no way for us to grab the `read_timeout` off the previous (now closed) client. For now, it's probably okay to disregard this feature, but this is something that might push us to build Semian into the adapter layer instead.
Next steps:

- We should experiment with writing Semian into the adapter layer.
  - Pros:
    - Don't have to deal with messy Trilogy errors
    - Don't have to worry about half-open timeout stuff
    - Gives us more confidence in testing
    - Could possibly share code between the Trilogy and Mysql2 adapter implementations
  - Downsides:
    - Means re-implementing Semian in SFR
- We should follow up with GitHub to see if they have plans to move the error class work forward. If they aren't planning to get to this any time soon, we'll likely want to patch the Trilogy adapter instead to avoid the headache of the error classes.
> (cc @casperisfine -- I know you mentioned having done something similar in redis-client, is there a PR you can link?

Not a specific PR, because it kind of evolved this way after a while. But quickly:

- I have a tiny script to automatically download toxiproxy: https://github.com/redis-rb/redis-client/blob/4c8e05acfb3477c1651138a4924616e79e6116f2/bin/install-toxiproxy -- it's a static binary, so super easy.
- Then I have some test-support code to spawn toxiproxy, load the config, and shut it down at the end of the test suite: https://github.com/redis-rb/redis-client/blob/4c8e05acfb3477c1651138a4924616e79e6116f2/test/support/servers.rb#L154-L202 (a minimal sketch of this kind of bootstrapping follows below).
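A minimal sketch of that bootstrapping, assuming a downloaded `toxiproxy-server` binary and the default API port; the real code lives at the links above:

```ruby
require "net/http"

TOXIPROXY_BIN = File.expand_path("../bin/toxiproxy-server", __dir__)

def boot_toxiproxy
  pid = Process.spawn(TOXIPROXY_BIN, out: File::NULL, err: File::NULL)
  at_exit { Process.kill("TERM", pid) rescue nil }

  # Wait for the HTTP API (default port 8474) before loading the proxy config.
  50.times do
    Net::HTTP.get(URI("http://127.0.0.1:8474/version"))
    return pid
  rescue Errno::ECONNREFUSED, Errno::ECONNRESET
    sleep 0.1
  end

  raise "toxiproxy-server did not boot in time"
end
```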
Hi @adrianna-chang-shopify,

I won't be available for a while, so here are a few things not to forget after you finish all the experiments and release the new adapter.

**Test**

I split running the adapters into separate CI tasks. To load only the required gems per adapter, there is a folder with per-adapter gemfiles; I expect to see `trilogy.gemfile` and `trilogy.gemfile.lock` there.
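Something along these lines, mirroring the existing per-adapter gemfiles (the exact contents are a guess, not copied from the repo):

```ruby
# gemfiles/trilogy.gemfile (hypothetical)
source "https://rubygems.org"

gemspec path: ".."

gem "trilogy"
```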
**CI/CD**

Afterwards, you can tune https://github.com/Shopify/semian/blob/master/.github/workflows/test.yml#L112:
```yaml
gemfile:
  - grpc
  - mysql2
  - net_http
  - rails
  - redis_4
  - redis_5
  - redis_client
  - trilogy
include:
  - gemfile: grpc
    adapter: grpc
  - gemfile: mysql2
    adapter: mysql2
  - gemfile: net_http
    adapter: net_http
  - gemfile: rails
    adapter: rails
  - gemfile: redis_4
    adapter: redis
  - gemfile: redis_5
    adapter: redis
  - gemfile: redis_client
    adapter: redis_client
  - gemfile: trilogy
    adapter: trilogy
```
**Documentation**

The README and CHANGELOG need to be updated.
@casperisfine Should we close this PR?