Investigate Tokenserver db pool initialization failures on production
We’re currently unable to roll out the latest tokenserver release (0.17.1) due to a timeout error occurring during its db pool initialization where it establishes initial db connections.
This does not occur on previous releases (tested as far as 0.15.6) nor does it occur with the 0.17.1 release on stage.
┆Issue is synchronized with this Jira Task
➤ Philip Jenvey commented:
With the error no longer hidden by tokenserver we’re seeing:
Failed to create Tokenserver pool: DbError { kind: Mysql(MysqlError { kind: Pool(Error(None)), status: 500, backtrace: 0: <syncserver_db_common::error::MysqlError as core::convert::From<syncserver_db_common::error::MysqlErrorKind>>::from
This appears to be a timeout occurring while wait_for_initialization ( https://github.com/sfackler/r2d2/blob/6ab3302eeca82732a5e6c9385092fdb6e011f173/src/lib.rs#L399 ) waits to establish the initial connections
➤ Philip Jenvey commented:
The major differences between 0.15.6..0.17.1 ( https://github.com/mozilla-services/syncstorage-rs/compare/0.15.6...0.17.1 ):
-
browserid removal
-
Update from debian bullseye to bookworm
- and an associated update of mysql’s official libmysqlclient-dev C client lib. This likely bumped us from 8.0.37-1debian11 ( https://repo.mysql.com/apt/debian/dists/bullseye/mysql-8.0/binary-amd64/Packages ) to 8.0.39-1debian12 ( https://repo.mysql.com/apt/debian/dists/bookworm/mysql-8.0/binary-amd64/Packages ). There’s not a lot to speak of between these minor releases:
- https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-38.html#mysqld-8-0-38-capi ( https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-38.html#mysqld-8-0-38-capi )
- https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-39.html ( https://dev.mysql.com/doc/relnotes/mysql/8.0/en/news-8-0-39.html|smart-link )
- and an associated update of mysql’s official libmysqlclient-dev C client lib. This likely bumped us from 8.0.37-1debian11 ( https://repo.mysql.com/apt/debian/dists/bullseye/mysql-8.0/binary-amd64/Packages ) to 8.0.39-1debian12 ( https://repo.mysql.com/apt/debian/dists/bookworm/mysql-8.0/binary-amd64/Packages ). There’s not a lot to speak of between these minor releases:
-
Some dependency updates, but diesel has stayed the same (v1.4.8)
-
Fixing the threadpool issue ( https://github.com/mozilla-services/syncstorage-rs/pull/1560 ) seen in 0.15.6. I don’t see how this could be involved as the value is only passed to actix-web’s HttpServer builder, which doesn’t even occur until the db pool as succeed in initializing
-
Adding optional timeouts to db calls, but this setting is disabled
-
A number of other small changes that like the threadpool issue, don’t seem to be involved/a factor until after initialization has succeeded
➤ JR Conlin commented:
I’m guessing there are a few things we should also investigate:
- Is this version of Tokenserver able to connect to SQLStorage in the Stage environment as well? Is this a problem just in Prod?
- Can we access SQLStorage directly using some other application (ideally, without using the Google SDK, which may hide elements from us)? (There’s a sample ( https://github.com/GoogleCloudPlatform/python-docs-samples/blob/main/cloud-sql/mysql/sqlalchemy/connect_tcp.py ) that uses sqlalchemy to connect to a data store that we might be able to (ab)use.
- Do we need to use a secure connection (e.g. the Storage Proxy) with the new configuration?
- Can we use the rust Pool to connect to a different SQL store? (e.g. a locally running mariadb instance or something?)
➤ Philip Jenvey commented:
- It works on stage. Stage currently has 0.13.6 due to how we roll that back out to prod, but it successfully rolled out there and was working since it was tagged. E.g. https://mozilla.sentry.io/issues/5444090025/?project=6255075&query=is%3Aunresolved+issue.priority%3A[high%2C+medium ( https://mozilla.sentry.io/issues/5444090025/?project=6255075&query=is%3Aunresolved+issue.priority%3A[high%2C+medium )&referrer=issue-stream&statsPeriod=7d&stream_index=2|https://mozilla.sentry.io/issues/5444090025/?project=6255075&query=is%3Aunresolved+issue.priority%3A%5Bhigh%2C+medium%5D&referrer=issue-stream&statsPeriod=7d&stream_index=2]
- Not sure what you mean? Since the 0.13.6 release is working fine on prod, it can obviously access prod’s db
- No, we did not set prod’s db to require TLS
- Yup, see #1: stage’s db
➤ Philip Jenvey commented:
I’ll also note the underlying Pool impl crate did not change either (r2d2 v0.8.10)
➤ Philip Jenvey commented:
Unless I’m missing something, the only couple things I can think to try next are:
-
wait_for_initialization establishes min_idle.unwrap_or(max_size) number of connections, waiting a duration of connection_timeout . We set both of the formers to 100 connections and the timeout to 5 seconds. Does 0.17.1 suddenly require more time to establish them on prod for some reason? We could try setting min_idle to a very small value to see if it helps. Alternatively we could raise the timeout, but that timeout is also used as the Pool::get timeout which we really want to keep a small duration for. Note:
- Stage does differ here in that it only sets both min_idle/max_size to 20 (but with the same timeout of 5 seconds)
- Yet why wouldn’t we be able to establish 100 connections in 5 seconds? What’s the root cause?
-
Downgrade the underlying C API client libmysqlclient-dev back to the older version. This might require downgrading back to debian bullseye from bookworm.
RIght, I was just noting outside states that could have caused the problem we're seeing. The curious thing is that if 0.17.1 works correctly on stage, running bookworm connecting to the stage SQL Storage servers, yet the same configuration (0.17.1 running bookworm) does not communicate with the production servers, then I was wondering if the problem might be in our Pool code. If we have a second application (say one written in python) that is able to connect to the production SQL Storage servers from a server running bookworm, then we can isolate the problem to our Pool connection code. If we cannot connect to the production SQL Storage systems from the second application, there may be some other issue going on (e.g. some change in the C libraries or system configurations) which might be causing the problem.
➤ Philip Jenvey commented:
The Python mysqlclient ( https://pypi.org/project/mysqlclient/ ) also relies on the C libmysqlclient, so we could try such a test by seeing if any of the 0.17.1 scripts can connect (like process_account_events or purge_old_records)
➤ Philip Jenvey commented:
We were finally able to resolve initialization failures on startup via 2 different options:
- downgrading the mysqlclient (along with the Debian version)
- shrinking the minimum number of db connections (established on startup by the bb8 db pool) from the larger number of 100 (matching the max count) to the smaller value of 20 which is used on stage.
However solving this issue revealed the related issue of seeing excessive db timeouts after successful startup: https://mozilla-hub.atlassian.net/browse/SYNC-4541 ( https://mozilla-hub.atlassian.net/browse/SYNC-4541|smart-link )