sqlx
Fix race condition in pool close (#3217)
Attempt to fix #3217.
@madadam if you rebase it should fix the CI failure.
Given that the PgListener test is consistently failing even after multiple re-runs, I'm wondering if there's some subtle problem with the fix here.
Finally found some time to look into this. The test was failing due to a deadlock: there was still one checked-out connection inside the PgListener, so Pool::close was waiting for it to be released, which never happened. The reason this was passing before is that the test accidentally relied on the old buggy behaviour of Pool::close, where it didn't always wait for all connections to close. I fixed the test, rebased against main and updated the PR.
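To make the deadlock concrete, here's a toy std-only model of the situation (all names are illustrative, not sqlx's actual types): "close" blocks until every checked-out connection has been dropped, exactly like the fixed Pool::close waiting on the connection held inside the PgListener.

```rust
use std::sync::mpsc;
use std::time::Duration;

/// Illustrative stand-in for a pooled connection: dropping it
/// "returns" it to the pool by signalling on a channel.
struct Connection {
    returned_tx: mpsc::Sender<()>,
}

impl Drop for Connection {
    fn drop(&mut self) {
        let _ = self.returned_tx.send(());
    }
}

fn main() {
    let (tx, rx) = mpsc::channel();
    let conn = Connection { returned_tx: tx };

    // While `conn` is alive (e.g. held inside a PgListener), a close
    // that waits for all connections to come back never completes:
    assert!(rx.recv_timeout(Duration::from_millis(100)).is_err());

    // Dropping the listener/connection lets close finish, which is
    // essentially what the test fix amounts to.
    drop(conn);
    assert!(rx.recv_timeout(Duration::from_millis(100)).is_ok());

    println!("pool closed cleanly");
}
```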
That's weird, now some of the migrations tests are timing out.
Yeah I noticed. I'll try to look into it when I can. Btw, how do you guys run these tests locally? I noticed that tests/x.py doesn't run the same test suite as what's run on the CI. In fact, I'm getting a compile error currently:
# unit test core
$ cargo test --no-default-features --manifest-path sqlx-core/Cargo.toml --features json,offline,migrate,_rt-async-std,_tls-rustls
warning: /home/adam/projects/sqlx/Cargo.toml: file `/home/adam/projects/sqlx/tests/sqlite/macros.rs` found to be present in multiple build targets:
* `integration-test` target `sqlite-macros`
* `integration-test` target `sqlite-unbundled-macros`
warning: /home/adam/projects/sqlx/sqlx-macros-core/Cargo.toml: unused manifest key: lints.rust.unexpected_cfgs.check-cfg
Compiling sqlx-core v0.8.3 (/home/adam/projects/sqlx/sqlx-core)
error[E0425]: cannot find value `provider` in this scope
--> sqlx-core/src/net/tls/tls_rustls.rs:107:54
|
107 | let config = ClientConfig::builder_with_provider(provider.clone())
| ^^^^^^^^ not found in this scope
Also, trying to run a single target using the --target option throws an exception:
# test postgres 17
Traceback (most recent call last):
File "/home/adam/projects/sqlx/tests/./x.py", line 179, in <module>
run(
File "/home/adam/projects/sqlx/tests/./x.py", line 90, in run
database_url = start_database(service, database="sqlite/sqlite.db" if service == "sqlite" else "sqlx", cwd=dir_tests)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/adam/projects/sqlx/tests/docker.py", line 24, in start_database
res = subprocess.run(
^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/subprocess.py", line 548, in run
with Popen(*popenargs, **kwargs) as process:
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.12/subprocess.py", line 1026, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.12/subprocess.py", line 1955, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'docker-compose'
Ok, I think the problem is that when a parent pool is used (which is the case in those failing tests), the child pool's semaphore is created with zero initial permits. So trying to acquire any permits on it in close causes a deadlock. I need to think about how to fix this.
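A minimal std-only sketch of why that deadlocks (the Semaphore here is a stand-in for the pool's async semaphore, not sqlx's real implementation): if close conceptually acquires all max_connections permits to wait for every connection to be returned, a semaphore that starts with zero permits can never satisfy it.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Toy counting semaphore on std primitives, illustrative only.
struct Semaphore {
    permits: Mutex<usize>,
    cv: Condvar,
}

impl Semaphore {
    fn new(permits: usize) -> Self {
        Self { permits: Mutex::new(permits), cv: Condvar::new() }
    }

    /// Try to take `n` permits at once, giving up after `timeout`.
    /// This models what close does: wait until all permits are back.
    fn acquire_many_timeout(&self, n: usize, timeout: Duration) -> bool {
        let guard = self.permits.lock().unwrap();
        let (mut guard, res) = self
            .cv
            .wait_timeout_while(guard, timeout, |p| *p < n)
            .unwrap();
        if res.timed_out() {
            false
        } else {
            *guard -= n;
            true
        }
    }
}

fn main() {
    let max_connections = 5;

    // A regular pool: the semaphore starts with `max_connections`
    // permits, so close can take all of them immediately.
    let parent = Semaphore::new(max_connections);
    assert!(parent.acquire_many_timeout(max_connections, Duration::from_millis(100)));

    // A child pool created with *zero* initial permits: close waits
    // for permits that are never released -- the deadlock.
    let child = Semaphore::new(0);
    assert!(!child.acquire_many_timeout(max_connections, Duration::from_millis(100)));

    println!("parent close succeeded, child close timed out as expected");
}
```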
@madadam I think we could just get rid of the parent/child pool thing. I've been conceptualizing a whole new architecture for Pool that it wouldn't fit into anyway.
Instead, we could just divide a default max_connections value, say, 64, by the number of test threads being spawned, and use a semaphore to lock that many permits at a time and give that many connections to each test (edit: actually, I'm not sure this is necessary, and it would seem to break when using nextest anyway).
We could use an environment variable, SQLX_TEST_MAX_CONNECTIONS to control the number of connections being divided up, and a control attribute to #[sqlx::test] to adjust the max_connections the pool should have (less or more).
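A rough sketch of what that division could look like. Note this is purely hypothetical: neither SQLX_TEST_MAX_CONNECTIONS nor this helper exists in sqlx today, and the real version would use the actual test-thread count rather than available parallelism.

```rust
use std::env;
use std::thread;

/// Hypothetical helper: split a global connection budget
/// (SQLX_TEST_MAX_CONNECTIONS, defaulting to 64) evenly across the
/// test threads so concurrent #[sqlx::test] cases don't exhaust the
/// database server.
fn per_test_max_connections() -> u32 {
    let budget: u32 = env::var("SQLX_TEST_MAX_CONNECTIONS")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(64);

    // Stand-in for "number of test threads being spawned".
    let test_threads = thread::available_parallelism()
        .map(|n| n.get() as u32)
        .unwrap_or(1);

    // Each test gets an equal share, but always at least one connection.
    (budget / test_threads).max(1)
}

fn main() {
    println!("each test would get {} connections", per_test_max_connections());
}
```

A #[sqlx::test] attribute argument could then override this per-test share for tests that genuinely need more (or fewer) connections.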
Re. tests/x.py, I don't personally use it and the CI doesn't use it, so it's at the mercy of someone bothering to update it when it breaks. I've been meaning to get rid of it, but some people find it useful so it's not an easy decision. I also don't know what I would replace it with. Justfile, maybe? If anything?
Being able to run the same tests CI performs locally would be awesome, but there's also the issue of having a single source of truth for the tests. If commands get added to x.py / the Justfile that aren't tested in CI, we have the same problem again. But I don't want CI to just be x.py --all-tests because that would have awful concurrency and wouldn't give great feedback on GitHub without setting up bots. So then adding a new test means adding it to the x.py/Justfile/whatever, and also adding it to CI.
https://github.com/nektos/act seems promising but it needs some tweaking since it doesn't support ubuntu-24.04 out of the box yet.
The top result I get from Reddit about locally runnable CI is "just use Makefiles"... gross.
I'm thinking it'd be really neat if cargo test just worked. Maybe using testcontainers.