
Fix race condition in pool close (#3217)

Open madadam opened this issue 1 year ago • 9 comments

Attempt to fix #3217.

madadam avatar Jun 19 '24 09:06 madadam

@madadam if you rebase it should fix the CI failure.

abonander avatar Jul 16 '24 07:07 abonander

Given that the PgListener test is consistently failing even after multiple re-runs, I'm wondering if there's some subtle problem with the fix here.

abonander avatar Sep 15 '24 19:09 abonander

Finally found some time to look into this. The test was failing due to a deadlock: there was still one checked-out connection inside the PgListener, so Pool::close was waiting for it to be released, which never happened. The test passed before only because it accidentally relied on the old buggy behaviour of Pool::close, which didn't always wait for all connections to close. I fixed the test, rebased against main, and updated the PR.
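
To illustrate the failure mode, here is a minimal sketch using only the standard library. It is not sqlx's code: `ToyPool` and its methods are hypothetical stand-ins that only count checked-out connections, and `close` takes a timeout so that the "deadlock" shows up as a `false` return instead of hanging.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

/// Hypothetical toy pool: tracks only how many connections are checked out.
struct ToyPool {
    checked_out: Mutex<usize>,
    released: Condvar,
}

impl ToyPool {
    fn new() -> Arc<Self> {
        Arc::new(ToyPool {
            checked_out: Mutex::new(0),
            released: Condvar::new(),
        })
    }

    /// Check out one connection.
    fn acquire(&self) {
        *self.checked_out.lock().unwrap() += 1;
    }

    /// Return one connection and wake anyone waiting in `close`.
    fn release(&self) {
        *self.checked_out.lock().unwrap() -= 1;
        self.released.notify_all();
    }

    /// Wait until every connection has been returned.
    /// Returns `false` if the timeout elapses first (a real close would hang).
    fn close(&self, timeout: Duration) -> bool {
        let guard = self.checked_out.lock().unwrap();
        let (_guard, result) = self
            .released
            .wait_timeout_while(guard, timeout, |n| *n > 0)
            .unwrap();
        !result.timed_out()
    }
}

/// The listener still holds its connection when close is called.
fn close_while_listener_holds_connection() -> bool {
    let pool = ToyPool::new();
    pool.acquire(); // held by the "listener", never released
    pool.close(Duration::from_millis(50))
}

/// The listener's connection is returned before close is called.
fn close_after_listener_dropped() -> bool {
    let pool = ToyPool::new();
    pool.acquire();
    pool.release();
    pool.close(Duration::from_millis(50))
}
```

In this model, `close_while_listener_holds_connection()` times out while `close_after_listener_dropped()` completes, which mirrors why the test had to release the listener's connection before closing the pool.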

madadam avatar Feb 06 '25 12:02 madadam

That's weird, now some of the migrations tests are timing out.

abonander avatar Feb 09 '25 20:02 abonander

Yeah I noticed. I'll try to look into it when I can. Btw, how do you guys run these tests locally? I noticed that tests/x.py doesn't run the same test suite as what's run on the CI. In fact, I'm getting a compile error currently:

# unit test core
 $ cargo test --no-default-features --manifest-path sqlx-core/Cargo.toml --features json,offline,migrate,_rt-async-std,_tls-rustls 
warning: /home/adam/projects/sqlx/Cargo.toml: file `/home/adam/projects/sqlx/tests/sqlite/macros.rs` found to be present in multiple build targets:
  * `integration-test` target `sqlite-macros`
  * `integration-test` target `sqlite-unbundled-macros`
warning: /home/adam/projects/sqlx/sqlx-macros-core/Cargo.toml: unused manifest key: lints.rust.unexpected_cfgs.check-cfg
   Compiling sqlx-core v0.8.3 (/home/adam/projects/sqlx/sqlx-core)
error[E0425]: cannot find value `provider` in this scope
   --> sqlx-core/src/net/tls/tls_rustls.rs:107:54
    |
107 |     let config = ClientConfig::builder_with_provider(provider.clone())
    |                                                      ^^^^^^^^ not found in this scope

Also, trying to run a single target using the --target option throws an exception:

# test postgres 17
Traceback (most recent call last):
  File "/home/adam/projects/sqlx/tests/./x.py", line 179, in <module>
    run(
  File "/home/adam/projects/sqlx/tests/./x.py", line 90, in run
    database_url = start_database(service, database="sqlite/sqlite.db" if service == "sqlite" else "sqlx", cwd=dir_tests)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/home/adam/projects/sqlx/tests/docker.py", line 24, in start_database
    res = subprocess.run(
          ^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 548, in run
    with Popen(*popenargs, **kwargs) as process:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/lib/python3.12/subprocess.py", line 1026, in __init__
    self._execute_child(args, executable, preexec_fn, close_fds,
  File "/usr/lib/python3.12/subprocess.py", line 1955, in _execute_child
    raise child_exception_type(errno_num, err_msg, err_filename)
FileNotFoundError: [Errno 2] No such file or directory: 'docker-compose'

madadam avatar Feb 10 '25 07:02 madadam

Ok, I think the problem is that when a parent pool is used (which is the case in those failing tests), the child pool's semaphore is created with zero initial permits, so trying to acquire any permits on it in close causes a deadlock. I need to think about how to fix this.
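
A rough sketch of that situation, again with std only and not taken from sqlx: `ToySemaphore` is a hypothetical counting semaphore, and the assumption that close tries to acquire `max_connections` permits at once is mine, made purely for illustration. The timeout turns the would-be deadlock into a `false` result.

```rust
use std::sync::{Condvar, Mutex};
use std::time::Duration;

/// Hypothetical counting semaphore built on Mutex + Condvar.
struct ToySemaphore {
    permits: Mutex<usize>,
    available: Condvar,
}

impl ToySemaphore {
    fn new(permits: usize) -> Self {
        ToySemaphore {
            permits: Mutex::new(permits),
            available: Condvar::new(),
        }
    }

    /// Add permits back and wake any waiters.
    fn add_permits(&self, n: usize) {
        *self.permits.lock().unwrap() += n;
        self.available.notify_all();
    }

    /// Take `n` permits at once; returns `false` if the timeout elapses first.
    fn acquire_many(&self, n: usize, timeout: Duration) -> bool {
        let guard = self.permits.lock().unwrap();
        let (mut guard, result) = self
            .available
            .wait_timeout_while(guard, timeout, |p| *p < n)
            .unwrap();
        if result.timed_out() {
            return false;
        }
        *guard -= n;
        true
    }
}

/// Assumed model of close: acquire every permit so nothing else can check out.
fn close_succeeds(initial_permits: usize, max_connections: usize) -> bool {
    let semaphore = ToySemaphore::new(initial_permits);
    semaphore.acquire_many(max_connections, Duration::from_millis(50))
}

/// Same as above, but someone hands the permits over before close runs.
fn close_succeeds_after_refill(max_connections: usize) -> bool {
    let semaphore = ToySemaphore::new(0);
    semaphore.add_permits(max_connections);
    semaphore.acquire_many(max_connections, Duration::from_millis(50))
}
```

With a semaphore that starts at zero permits, `close_succeeds(0, 5)` can never complete in this model unless something adds permits first, whereas `close_succeeds(5, 5)` returns immediately.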

madadam avatar Mar 03 '25 10:03 madadam

@madadam I think we could just get rid of the parent/child pool thing. I've been conceptualizing a whole new architecture for Pool that it wouldn't fit into anyway.

Instead, we could just divide a default max_connections value, say, 64, by the number of test threads being spawned, and use a semaphore to lock that many permits at a time and give that many connections to each test (edit: actually, I'm not sure this is necessary, and it would seem to break when using nextest anyway).

We could use an environment variable, SQLX_TEST_MAX_CONNECTIONS to control the number of connections being divided up, and a control attribute to #[sqlx::test] to adjust the max_connections the pool should have (less or more).
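
A possible shape for that arithmetic, as a sketch only: `SQLX_TEST_MAX_CONNECTIONS` is the variable proposed above and does not exist yet, the function names are hypothetical, and falling back to 64 on a missing or invalid value is an assumption.

```rust
/// Default total taken from the "say, 64" suggestion above.
const DEFAULT_MAX_CONNECTIONS: u32 = 64;

/// Parse the proposed variable's value; fall back to the default when it is
/// missing, not a number, or zero (assumed behaviour).
fn parse_total(raw: Option<&str>) -> u32 {
    raw.and_then(|s| s.trim().parse::<u32>().ok())
        .filter(|&n| n > 0)
        .unwrap_or(DEFAULT_MAX_CONNECTIONS)
}

/// Read the total connection budget from the environment.
fn total_from_env() -> u32 {
    parse_total(std::env::var("SQLX_TEST_MAX_CONNECTIONS").ok().as_deref())
}

/// Split the budget evenly across test threads, giving each test at least one.
fn per_test_connections(total: u32, test_threads: u32) -> u32 {
    (total / test_threads.max(1)).max(1)
}
```

For example, a budget of 64 split across 8 test threads gives each test 8 connections. As noted above, this split may not be necessary at all and would not map cleanly onto nextest.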

abonander avatar Apr 13 '25 09:04 abonander

Re. tests/x.py, I don't personally use it and the CI doesn't use it, so it's at the mercy of someone bothering to update it when it breaks. I've been meaning to get rid of it, but some people find it useful so it's not an easy decision. I also don't know what I would replace it with. Justfile, maybe? If anything?

Being able to run the same tests CI performs locally would be awesome, but there's also the issue of having a single source of truth for the tests. If commands get added to x.py/the Justfile that aren't tested in CI, we have the same problem again. But I don't want CI to just be x.py --all-tests because that would have awful concurrency and wouldn't give great feedback on GitHub without setting up bots. So then adding a new test means adding it to the x.py/Justfile/whatever, and also adding it to CI.

https://github.com/nektos/act seems promising but it needs some tweaking since it doesn't support ubuntu-24.04 out of the box yet.

The top result I get from Reddit about locally runnable CI is "just use Makefiles"... gross.

abonander avatar Apr 13 '25 09:04 abonander

I'm thinking it'd be really neat if cargo test just worked. Maybe using testcontainers.

abonander avatar Apr 13 '25 10:04 abonander