0.6.0: Pool::close does not completely close when awaited
I have a somewhat-hackish setup that creates a test database for every integration test: on struct creation it creates the database along with a connection pool, and on drop it closes the pool before deleting the database. (Side note: if someone has a better way of doing this, I am definitely interested.)
Since upgrading to v0.6.0, dropping the database fails because there are still open connections:
Could not drop test db because of error returned from database: database "slicendice-integration-test-b61901b5-b86a-4e9b-820b-a8c81bf3cc8" is being accessed by other users
It only seems to happen on macOS (I have not tested locally on Linux, but here is a CI run demonstrating the problem).
I have not been able to create a small example that demonstrates the problem; the following is my attempt, which does not exhibit it.
Simple example
use std::env;
use std::time::Duration;
use anyhow::{Context, Result};
use async_std::task;
use sqlx::Connection;
use sqlx::{PgConnection, PgPool};
use tide::http::Url;
use uuid::Uuid;
async fn use_db(pool: PgPool) -> Result<()> {
    sqlx::query(
        "CREATE TABLE things (
            id SERIAL PRIMARY KEY,
            name TEXT NOT NULL
        )",
    )
    .execute(&pool)
    .await?;

    sqlx::query(
        "INSERT INTO things ( name )
         VALUES ( 'A name' )",
    )
    .execute(&pool)
    .await?;

    Ok(())
}

#[async_std::main]
async fn main() -> Result<()> {
    let db_url_string =
        env::var("DATABASE_URL").context("DATABASE_URL env variable needs to be set")?;

    // Create DB
    let mut db_url = Url::parse(&db_url_string)?;
    let id: Uuid = Uuid::new_v4();
    let mut db_name = String::from("slicendice-integration-test-");
    db_name.push_str(&id.to_string());
    db_url.set_path("/");
    let mut conn = PgConnection::connect(&db_url.to_string()).await?;
    sqlx::query(&format!(r#"CREATE DATABASE "{}""#, &db_name))
        .execute(&mut conn)
        .await?;

    // Use DB
    let mut db_path = String::from("/");
    db_path.push_str(&db_name);
    db_url.set_path(&db_path);
    let pool = PgPool::connect(&db_url.to_string()).await?;
    use_db(pool.clone()).await?;

    // Drop DB
    pool.close().await;
    dbg!(pool.size());
    dbg!(pool.num_idle());
    let mut db_url = db_url.clone();
    db_url.set_path("/");
    let mut conn = PgConnection::connect(&db_url.to_string()).await?;
    // Using format! here because bind parameters don't work for DROP DATABASE
    sqlx::query(&format!(r#"DROP DATABASE "{}""#, &db_name))
        .execute(&mut conn)
        .await?;

    Ok(())
}
My guess would be that because I am passing the pool to tide's context, it is being held longer than in my simple example: when I create a test that only builds the scaffold and makes a call to the database without creating a tide application, the problem does not happen.
I have added a call to Pool::size after an awaited Pool::close; in v0.5 it is consistently equal to 0, while in v0.6 it is always equal to 1.
Hey, so I just ran into this with my similarly funky custom test harness. I was able to home in on the issue: the Drop implementation closes the database pool using a block_on executor, i.e. an executor different from the Tokio executor the pool was created in, not to mention the one sqlx was feature-flag-configured with. Looking at your link, it appears to be a similar case.
Good day.
Thanks for taking the time to look at my code. Do you mean that you were using tokio as an executor but used async-std's block_on? As far as I know, I'm using async-std everywhere. Do you know if there is anything in my feature flags that I should be doing differently?
Having a similar issue for an identical use case, and it's not just macOS: I see it in a Docker setup as well. However, my issue is with Postgres, where pool.close().await never returns when run from a different thread.
@TimDeve Not all futures are portable across runtimes, even if the runtimes are from the same library and, in some cases, even if they're configured identically. In this case, the connections in the pool are being spawned and used on the (probably multi-threaded) runtime your app is using, but you're trying to close them on a simple block_on runtime. What I did to fix the bug was add a dedicated async close function on the Scaffold and use block scoping so that it can be called after all the objects borrowing from it have been dropped.
I believe this is what you are suggesting: no block_on, everything but the pool has been dropped by the block scope, and the async drop is called manually from within the same runtime.
However, the error is still there.
It seems the refactoring of Pool::close between 0.5 and 0.6 has introduced a regression, as the docs say the following:
[Pool::close] returns a Future which can be .awaited to ensure all connections are gracefully closed. It will first close any idle connections currently waiting in the pool, then wait for all checked-out connections to be returned or closed.
It clearly doesn't wait; otherwise the following error wouldn't happen:
database "slicendice-integration-test-8af97dc3-158e-42f6-b16a-364ec4d3f37" is being accessed by other users
@Yohe-Am
What I did to fix the bug was use a dedicated async close function on the Scaffold and used block scoping in order to be able to call it after all the objects borrowing from it are dropped.
I'm confused what you mean here. If you don't put the cleanup calls in the Drop impl, then on failure the database won't be cleaned up. Unfortunately, Drop isn't async so we must create a separate runtime. Which causes issues with closing pools (at least with postgres).
After some additional testing, it seems the problem doesn't happen if the pool has been dropped. So a potential workaround, if your setup is similar (a Pool field in a struct on which you've implemented Drop), is to wrap the Pool in an Option and set it to None once you have called Pool::close.
Here's an example in Gist.
I'm trying to figure out what has changed in the sqlx source code to cause that but I've not found it yet.
Maybe I should open a new issue. I'm pretty sure @TimDeve's solution would block here with Postgres.
Ah, you're right, we have opposite problems, it seems? Mine is that Pool::close returns despite not having closed all connections, and yours is that it doesn't return 😅
@jbis9051 I'm confused what you mean here. If you don't put the cleanup calls in the Drop impl, then on failure the database won't be cleaned up.
Yes, this is correct. Sorry, I forgot to mention that. The built-in test runner also has this as a "feature" for debugging's sake. But this got me curious, and the internal impl of PoolConnection makes use of the sqlx-rt crate to fetch a statically stored common runtime for async-drop purposes. I guess this means that pool connections are designed to be transferable across runtimes from the same library, and this is indeed a bug.
@TimDeve However the error is still there.
Curious. You can see what worked for me here and here. I did try dropping the pool first but it did not help.
After some investigation, I see that the async-std block_on function is not the same as futures::executor::block_on, and yes, from the finding above, that should be sufficient for pool.close to function. Apologies for the confusion.
But this got me curious and the internal impl of PoolConnection makes use of the sqlx-rt crate to fetch a statically stored common runtime for AsyncDrop purposes.
No, it does not. It references the Tokio runtime set for the current thread the code is executing on (sqlx_rt::Handle is a re-export of tokio::runtime::Handle), and only if a runtime-tokio-* feature is enabled. That static RUNTIME is for use in sqlx-macros only, which is implied by the comment immediately above it. It doesn't really need to exist in sqlx-rt for that reason, which is why I've moved it in my current refactoring effort (#2039).
I guess this means that pool connections are designed to be transferable across runtimes from the same library and this is indeed a bug.
They are not, and Tokio actually does not like it when you move TcpStreams from one runtime to another. It continues to work, but only until the TcpStream's original runtime exits, at which point any attempt to use it will panic.
async-std only ever has one runtime per process so that's different.
The builtin test runner also has this as a "feature" for debugging sakes.
That is a deliberate feature, yes. If you have a test failure, it's useful to go back and look at the contents of the database to help figure out what went wrong. They won't pile up forever, though: old test databases are cleaned up on every startup of any binary using #[sqlx::test] (as stated in the paragraph after the one that describes this feature), so at most you'll have as many as the number of failed tests in the latest run.
I had intended to add commands to sqlx-cli to manage test databases but I wanted to focus on getting the core #[sqlx::test] functionality working and merged.
All that aside, I've realized that there's likely a TOCTOU (time-of-check to time-of-use) bug in Pool::close(): it thinks all connections have been released and returns, but there may still be some in flight.
Specifically, here: https://github.com/launchbadge/sqlx/blob/main/sqlx-core/src/pool/inner.rs#L257
If .close() is called on another thread between .acquire_permit() and .try_increment_size() while the size is zero, .close() will exit because it sees the size is zero and (incorrectly) assumes that there will never be any more connections in the pool.
It used to acquire self.options.max_connections permits right away, which would preclude any concurrent connection attempts, but that had to be changed with the #[sqlx::test] functionality as that allows pools to have a parent/child relationship where the child steals semaphore permits from the parent as needed, in order to enforce a global connection limit for all running tests. Simply put, there may not be max_connections permits on the semaphore anymore, and stealing them from the parent pool just so the child pool can close would be wasteful.
I think the fix here would be to change .try_increment_size() to immediately return Err(permit) if the pool is closed, and then the logic in acquire() will loop back to the top and then return Err(Error::PoolClosed) on the next call to .acquire_permit().
That should eliminate the possibility of size being zero and thus .close() returning while connection attempts can still be pending.
@abonander Have you been able to find any details on the Postgres issue? Or is it simply unsupported because of Tokio?
The original issue concerns async-std, not Tokio. The discussion as a whole has been muddied quite a bit because of that.
Regarding Tokio, using a separate runtime to close the pool is likely going to break things so I would avoid doing that. You should also never use futures::executor::block_on() as that may prevent Tokio's I/O driver from running, especially on a current-thread runtime like that created with #[tokio::test].
A minimal example reproducing the issue would be a big help.
@abonander I can create a new issue if need be. Here's an example:
use sqlx::migrate::MigrateDatabase;
use sqlx::postgres::PgPoolOptions;
use sqlx::{Executor, Pool, Postgres};
use std::{env, thread};
use tokio::runtime::Runtime;
use uuid::Uuid;
struct TempDb {
    pool: Option<Pool<Postgres>>,
    db_url: String,
}

impl TempDb {
    pub async fn new() -> TempDb {
        let db_url = format!("{}/{}", env::var("DB_URL").unwrap(), Uuid::new_v4());
        Postgres::create_database(&db_url).await.unwrap();
        let pool = PgPoolOptions::new()
            .max_connections(5)
            .connect(&db_url)
            .await
            .unwrap();
        Self {
            pool: Some(pool),
            db_url,
        }
    }

    pub fn pool(&self) -> &Pool<Postgres> {
        self.pool.as_ref().unwrap()
    }
}

impl Drop for TempDb {
    fn drop(&mut self) {
        let pool = self.pool.take().unwrap();
        let db_url = self.db_url.clone();
        thread::spawn(move || {
            let runtime = Runtime::new().unwrap();
            runtime.block_on(async {
                println!("inside runtime");
                pool.close().await;
                println!("closed");
                drop(pool);
                println!("dropped");
                Postgres::drop_database(&db_url).await.unwrap();
                println!("dropped db");
            });
        })
        .join()
        .expect("thread failed");
    }
}

#[tokio::test]
pub async fn example() {
    let db = TempDb::new().await;
    let pool = db.pool();
    pool.execute("SELECT 1").await.unwrap();
}
Thank you @abonander for looking into this.
It seems that #[sqlx::test] essentially does what my hacky scaffold did, but better, so I shall switch to that.
@jbis9051 yeah, you're deadlocking the #[tokio::test] runtime by doing that. Using the pool in the test function spawns a task to return the connection to the pool; that task can't run while the TempDb Drop impl is blocking the thread, because Tokio's test runtimes are single-threaded, and that in turn prevents the pool.close().await call from returning.
Like I said on Discord, I recommend using #[sqlx::test] because it takes care of all this for you.