pgtemp icon indicating copy to clipboard operation
pgtemp copied to clipboard

Failure to set up database due to race between Postgres startup and `createdb`

Open reivilibre opened this issue 1 year ago • 4 comments

Hi, thanks for writing this, I was thinking of such a thing months ago and it's always nice to see someone else wrote what I wanted :D.

The only problem I'm having is that occasionally the tests fail to start up. This is much more reproducible when my backup software is running in the background, putting a load on the system (and disk?).

The output I get most often is as follows:

createdb failed! stdout: 

stderr: createdb: error: connection to server at "localhost" (127.0.0.1), port 41499 failed: Connection refused
        Is the server running on that host and accepting TCP/IP connections?

thread 'tests::test_oidc_auth_flow::test_userinfo_bad_auth' panicked at /home/rei/.cargo/registry/src/index.crates.io-6f17d22bba15001f/pgtemp-0.3.0/src/lib.rs:378:14:
failed to start pgtemp server: JoinError::Panic(Id(2), ...)

Occasionally I get the following instead

stderr: createdb: error: connection to server at "localhost" (127.0.0.1), port 35153 failed: FATAL:  the database system is starting up

I think these errors mean that Postgres hasn't started up yet, but pgtemp is already trying to run createdb.

I'm not sure if there's a perfect way of waiting for startup to happen (I guess you can possibly parse the logs?), so perhaps just trying createdb a few times in a loop with a small delay is the way to go.

I'm using pgtemp v0.3.0.

reivilibre avatar Jul 07 '24 11:07 reivilibre

Hi, thanks for reporting this! This is indeed a known issue: There's currently a 100ms sleep and a TODO in the code.

There are probably a few ways to fix this: Read the logs as you mentioned, use (or reimplement) pg_isready, make the startup delay configurable, or do some number of retries on createdb before actually failing (as you also mentioned).

Reading the logs is probably the best thing to do but probably entails something like spawning a new thread to consume the subprocess's piped stdout so that the pipe doesn't get blocked, and then the rest of the control flow gets messy.

It might be easiest to add a loop with a timeout/failure counter that calls pg_isready. It seems that all it really does is call PQpingParams (pg docs for PQpingParams) so another option would be to replace the createdb call with actually connecting to the db and doing CREATE DATABASE. Depending on how complicated pg's wire protocol is we might be able to get away with not taking an extra dependency and just crafting the packets ourselves.

If you'd like to make a PR that just retries createdb 5 times with the existing 100ms delay and prints out stderr/stdout from the last failure I'd be glad to accept it. Maybe add a TODO with a link to this issue as well.

boustrophedon avatar Jul 07 '24 18:07 boustrophedon

I have something that works here: https://github.com/kaaveland/eugene/blob/344a5915722bce810bfe44b8955bbde21c53ab3d/eugene/src/tempserver.rs#L71

It's based on the log idea and indeed does spawn a second thread to watch it, sending logs back using a channel. I was meaning to replace this file with calls to pgtemp eventually, so I could take a stab at a PR that implements the same kind of solution here later this summer?

kaaveland avatar Jul 07 '24 20:07 kaaveland

Or if the code from that file works for pgtemp, just take it - it's MIT licenses.

kaaveland avatar Jul 07 '24 20:07 kaaveland

Oh, this was the project on HN. Thanks again for the mention in your blog post :)

If you'd like to make the PR for the log watching code at some point that would be great. If other people mention this is an issue for them I might make a quick PR to just do the createdb retry loop.

The log-watching code looks good, the only thing that I'd want in pgtemp is a way to keep the logs rather than discarding them - might need to just wrap a string in Arc<Mutex<_>>, or just collect the logs in a separate thread and then send them on a channel when the process exits. So we'd have one channel for "we saw the ready to accept message" and one channel for "here's the full logfile".

The only issue might be spawning too many threads, but I feel like you'd probably run into issues with spawning too many databases before the extra threads become an issue.

boustrophedon avatar Jul 07 '24 21:07 boustrophedon

Hey, we were running into this issue a few times so I opened #20 to address the issue. It implements the retry createdb as @boustrophedon suggested.

2xic avatar Apr 02 '25 16:04 2xic

Closing, thanks again for the contribution!

boustrophedon avatar Apr 07 '25 02:04 boustrophedon