[Bug]: Deployment hangs after `loading project from: /opt/shuttle/shuttle-executables/{GUID}` if there is an already-running deployment
What happened?
Previously, it was reasonably safe to deploy even if there was an already-completed and happily running deployment. Shuttle would build the project, quietly take down the existing deployment, and then start up the new one. As of 0.14 this now hangs in most cases. I think it was happening occasionally in 0.13, but it's much more common now.
When this happens, Shuttle will simply hang. From this state the only way out is to Ctrl+C.
At this point the project state is busted: running `cargo shuttle project list` shows the project as `errored` with the message `project did not become healthy in time`. Listing deployments doesn't work. The project has to be restarted via `cargo shuttle project restart`.
After the restart, if you run `cargo shuttle deployment list` you'll see the first deployment (not the one that stalled; that one is listed as `stopped`, or sometimes `crashed`) still running, despite the log below stating that it was successfully stopped.
From here, if you try to deploy again, you land right back in the stuck state. You must run `cargo shuttle stop` until the old deployment is actually brought down. This can take a few tries (some folks are getting 404s when trying this, which seems to be a separate, intermittent issue) but eventually works. After that it's safe to deploy again.
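For reference, the recovery sequence described above boils down to the following (these are exactly the commands named in this report; the `stop` usually needs to be repeated before it takes effect):

```sh
cargo shuttle project restart   # clear the "errored" project state
cargo shuttle deployment list   # the old deployment still shows as running
cargo shuttle stop              # repeat until the old deployment is actually down
cargo shuttle deploy            # safe to deploy again
```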
It seems like Shuttle gets a successful response for stopping the prior deployment (see the `StopResponse { success: true }` line in the log below) even though it doesn't actually stop, but I don't know why.
Version
v0.14.0
Which operating systems are you seeing the problem on?
Windows
Which CPU architectures are you seeing the problem on?
x86_64
Relevant log output
2023-04-15T00:55:25.050358898Z TRACE shuttle_deployer::runtime_manager: making new client
2023-04-15T00:55:25.054553079Z DEBUG shuttle_deployer::runtime_manager: Starting alpha runtime at: /opt/shuttle/shuttle-executables/ca8afdf9-43ae-44f3-b044-fc8613b5acdd
2023-04-15T00:55:27.057863772Z INFO shuttle_proto::runtime: connecting runtime client
2023-04-15T00:55:27.057935003Z DEBUG hyper::client::connect::http: connecting to 127.0.0.1:20788
2023-04-15T00:55:27.060551348Z DEBUG hyper::client::connect::http: connected to 127.0.0.1:20788
2023-04-15T00:55:27.062842346Z DEBUG {service.ready=true} tower::buffer::worker: processing request
2023-04-15T01:02:24.444196002Z TRACE {response="Response { metadata: MetadataMap { headers: {\"content-type\": \"application/grpc\", \"date\": \"Sat, 15 Apr 2023 05:02:24 GMT\", \"grpc-status\": \"0\"} }, message: StopResponse { success: true }, extensions: Extensions }"} shuttle_deployer::runtime_manager: stop deployment response
2023-04-15T00:55:27.065435878Z INFO shuttle_deployer::deployment::run: loading project from: /opt/shuttle/shuttle-executables/ca8afdf9-43ae-44f3-b044-fc8613b5acdd
Duplicate declaration
- [X] I have searched the issues and there are none like this.
I'm unsure if it's relevant, but the code I'm running is here if you need to look into what it's doing, dependencies, etc.: https://github.com/vexx32/thread-tracker
Hey @vexx32, thanks for this report! I tried to reproduce this now but I couldn't with a simple example. I tried deploying the axum hello-world example several times while it was already running, and I also tried making changes to the example, as well as running with and without `--no-test`. It may be that the type of the project has something to do with it, for example if databases are used. :thinking: I'll take a look at your repo as well.
@oddgrd I had a successful redeploy today for the first time in a bit, but I redeployed it shortly after that again and hit this issue once again. 🤔 It's not perfectly consistent but it's consistent enough to be a bit of a hassle 😅
It should be zero hassle, so we definitely need to figure this one out :sweat_smile: It would be great to narrow it down to a minimal reproducible example though, since a simple axum hello-world project does not trigger it.
I'm wondering if it might have to do with me running periodic background tasks or something 🤔 the implementation of that in my project is fairly naive I would say, and it does use the database connection and Discord connections, so I wonder if keeping those sessions around causes Problems(tm)...
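To illustrate what I mean, here's a minimal sketch of the pattern (names and the query are purely illustrative, not the actual thread-tracker code; the real task also holds Discord connections):

```rust
use std::time::Duration;
use sqlx::PgPool;

// Hypothetical example of a "naive" periodic background task: it is detached
// with tokio::spawn, owns a DB pool, and has no shutdown signal.
fn spawn_background_task(pool: PgPool) {
    tokio::spawn(async move {
        let mut ticker = tokio::time::interval(Duration::from_secs(60));
        loop {
            ticker.tick().await;
            // The task holds `pool` (and, in the real project, Discord
            // sessions) for the life of the process, so the connections stay
            // open while the old deployment is supposedly being stopped.
            let _ = sqlx::query("SELECT 1").execute(&pool).await;
        }
    });
}
```

The point is just that nothing ever tells this task to stop, so I wonder whether it keeps resources alive during the swap-over.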
i'm experiencing something similar, although not sure if it's the same - i haven't been able to deploy successfully. each time i try, the deploy hangs at that same spot, although not due to previous deployments, since i got in the habit of running `cargo shuttle project restart && cargo shuttle deploy`
here's the repo i'm trying to deploy - https://github.com/DrewMcArthur/axum-webserver-example/
it does run successfully locally via `cargo shuttle run`. last deployment has been stuck there for around 8 hours.
imo, the first issue that should be fixed is to make sure that any errors underneath make it into these logs, so that i can see where it's getting stuck. as far as i can tell, it's happening here - https://github.com/shuttle-hq/shuttle/blob/c334a1c08cf82bc998ea6413edbbd3b969566373/deployer/src/deployment/run.rs#L337
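to illustrate, something roughly like this around the spot i linked would already help (a hypothetical helper, not the actual deployer code):

```rust
use tracing::error;

// Hypothetical sketch: await a deployment future and log its error instead of
// letting the failure disappear, so a stuck deployment at least says why it
// stopped making progress.
async fn run_and_report<F, E>(fut: F)
where
    F: std::future::Future<Output = Result<(), E>>,
    E: std::fmt::Display,
{
    if let Err(e) = fut.await {
        error!(%e, "deployment task failed");
    }
}
```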
let me know how i should proceed and how i could help debug!
today's update: i cloned the repo & got stuff up and running locally following the contributing docs.
I tried to make changes to the logs around the line i linked in my last comment, but i couldn't figure out how to use the new code for the deployer. (edit: figured out i have to re-run `make images`)
either way, i did find these logs in the docker container running the deployer, in case those are helpful to anyone else trying to figure this out (this handful of lines repeats, seems like the client is polling for project status):
2023-07-16 22:29:07 2023-07-17T04:29:07.690984Z DEBUG hyper::client::connect::dns: resolving host="otel-collector"
2023-07-16 22:29:07 2023-07-17T04:29:07.698154Z DEBUG tonic::transport::service::reconnect: reconnect::poll_ready: hyper::Error(Connect, ConnectError("dns error", Custom { kind: Uncategorized, error: "failed to lookup address information: Name or service not known" }))
2023-07-16 22:29:07 2023-07-17T04:29:07.698200Z DEBUG tower::buffer::worker: service.ready=true processing request
2023-07-16 22:29:07 2023-07-17T04:29:07.698242Z DEBUG tonic::transport::service::reconnect: error: error trying to connect: dns error: failed to lookup address information: Name or service not known
2023-07-16 22:29:07 OpenTelemetry trace error occurred. Exporter otlp encountered the following error(s): the grpc server returns error (The service is currently unavailable): , detailed error message: error trying to connect: dns error: failed to lookup address information: Name or service not known
2023-07-16 22:29:56 2023-07-17T04:29:56.322199Z DEBUG hyper::proto::h1::io: parsed 1 headers
2023-07-16 22:29:56 2023-07-17T04:29:56.322234Z DEBUG hyper::proto::h1::conn: incoming body is empty
2023-07-16 22:29:56 2023-07-17T04:29:56.322541Z DEBUG request{http.uri=/projects/v3/status http.method=GET request.path="/projects/:project_name/status" account.name=""}: tower_http::trace::on_request: started processing request
2023-07-16 22:29:56 2023-07-17T04:29:56.322671Z DEBUG request{http.uri=/projects/v3/status http.method=GET request.path="/projects/:project_name/status" account.name="" request.params.project_name="v3" http.status_code=200}: shuttle_common::backends::metrics: finished processing request latency=244625 ns
2023-07-16 22:29:56 2023-07-17T04:29:56.322837Z DEBUG hyper::proto::h1::io: flushed 118 bytes
also, here's a link to my help thread on discord
it looks like this is because the otel-collector keeps restarting due to a missing Datadog API key in my config. so probably unrelated to the main issue here, but it would be nice to be able to turn off exporting logs to Datadog for this local dev case
For context: In Drew's case, it was determined that this was caused by Tera when it did not find the templates folder. https://discord.com/channels/803236282088161321/1129993029596479568/1130808129232318505
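For anyone else hitting the same symptom, a minimal sketch of the kind of startup code involved (hypothetical, assuming a typical Tera setup; not the exact code from the linked repo):

```rust
use tera::Tera;

// Hypothetical template setup that assumes a `templates/` directory exists
// relative to the working directory at startup.
fn build_templates() -> Tera {
    // If the templates folder isn't present in the deployed files, loading can
    // fail here and abort startup before the runtime ever reports as healthy.
    // Per the Discord thread linked above, that is what surfaced as the
    // "loading project from: ..." hang in Drew's case.
    Tera::new("templates/**/*.html").expect("failed to load templates")
}
```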