shuttle [Bug]: multiple concurrent deployments causing crashes

What happened?

When deploying a service while a previous deployment is still loading/building, the new deployment and/or the project container may crash. This error presents itself in different ways and we haven't been able to exactly nail down the cause of each crash, but some common ones are a sqlite "database is locked" error for failed deploys and the project container going into an errored state which requires restarting the project.

This is a tracking issue for this bug, which we aim to resolve in our refactor on the feat/shuttle-runtime-scaling branch. We will split the building step into a separate service, which will also hash the source-code of the deploy, and if the next deploy is the same it will simply not start building it.

Version

v0.20.0

Which operating system(s) are you seeing the problem on?

In deployment

Which CPU architectures are you seeing the problem on?

In deployment

Relevant log output

No response

Duplicate declaration

[X] I have searched the issues and there are none like this.

Jul 03 '23 12:07 oddgrd

Hi, I want to work on this issue. Can I get more information about where and why this error is happening? Please also provide any telemetry logs that might be useful in debugging this issue. Thanks!

Aug 03 '23 07:08 that-ambuj

Hey @that-ambuj, thank you for showing interest in this issue! So we believe the problem is that when a new deployment is started, the project container (deployer) will start building it. Then, if another is started before the first one is finished, they will both enter the building state, and they will both write log output to the deployer sqlite database concurrently. This can lead to the sqlite database locking up and the deployment crashing.

As described in the issue we have a solution for this on a feature branch, so I'm not sure how worthwhile it is to resolve this on main currently (it could be quite complicated too). Some time may pass before we are ready to merge the fix on the feature branch however, so it's worth considering solving it now.

If we do decide it's best to wait on this issue, I'll try to find some other issues you might work on, if you're interested!

What are your thoughts on this @chesedo and @iulianbarbu?

Aug 03 '23 14:08 oddgrd

So, is this issue happening because the same user is trying to deploy the same project in parallel(with or without knowledge) or is it because there are multiple users trying to deploy at the same time to shuttle?

Basically, is the sqlite file stored on a user's local machine/docker container or on shuttle's servers?

Aug 05 '23 08:08 that-ambuj

@that-ambuj In this context, the sqlite db is local in the project container. deploy->Ctrl-C->deploy will make two enter building at the same time.

Aug 05 '23 10:08 jonaro00

Thanks! Understood. So If I want to work on this issue, I should use the feat/shuttle-runtime-scaling branch, right?

Aug 05 '23 18:08 that-ambuj

Hey @that-ambuj, thank you for showing interest in this issue! So we believe the problem is that when a new deployment is started, the project container (deployer) will start building it. Then, if another is started before the first one is finished, they will both enter the building state, and they will both write log output to the deployer sqlite database concurrently. This can lead to the sqlite database locking up and the deployment crashing.

As described in the issue we have a solution for this on a feature branch, so I'm not sure how worthwhile it is to resolve this on main currently (it could be quite complicated too). Some time may pass before we are ready to merge the fix on the feature branch however, so it's worth considering solving it now.

If we do decide it's best to wait on this issue, I'll try to find some other issues you might work on, if you're interested!

What are your thoughts on this @chesedo and @iulianbarbu?

I think we'll have to solve the problem of multiple concurrent builds on the new builder system too. My thinking is that we can approach it in a few ways:

For now: support only one build for a created project at any time. This means we should respond with an error message whenever a deploy request comes in and deployment is in the building state already. We should also add an option to deploy with force, which will stop the existing building/cancel the current deployment and start a new build/deployment corresponding to the last deploy request.
For later, and questionable at the same time: support multiple concurrent builds/deployments for the same service. Each deployment will have a unique URL and be kept alive as long as the user wants, with the option of being shut down on request. This will definitely be helpful when doing deployment previews for a Shuttle based service, which might be a feature looked at by teams that collaborate on a codebase and need previews for testing purposes.

So If I want to work on this issue, I should use the feat/shuttle-runtime-scaling branch, right?

Not really, I would say we need to start off from the current codebase (main branch), at least for 1). We could start working on this unless my thoughts above don't miss anything out of the picture.

Aug 07 '23 06:08 iulianbarbu

Hey @iulianbarbu, Thanks a lot of making this issue clearer for me. I have started to find the cause for this bug, it would require to change a few lines at deployer/src/lib.rs:50.

https://github.com/shuttle-hq/shuttle/blob/609411c2d98e52f20d24830aa527598a4347dc9f/deployer/src/lib.rs#L51-L59

Essentially It would require checking the state of the last deployment and ask user to take the neccessary action. If my idea is correct, I think I should starting working on implementing 1.

Aug 07 '23 12:08 that-ambuj

Hi @jonaro00 and @iulianbarbu, I have made some changes at https://github.com/that-ambuj/shuttle/tree/fix/concurrent-builds but the tests are a bit inconsistent.

They sometimes fail and most of the times pass, and most notably deployment::deploy_layer::tests::deployment_to_be_queued is failing. I'm not sure if my changes are causing these tests to fail because they also fail in a similar fashion on main branch. Also during manual testing by deploying, the pre deployment tests are also failing, but my guess is that's because of my changes. It is making it harder for me to reproduce and confirm that the issue is fixed.

Can you tell me more about what that test is supposed to do, so I can fix it?

Aug 18 '23 09:08 that-ambuj

I recently discovered a line in the code base that is supposed to stop deployments running in the background that are in Building, Loading, Queued or Built state. And the git blame goes 7 months back, so maybe this issue is unnecessary?

https://github.com/shuttle-hq/shuttle/blob/38f42bd81543421659bbbe0044c79b561a03f116/deployer/src/lib.rs#L46

https://github.com/shuttle-hq/shuttle/blob/38f42bd81543421659bbbe0044c79b561a03f116/deployer/src/persistence/mod.rs#L307-L318

Aug 20 '23 07:08 that-ambuj

@that-ambuj That function is only called on deployer startup, and does not prevent multiple deployments from entering building. @iulianbarbu Do you want the deployment to not enter the queue at all or make it wait in the queue? (If the former, there would be no need for a queue 😝)

Aug 21 '23 07:08 jonaro00

@jonaro00 I want to achieve the former case to avoid locking the sqlite database for the time being until we implement concurrent builds in the queue. For that reason we will need the queue.

What I think that prevents concurrent access to the sqlite database is because of a misconfiguration of WAL.

Aug 21 '23 15:08 that-ambuj