State locks should require a keepalive (heartbeat), or release the lock
State locks are a pain. Hear me out. All it takes is for the deployer system to run out of disk or memory, for the terraform apply to be interrupted, or for AWS creds to expire, and you have a mess on your hands: your automation breaks and cannot be rerun without manual intervention.
I'd like to suggest that a keepalive message be required to maintain a lock on a resource. The lock itself could be a time entry. If the Terraform process can't keep updating that time entry, we should have a mode that allows us to presume it has failed and the lock is no longer valid. This would really help automation recover more gracefully on subsequent runs.
Additionally, the process can kill itself if it fails to post a keepalive on the lock, so both sides are taken care of, ensuring that an expired lock is never still in use.
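To make the idea concrete, here is a minimal in-memory sketch of a lock whose validity depends on a regularly refreshed time entry. It is only an illustration of the proposal, not anything Terraform implements: `HeartbeatLock`, `LockRecord`, and the `ttl_seconds` parameter are hypothetical names, and a real backend would need to store the record remotely and update it atomically.

```python
import time
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class LockRecord:
    holder: str
    last_heartbeat: float  # the "time entry" that the holder must keep updating


class HeartbeatLock:
    """Hypothetical lock store; not part of Terraform's backend API."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.time):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the behaviour can be tested without sleeping
        self.record: Optional[LockRecord] = None

    def _expired(self) -> bool:
        # A missing record, or one whose time entry is older than the TTL, is not a valid lock.
        return self.record is None or self.clock() - self.record.last_heartbeat > self.ttl

    def acquire(self, holder: str) -> bool:
        # Take the lock if it is free or if its holder has stopped sending keepalives.
        if self._expired():
            self.record = LockRecord(holder, self.clock())
            return True
        return False

    def heartbeat(self, holder: str) -> bool:
        # Refresh the time entry. A False result means the lock was lost,
        # which is the holder's cue to stop instead of continuing to write state.
        if self.record is None or self.record.holder != holder or self._expired():
            return False
        self.record.last_heartbeat = self.clock()
        return True

    def release(self, holder: str) -> None:
        # Only the current holder may release the lock.
        if self.record is not None and self.record.holder == holder:
            self.record = None
```

With a 30-second TTL, a second run that calls `acquire` is refused while the first run keeps calling `heartbeat`, and succeeds once more than 30 seconds pass without one. The first run's next `heartbeat` then returns `False`, which covers the "process kills itself" side of the proposal.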
Interesting suggestion, thanks for the enhancement request!
Hi @queglay,
Thanks for filing the enhancement request. The current state locking API was designed to be implemented across a wide variety of services, and the API only defines the minimal requirements for the locking mechanism. In order to implement a "heartbeat", a locking service must be actively monitoring the Terraform process, which is something that most passive services cannot do. There are some cases where the locks are already effectively limited to the current process. The Consul state storage, for example, requires an active client, behaving in exactly the way you describe, and filesystem locks only exist for the lifetime of the process. Some other storage mechanisms may use something like a lease with an expiration, which while they may not be able to implement any sort of active monitoring of the client, the lease can expire after a certain amount of time.
Since this is already possible with state locking as it exists now, there's not much else we can do within the current implementation. It would be possible in a future version of remote state storage to require a heartbeat mechanism for locks, but that would unfortunately severely limit the possible implementations.
"Some other storage mechanisms may use something like a lease with an expiration, which while they may not be able to implement any sort of active monitoring of the client, the lease can expire after a certain amount of time."
I don't see S3 / DynamoDB doing this in 0.13, and stale locks are a problem every time my cloudshell session gets killed. If it doesn't use leases that can expire, why can't it?
Hi @queglay, although I do not have a good answer for your question, I am going to close this issue as it is not likely to be implemented in Terraform in the future. That said, if one of the maintainers has time, they may yet answer your question. Thanks again for the enhancement suggestion!
I don't understand why you are closing the request when the specifics remain unanswered. State locks are an absolute PITA and I'm not satisfied with the existing response.
I am happy to re-open this if you feel it was closed prematurely.
Hi @jbardin, may I check my understanding of what you're saying?
- Locking behaviour is dependent on the specific backend being used
- The S3 backend uses a DynamoDB lock that doesn't expire
- The Consul backend uses an expiring lock and maintains its lease while the process is active
So, the nature of the problem isn't so much that Terraform itself doesn't support locks with an expiry and a keepalive, but that the S3 backend doesn't. I'm guessing that @queglay, like ourselves, is also using the S3 backend.
DynamoDB locks are definitely designed to be used with an expiry and a keepalive. We use this approach elsewhere in our stack, so it seems like it should be possible to replicate the approach used for the Consul locks with the DynamoDB locks.
If we were to reframe this request as "use expiry and keepalive for the S3 backend lock", does that make this feature request more tractable?
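As a rough sketch of what an expiring DynamoDB lock could look like, the function below builds the arguments for a conditional `PutItem` request. The `LockID` key matches the partition key the S3 backend's lock table requires; the `Owner` and `ExpiresAt` attributes, the function name, and the TTL are assumptions made for illustration and are not what the S3 backend writes today.

```python
import time
from typing import Any, Dict, Optional


def build_lease_put(table: str, lock_id: str, owner: str, ttl_seconds: int,
                    now: Optional[int] = None) -> Dict[str, Any]:
    """Build keyword arguments for a DynamoDB PutItem call that takes or renews a lease."""
    now = int(time.time()) if now is None else now
    return {
        "TableName": table,
        "Item": {
            "LockID": {"S": lock_id},
            "Owner": {"S": owner},                       # hypothetical attribute
            "ExpiresAt": {"N": str(now + ttl_seconds)},  # hypothetical attribute
        },
        # Write only if no lock exists, the existing lease has lapsed,
        # or the caller already owns the lock (a keepalive renewal).
        "ConditionExpression": "attribute_not_exists(LockID) OR #exp < :now OR #own = :owner",
        "ExpressionAttributeNames": {"#exp": "ExpiresAt", "#own": "Owner"},
        "ExpressionAttributeValues": {":now": {"N": str(now)}, ":owner": {"S": owner}},
    }
```

The result could be passed to boto3 as `boto3.client("dynamodb").put_item(**build_lease_put(...))`, where a failed condition surfaces as a `ConditionalCheckFailedException`. The same call, repeated by the owner before `ExpiresAt` passes, would serve as the keepalive. Note that this compares timestamps generated by clients, so it assumes their clocks are roughly in sync.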
Hi @alexf101,
This request was not for the S3 backend specifically, but rather a general keepalive/expiration mechanism. If you wish to submit something for only the AWS S3 backend, we can have the AWS team review the request and determine if it's a feature they want and if it works within the backend API constraints of Terraform. This issue was mainly left here for consideration as a concern when creating a new API for backends someday.
For background, the original intent of the Terraform API was that anything out of the ordinary happening would require manual intervention. The locks were put in place to protect the remote state, and a lock left open should be a rare event that requires investigation and verification that the state is as it should be.