documentation for backup and restore of Vault
Is your feature request related to a problem? Please describe. A very common task for any sysadmin is to automatically back up the data of all applications. The same obviously applies to Vault (and since it's a secret management application, it's one of the critical assets). Unfortunately, the only documentation for Vault's maintenance I was able to find was https://www.vaultproject.io/docs/install/index.html - the installation guide.
Backup and restore docs are, IMO, an essential part of the documentation.
Describe the solution you'd like
Ideally, I'd like to see an Administration (or Maintenance) section on https://www.vaultproject.io/docs/install/index.html which would include a manual on how to (a) install, (b) back up, and (c) restore data from a backup. It should also mention which files/directories and other data should be preserved to be able to successfully re-install Vault while keeping the data.
For examples of such documentation, see https://docs.gitlab.com/omnibus/README.html or https://www.jfrog.com/confluence/display/RTF/Managing+Backups
Describe alternatives you've considered
I've read through the docs, searched, and decided to use the mailing list: https://groups.google.com/forum/?utm_medium=email&utm_source=footer#!msg/vault-tool/GDhj-KVqtHk/87iY0QwbDAAJ
It did the trick - I received very helpful answers - which I believe belong in the actual product documentation.
Explain any additional use-cases
I hope this issue is self-explanatory. Feel free to ask me to clarify if it's not.
Additional context
n/a
+1 from me. It would be helpful if there were some recommendations, success stories, etc. around this.
CoreOS has a short doc for backing up Vault to an AWS S3 bucket: https://coreos.com/tectonic/docs/latest/vault-operator/user/recovery.html
@antcs I think even the CoreOS authors may have gotten it wrong.
As stated in #7191, even if you can make an atomic snapshot of the backend, Vault itself doesn't make its changes in an atomic way in its backend. Meaning there is no way you can guarantee your backup is in a state which is consistent (and therefore usable) if Vault is running. The only way you can currently get a consistent snapshot of Vault's data is to stop Vault, back up the backend, and start Vault again.
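To make that concrete, here is a minimal sketch of the offline procedure, assuming Vault runs under systemd and uses the file storage backend at /opt/vault/data (both are assumptions - adjust for your own setup and backend):

#!/bin/sh
# Offline backup sketch: stop Vault, archive the backend data, restart Vault.
set -e
systemctl stop vault
tar -czf "/backups/vault-$(date +%F).tar.gz" -C /opt/vault data
systemctl start vault
# Vault comes back sealed and has to be unsealed afterwards:
# vault operator unseal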
Vault itself doesn't make its changes in an atomic way in its backend. Meaning there is no way you can guarantee your backup is in a state which is consistent (and therefore usable) if Vault is running. The only way you can currently get a consistent snapshot of Vault's data is to stop Vault
Backups aside, if Vault does not make transactional writes with any backend, and also does not know how to recover from an atomic point-in-time storage-level snapshot of these potentially logically incomplete writes (by applying redo/undo logs or such from the storage), does this not also mean that Vault cannot reliably recover from an abrupt instance failure in between two writes?
Please tell me that is not the case ... @siepkes
@thiloplanz I'm quite sure that's the case (in the worst case). That's also why I freaked out reading the original response on the Vault mailing list.
@thiloplanz Yeah that thought occurred to me too. I'm no expert on Vault's low level storage so what follows is mostly my deduction and assumptions so I could be wrong.
On the mailing list, Chris Hoffman (HashiCorp employee and Vault committer) stated:
Since our storage layer is generic, we do not have a way to perform atomic transactions for multiple writes required for some operations. You could end up corrupting your data but it really just ends up that the behavior is undefined and there isn’t any guarantee here.
A quick glance at, for example, the PostgreSQL storage implementation shows that it exposes a kind of low-level generic interface to the rest of the application. The rest of the application uses this interface to (sometimes) perform compound actions, for example calling the update function twice to perform what is functionally a single operation. This is in contrast to, say, a storage API which would expose high-level operations and wrap the two updates in a single transaction, or which would expose a transaction API in the storage abstraction itself so the caller can indicate what constitutes a compound operation.
So backend data can get corrupted during an abrupt failure like an application panic. The only thing that could save you from a really bad day is if Vault were smart enough to recover (i.e. start normally with minimal data loss) from an inconsistent (i.e. corrupt) data backend. I can't really find anything that points to such capabilities in the source (again, I could be wrong). If that were the case, it wouldn't be a problem to end up with an inconsistent backup, since Vault would still be able to recover from it, and the backup advice would simply be: "back up the backend with the tools provided by the backend". But that doesn't seem to be the case.
So backend data can get corrupted during an abrupt failure like an application panic. The only thing that could save you [...] is if Vault were smart enough to recover (i.e. start normally with minimal data loss) [...]. I can't really find anything that points to such capabilities in the source.
Meaning that regardless of choice of storage backend, a sudden power outage at an unfortunate point in time can leave Vault in an undefined state.
@chrishoffman Is this assessment correct?
@chrishoffman I don't want to be pushy or sound alarmist (I realize you don't owe me anything), but I'm somewhat unsettled by the fact that I currently don't really see how one can make a proper backup of Vault (i.e. a consistent dump while Vault is running). Automated shutdown and startup of Vault seems like a risky operation to perform daily for backups. Could you give some feedback on this? I'd love to hear it if I'm talking nonsense ;-).
Did anyone find a working solution for creating backups? I feel very uncomfortable without backups on production :slightly_smiling_face:
@pznamensky Sure -- take atomic snapshots at the storage level.
Vault doesn't write everything transactionally because we can't rely on having that capability in storage, but instead we write the code such that a failure in the middle of a request can be tolerated. We do this in various ways, via how we order writes, using WALs, etc. We can always improve this, but the idea that Vault will be in some unworking undefined state if improperly shut down isn't the case, and thus atomic storage snapshots are also fine.
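For example, with the file backend on an LVM volume, an atomic storage-level snapshot could look like the sketch below (hypothetical volume and mount point names; any mechanism that snapshots the volume atomically - ZFS, EBS snapshots, etc. - achieves the same thing):

# Create an atomic point-in-time snapshot of the volume holding Vault's data
lvcreate --snapshot --size 1G --name vault-snap /dev/vg0/vault-data
# Mount the frozen state read-only and archive it
mount -o ro /dev/vg0/vault-snap /mnt/vault-snap
tar -czf "/backups/vault-$(date +%F).tar.gz" -C /mnt/vault-snap .
umount /mnt/vault-snap
lvremove -y /dev/vg0/vault-snap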
@jefferai Thanks for your answer!
So the definitive answer is that making an atomic snapshot of the backend is enough and Vault will work with that?
I'm double-checking because what your HashiCorp co-conspirator :wink: @chrishoffman says on the mailing list seems to contradict what you're saying (emphasis mine):
Since our storage layer is generic, we do not have a way to perform atomic transactions for multiple writes required for some operations. You could end up corrupting your data but it really just ends up that the behavior is undefined and there isn’t any guarantee here.
I'm going to preface this post with the following: a) Vault is a very nice piece of software which solves a very hard problem, b) I do not want to sound entitled to a solution, just to bring attention to this important issue (as I understand it, at least), and c) I'm very thankful for the work provided on this project.
But to me the current backup situation seems extremely worrying, to the point that I'm afraid to run Vault in production environments.
Replies above stated that Vault's behavior when restoring from a hard crash (kill -9/power issues) is undefined even if the storage backend can provide consistency guarantees (such as Postgres or another DBMS). That would not be the end of the world if Vault could be consistently backed up, but again, the replies above imply that backing up the storage backend cannot guarantee a valid Vault state, even if the backup is made atomically.
I absolutely second everything @mouzfun said, including the preface.
I'm thinking that, given the discrepancy between @jefferai's comment above and the replies on Google Groups, it would be best if this were simply documented clearly in the official docs, to have the definitive answer... nudge, nudge, pretty please HashiCorp? :-)
Please add a backup/restore guide. I got here after I searched the documentation and didn't find a way to make a backup. It would be great if such a procedure were documented and battle-tested. Thank you!
Is it possible to either get a statement from HashiCorp that the open-source version of HashiCorp Vault cannot be backed up, or get official documentation on how to back up its data in a safe way? I think this is a show-stopper issue for a lot of individuals and companies. Thank you in advance for your kind help!
Even an authenticated "vault secret kv dump" and "restore" would help immensely, like we are able to do with Consul.
For the moment, we're doing Vault backups by "migrating" data from Consul to a filesystem storage backend. Since "Vault's data is encrypted at rest", we just make sure access to this backup is restricted. The data can then be restored and brought back up using the key fragments.
vault operator migrate -config vault-migrate-backup.hcl
storage_source "consul" {
  address = "127.0.0.1:8500"
  path    = "vault"
}

storage_destination "file" {
  path = "/tmp/vault-backup"
}
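Restoring is presumably the same operation in reverse, with source and destination swapped (an untested sketch derived from the config above; vault-migrate-restore.hcl is just a hypothetical filename):

vault operator migrate -config vault-migrate-restore.hcl

storage_source "file" {
  path = "/tmp/vault-backup"
}

storage_destination "consul" {
  address = "127.0.0.1:8500"
  path    = "vault"
}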
Hope this is not a bad idea.
Using vault operator migrate sounds very elegant.
It does sound like it may not be completely safe though (unless you shut Vault down while doing that):
https://www.vaultproject.io/docs/commands/operator/migrate
This is intended to be an offline operation to ensure data consistency, and Vault will not allow starting the server if a migration is in progress. ... Vault will need to be offline during the migration process. First, stop Vault. Then, run the migration on the server you wish to become the new Vault node.
In the case of the Raft storage backend, the snapshot seems to be a reliable solution. Right?
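For reference, with integrated (Raft) storage the built-in snapshot commands look like this (run with a suitably privileged token; save works against a live cluster):

vault operator raft snapshot save backup.snap
vault operator raft snapshot restore backup.snap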
@michelvocks With all my appreciation for the great application that Vault is and the work behind it, I believe this is not appropriately tagged as docs and enhancement.
Backup/restore is an essential feature of every product, and the lack of a clear way to achieve it is, in my opinion, a high-priority bug.
Could we get an official statement on this situation please? Thanks a lot. Vault is great in every other regard.
Yeah. We are planning to use Vault in production... Tbh this is a complete show stopper...
Wow - I was quite unpleasantly surprised by this. One question though: if I were to write a backup script by stopping Vault, tar'ing the data, and starting Vault again (and unsealing) - is it guaranteed that stopping Vault in the normal way (I'm running it as a Docker image) will do so safely, meaning all writes are finished before the process exits? Something like the sketch below:
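(A rough sketch of what I have in mind, assuming the container is named vault and its data volume lives at /srv/vault/data on the host - both names are assumptions:)

#!/bin/sh
set -e
# docker stop sends SIGTERM and waits (10 seconds by default) before SIGKILL,
# which should give Vault the chance to finish in-flight writes and exit cleanly.
docker stop vault
tar -czf "/backups/vault-$(date +%F).tar.gz" -C /srv/vault data
docker start vault
# The restarted Vault comes back sealed and needs to be unsealed again.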
For what it's worth, if you are using Consul as storage, you seem to be able to do a proper backup. I am new to this and have limited experience, but I was able to trash my namespace holding Consul and Vault, then restore Vault from backup.
In consul-master-0:
# consul snapshot save vault-dev.snap
# consul snapshot inspect vault-dev.snap
Copy the snapshot somewhere for safekeeping. After trashing the environment and rebuilding it, copy the snapshot back to the new consul-master-0 and log in:
# consul snapshot inspect vault-dev.snap
# consul snapshot restore vault-dev.snap
Your Vault will be sealed. Unseal with the old unseal keys and voilà!
Hope this helps.
Making atomic storage backend backups is your best bet so far, yes. But the problem is that, apparently (according to the Google Groups post mentioned above), Vault does not write its state atomically, even if the storage backend itself supports it.
@pznamensky Sure -- take atomic snapshots at the storage level.
Vault doesn't write everything transactionally because we can't rely on having that capability in storage, but instead we write the code such that a failure in the middle of a request can be tolerated. We do this in various ways, via how we order writes, using WALs, etc. We can always improve this, but the idea that Vault will be in some unworking undefined state if improperly shut down isn't the case, and thus atomic storage snapshots are also fine.
Like @mouzfun , I am grateful for the free software offered to the community by Hashicorp.
However, for running in production, one needs to know how the guarantees are implemented. "Various ways, using WALs (several?), etc." is too vague.
I also do not quite understand "we can't rely on having that capability in storage". I don't see that as a valid reason for not doing it when the backend does support it.
Without proper guarantees, this issue is a deal breaker for production use.
The way we are tackling this, running Vault with Raft as the storage backend, is to run a service on each node which backs up the data snapshot. The service I wrote is inspired by the etcd-manager built within the scope of the kops project to back up etcd.
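As a rough illustration of the idea, the core of such a service can be a periodic snapshot job like this hypothetical cron entry (the real service also ships the snapshot off the node, and a valid token must be available to the invoking user, e.g. via a token helper):

# /etc/cron.d/vault-snapshot (hypothetical): hourly Raft snapshot on each node
0 * * * * vault VAULT_ADDR=https://127.0.0.1:8200 vault operator raft snapshot save /var/backups/vault-$(date +\%F-\%H).snap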
Not sure why HashiCorp hasn't pointed to this...
The following HashiCorp support article details the migration of Vault data stored in Consul:
This article provides some detail and starting points related to migration of Vault data stored in a Consul cluster for the purposes of informing your own Vault backup/restore and data migration strategies when using Consul as your Vault storage backend.
@weakcamel can this issue be closed?
@bbros-dev not everyone uses a consul cluster.
@darkpixel, true. There are, at this point in time, 22 backends.
Is the expectation this issue should be considered addressed when all 22 backends have such documentation?
For my 2c: the Consul backend is a core OSS component which HashiCorp gave us, so it is great that that backend's use case has been documented - giving the 'starting points' for users of other backends to consider. The filesystem backend use case is trivial, since it is a single-server (dev/play) scenario. Or, as @jefferai suggested:
... take atomic snapshots at the storage level.
The CoreOS document illustrates @jefferai's subtle point that exactly what is required to back up and restore data really depends on the backend you use.
Again, my 2c: I agree. The state of our application/secrets at any point in time is outside of Vault's view and is our responsibility. Likewise, the state of the storage backend is also beyond Vault's knowledge/control (e.g. your HDD cache, etc.). IF Vault restricted your backend choice, AND was closed source, I could understand some of the objections.
Your backend selection process will (one hopes) have addressed what the backend's backup/restore processes are.
For example, @mazzy89 rolled his own.
P.S. I don't understand @duckie's objection and all the upvotes:
... For running in production, one needs to know how the guarantees are implemented. "Various ways, using WALs (several?), etc." is too vague.
Which is why you chose an open-source component - you know 'exactly' how everything on the Vault side is implemented: you have the source code.
No?