duplicacy
Improving terminology: snapshot-id very misleading & more
This ticket is meant as a start to making the language usage in Duplicacy much easier to understand, so that more people will use Duplicacy. Finding an actual technical writer/editor with very strong English skills will help here (at the very least, the editor should have very strong English).
From the Documentation:
This init command connects the repository with the remote storage at 192.168.1.00 via SFTP. It will initialize the remote storage if this has not been done before. It also assigns the snapshot id mywork to the repository. This snapshot id is used to uniquely identify this repository if there are other repositories that also back up to the same storage. You can now create snapshots of the repository by invoking the backup command.
This gives us the following misleading language:
- Terms:
  - repository: something we want to back up
  - storage: a destination for the backups
  - snapshot-id: a unique identifier for the repository
  - snapshots: backups of the repository
- Relations:
  - One or more repositories can share a storage.
  - Each repository has a snapshot-id.
  - Snapshots of the repository are made to the storage.
  - Each snapshot has a revision.
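To make these relations concrete, here is a minimal Python sketch of the current terminology. It is an illustration only, not Duplicacy's actual data model: the class names, the example paths and URL, and the rule that revisions count up from 1 are all assumptions made for the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Storage:
    """A backup destination that several repositories may share."""
    url: str
    # snapshot id -> revision numbers stored under that id (assumed layout)
    snapshots: Dict[str, List[int]] = field(default_factory=dict)


@dataclass
class Repository:
    """A directory to back up; the storage knows it only by its snapshot id."""
    path: str
    snapshot_id: str

    def backup(self, storage: Storage) -> int:
        """Record a new snapshot in the storage and return its revision."""
        revisions = storage.snapshots.setdefault(self.snapshot_id, [])
        revision = len(revisions) + 1  # assumption: revisions count up from 1
        revisions.append(revision)
        return revision


# Hypothetical storage URL and repository paths.
storage = Storage("sftp://backup.example/storage")
work = Repository("/home/me/work", "mywork")
photos = Repository("/home/me/photos", "myphotos")

work.backup(storage)
work.backup(storage)
photos.backup(storage)
print(storage.snapshots)  # {'mywork': [1, 2], 'myphotos': [1]}
```

Note how the key called snapshot id names the repository, while the individual snapshots are told apart only by revision. That mismatch is the source of the confusion described here.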
In common usage, it would make sense that a repository has a repository-id, and a snapshot has a snapshot-id.
Initial changes:
- The term snapshot-id should be replaced with repository-id.
- Repository is used in other programs as a destination for backups; this could cause confusion.
- There is no such word as Storages. Storage is an uncountable noun. The documentation's usages of Storages would make more sense as Storage Types or Storage Targets.
Consider this as potential improved language:
- Source: The content source for the backup
- Destination: A destination target for the backup
- Destination Type: The type of the destination (such as local-file, S3, SFTP), possibly with parameters.
- Each source has a unique source-id.
- Backups are snapshots of the source, made to one or more destinations.
- Each snapshot has a revision.
The proposed terminology is a lot easier to understand 👍
I will work on this tonight and push a PR as soon as I can. Thanks for the suggestions.
Sorry for not responding earlier -- I just came back from a multiple-day trip without access to a computer.
I agree the current terminology, especially the use of repository/storage, may cause a lot of confusion. However, I chose them because there wasn't a better alternative. I don't like source/destination, because they are too general. Any better term for the directory to be backed up other than repository?
Another reason I chose repository is that it is used by git/hg. The command model of Duplicacy closely follows that of git/hg, and I think this term is appropriate in the sense that you can put everything you want to back up here and it will be backed up automatically.
I'm open to suggestions. But source/destination do not sound like a good choice for me.
In the Git model, the repository IS the destination. git commit takes something that you staged from your working directory and puts it into the repository; git push copies from your local repository to a remote repository.
I do see your concern that source/destination have the potential to be too generic. Let's see if there are any other common terms shared by other backup apps that we can leverage.
how about repository/depot?
No, I think we should get away from repository entirely, since it doesn't say where it is in the Git/Hg sense.
I agree with moving away from the term repository. It was confusing to me also. I disagree with @gilbertchen that source (or even source data) is an inferior term to repository. I find the latter confusing, as, it seems, do at least some others.
I'm hopeful that one day in the not-too-distant future duplicacy will support backing up data on stdin (e.g. backing up the output of tar). This would argue further for a generic term like source.
So I think the proposed terms in @robbat2's OP are good.
@robbat2 Thank you for beginning to take us down this path; starting this discussion has been on my to-do list for months, but I've just never found the time to do it.
I also think that there is value to creating a graphic that depicts all of the terms, since a picture is often worth a thousand words, so hopefully that can be one of the changes made as a result of this issue.
@gilbertchen I disagree that it's better to use terms that are misleading/confusing than to use terms that are overly general. Neither is ideal, and I fully support trying to find terms that everyone agrees are both precise and accurate (and taking time to search for them rather than leaping immediately at the first proposed solution as soon as it's set forth, because I'm sure there will be a non-trivial effort to update the code to make it match the terms chosen), but if we have to choose between the two, I choose accuracy over precision.
If we're trying to draw parallels with Git's lexicon, the terms we'd be using would be local and remote repositories. But I don't think that we necessarily should be trying to use Git's vocabulary as a conscious design goal, because 1) Git has a notion of all repositories being peers/equivalent with bi-directional updates between them whereas backup software has a clear notion of the original and the one or more derived copies where updates (not restores) are unidirectional, and 2) we're not using Git's verbs such as push and pull, for good reason. We'd be better off consulting the vocabulary of other backup software (CrashPlan, Duplicity, etc.) for inspiration rather than simply mirroring Git's.
That was all pretty abstract, but here are my suggestions for a set of vocabulary terms to use:
- Backup Set: The content source for the backup
- Storage Location: A destination target for the backup
- Storage Location Type: The type of the destination (such as local-file, S3, SFTP), possibly with parameters.
- Snapshot: Same as current definition: A set of files that collectively represent the current state of the backup set at a defined point in time.
- Each backup set has a unique backup-set-id.
- Each storage location has a unique storage-location-id.
- Each snapshot has a unique snapshot-id, which is the same no matter how many storage locations the snapshot is written to.
- Backups are the act of creating snapshots of the backup set and writing them to one or more storage locations.
- The term "revision" is struck from the vocabulary.
Those terms are more in line with the terms I've seen used by other backup software products rather than version control products. And the only one that might not be intuitive to someone who's never looked at a backup product before is "backup set," but it's the term used by other products (e.g. CrashPlan) and it makes sense once it's explained (so users might be confused the first time they see it but will remember what it means thereafter).
As a suggestion, so we don't pollute this issue a lot: why not make a Google spreadsheet that lists each current term, with possible suggestions in the following columns, maybe a column for votes on each alternative, and a comments column?
That way we can check the suggestions and discuss them far more easily.
(offtopic: sorry for gdrive retry implementation being so late, i was out of country, and im doing 10hours @ work. i'm not quite in the mood to work on anything after coming home :-s )
How about "Subject" for the thing that is to be backed up? Or "Data collection"?
"Source" is OK for me, too, as it's where the data comes from: it's the source of the data handled by the backup process.
I find "Backup set" confusing, as it's not clear whether this is a set of things to be backed up, or the set of things that form the backup.
I like Storage and Snapshot, they make sense.
[By the way, I'm another Crashplan refugee. I'm very pleased to have discovered Duplicacy, and to find that it works so well! The terminology is a minor issue, it doesn't take long to work out how to back up and restore, and what the terms mean. The excellent de-duplication is worth a fortune!]
Why is there a need to invent new terminology here? Duplicacy is a backup program, and it should adopt language very similar to that used by other backup programs; this will improve the user experience. Take some widely used backup applications, analyze what terms they use for the different objects they work on, and apply them here. Duplicacy will differentiate itself with technology, not terminology.
I started a googledoc here to collect existing terminology. https://docs.google.com/spreadsheets/d/14-ZJyWGgw5jt163Jh_3Wkq8jRY25R-TfPYUfxAg-x1U/edit?usp=sharing
Glad I found this thread.
Just finished reading the documentation (June '19) and had exactly the same reaction as @robbat2.
Any reason why documentation cannot be updated to include the industry-standard terminology suggested by @tbain98?
Same here as all the others - looking at prune documentation, I started doubting everything I knew about duplicacy, backups, and the world. https://forum.duplicacy.com/t/prune-command-details/1005
OPTIONS:
   -id <snapshot id>    delete snapshots with the specified id instead of the default one
   -all, -a             match against all snapshot IDs
   -r <revision> [+]    delete snapshots with the specified revisions
   -t <tag> [+]         delete snapshots with the specified tags
I was led here to this issue thread as I went searching Google with the exact same confusion around the prune command. Three years on, this one needs to be resolved.
I'm new to duplicacy and still learning the bits, but, assuming my understanding is correct, there's a relationship that's missing in this discussion. A "snapshot-id" is a unique identifier for a repository for a given storage. If you think about it in those terms it makes more sense to call it "snapshot-id".
"repository-id" as described in the OP would indicate that it is globally unique, but this is not the case. You can have different "snapshot-ids" for the same repository but different storage / targets. You can probably back up the same repository as two different "snapshot-ids" to the same storage / target. I don't think there's much benefit in that, so we'll write that off (it is nevertheless possible).
From what I understand, the terminology from other backup programs doesn't map very well to duplicacy, due to the fact that you can have multiple "snapshot-ids" pointing to a given storage / target.
Another point of confusion is that the CLI takes the term "storage name" as "the collection of storage target, repository, and snapshot-id". So rather than being a name of a target, which I think is what the GUI does, since there is a global configuration for "storages", it's a unique index on a set of three columns. From the CLI perspective this makes sense (although maybe not the name) as it has already been mentioned that setting two different snapshot-ids within a single storage target for the same repository is a ridiculous use case.
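One way to picture this is as a per-repository lookup table from storage name to a (storage target, snapshot-id) pair. The Python sketch below is only an illustration of that reading: the names `StorageEntry`, `preferences`, and `resolve`, the URLs, and the ids are all hypothetical and are not taken from Duplicacy's code or configuration format.

```python
from typing import Dict, NamedTuple


class StorageEntry(NamedTuple):
    """What a storage name resolves to for one repository (assumed shape)."""
    storage_url: str
    snapshot_id: str


# Hypothetical configuration for a single repository: the same directory is
# known under a different snapshot id in each storage it backs up to.
preferences: Dict[str, StorageEntry] = {
    "default": StorageEntry("sftp://nas.example/backups", "laptop-docs"),
    "offsite": StorageEntry("s3://bucket.example/backups", "docs-offsite"),
}


def resolve(storage_name: str = "default") -> StorageEntry:
    """Look up the storage target and snapshot id behind a storage name."""
    return preferences[storage_name]


print(resolve("offsite").snapshot_id)  # docs-offsite
```

Under this reading, the storage name is a handle for the whole pairing rather than a name for the target alone, which is why the same repository can carry a different id in each storage.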
Honestly, duplicacy works similarly to git in some ways. Effectively, the snapshot id "bk_1" is a git branch. The steps below can be performed against a different repository in order to create additional duplicacy "branches".
# While there are more steps in git what seems to be happening is relatively the same.
# duplicacy init bk_1 sftp://[email protected]/path/to/storage
git init
git remote add origin http://[email protected]/path/to/storage.git
git checkout --orphan bk_1
# Not valid due to empty branch.. But whatever..
git push --set-upstream origin bk_1
Performing a backup is the same as a commit.
# duplicacy backup
git add -A
git commit -m 'backup'
git tag revision-01
git push origin
It could be argued that this might apply to other backup solutions as well. I disagree. Duplicacy is unique in that a single storage / target can have multiple "snapshot-ids" which are effectively git branches. And they all benefit from the shared storage / target (dedupe, etc).
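The shared-storage benefit can be sketched in a few lines of Python. This is a toy model under loud assumptions: it uses tiny fixed-size chunks and SHA-256 purely for illustration, which is not a description of how Duplicacy actually chunks or hashes data, and the `Storage` class and its fields are invented for the example.

```python
import hashlib
from typing import Dict, List, Tuple

CHUNK_SIZE = 4  # tiny fixed-size chunks, for illustration only


class Storage:
    """Toy storage in which every snapshot id draws on one shared chunk pool."""

    def __init__(self) -> None:
        self.chunks: Dict[str, bytes] = {}  # chunk hash -> chunk data
        # (snapshot id, revision) -> ordered chunk hashes
        self.snapshots: Dict[Tuple[str, int], List[str]] = {}

    def backup(self, snapshot_id: str, data: bytes) -> int:
        """Store data under snapshot_id and return the new revision number."""
        revision = 1 + sum(1 for sid, _ in self.snapshots if sid == snapshot_id)
        hashes = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            digest = hashlib.sha256(chunk).hexdigest()
            # A chunk is kept once, however many snapshots refer to it.
            self.chunks.setdefault(digest, chunk)
            hashes.append(digest)
        self.snapshots[(snapshot_id, revision)] = hashes
        return revision


storage = Storage()
storage.backup("bk_1", b"AAAABBBBCCCC")
storage.backup("bk_2", b"AAAABBBBDDDD")
print(len(storage.chunks))  # 4 unique chunks stored, rather than 6
```

The two snapshot ids behave like independent branches, yet the chunks they have in common are stored only once in the shared storage.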
@gilbertchen Is there any plan to improve the terminology?