duplicacy Improving terminology: snapshot-id very misleading & more

This ticket is meant as a start to making the language usage in Duplicacy much easier to understand, so that more people will use Duplicacy. Finding an actual technical writer/editor with very strong English skills will help here (at the very least, the editor should have very strong English).

From the Documentation:

This init command connects the repository with the remote storage at 192.168.1.00 via SFTP. It will initialize the remote storage if this has not been done before. It also assigns the snapshot id mywork to the repository. This snapshot id is used to uniquely identify this repository if there are other repositories that also back up to the same storage. You can now create snapshots of the repository by invoking the backup command.

This gives us the following misleading language:

Terms:
** repository: something we want to back up
** storage: a destination for the backups
** snapshot-id: a unique identifier for the repository
** snapshots: backups of the repository

Relations:
** One or more repositories can share a storage.
** Each repository has a snapshot-id.
** Snapshots of the repository are made to the storage.
** Each snapshot has a revision.

In common usage, it would make sense that a repository has a repository-id, and a snapshot has a snapshot-id.
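
To make those relations concrete, here is a minimal shell sketch that mocks up one storage shared by two repositories. It assumes the storage keeps one directory per snapshot id and one file per revision under `snapshots/`, next to a shared `chunks/` directory. That layout, the temp directory, and the second id `laptop` are illustrative assumptions, not something taken from the documentation quoted above.

```shell
#!/bin/sh
# Mock storage only: nothing here calls duplicacy itself.
# Assumed layout: snapshots/<snapshot id>/<revision>, plus a shared chunks/ directory.
storage=$(mktemp -d)
mkdir -p "$storage/chunks" "$storage/snapshots/mywork" "$storage/snapshots/laptop"

# Each backup of a repository adds one revision under that repository's snapshot id.
touch "$storage/snapshots/mywork/1" "$storage/snapshots/mywork/2"
touch "$storage/snapshots/laptop/1"

# The "snapshot id" level lists repositories, not individual snapshots.
echo "snapshot ids:"
ls "$storage/snapshots"

# The individual snapshots are the revisions underneath each id.
echo "revisions of mywork:"
ls "$storage/snapshots/mywork"
```

Read this way, the directory name is what the documentation calls the snapshot id and the numbered files are what it calls revisions, which is why repository-id and snapshot-id would be the less surprising names.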

Initial changes:

The term snapshot-id should be replaced with repository-id.
Repository is used in other programs as a destination for backups; this could cause confusion.
There is no such word as Storages. Storage is an uncountable noun. The documentation's uses of Storages would make more sense as Storage Types or Storage Targets.

Consider this as potential improved language:

Source: The content source for the backup
Destination: A destination target for the backup
Destination Type: The type of the destination (such as local-file, S3, SFTP), possibly with parameters.
Each source has a unique source-id.
Backups are snapshots of the source, made to one or more destinations.
Each snapshot has a revision.

Sep 03 '17 02:09 robbat2

The proposed terminology is a lot easier to understand 👍

Sep 03 '17 09:09 jbrodriguez

I will work on this tonight and push a PR as soon as I can. Thanks for the suggestions.

Sep 04 '17 21:09 flamingm0e

Sorry for not responding earlier -- I just came back from a multiple-day trip without access to a computer.

I agree the current terminology, especially the use of repository/storage, may cause a lot of confusion. However, I chose them because there wasn't a better alternative. I don't like source/destination, because they are too general. Any better term for the directory to be backed up other than repository?

Another reason I chose repository is that it is used by git/hg. The command model of Duplicacy closely follows that of git/hg, and I think the term is appropriate in the sense that you can put everything you want to back up here and it will be backed up automatically.

I'm open to suggestions. But source/destination do not sound like a good choice to me.

Sep 05 '17 03:09 gilbertchen

In the Git model, the repository IS the destination. git commit takes something that you staged from your working directory and puts it into the repository; git push copies from your local repository to a remote repository.

I do see your concern that source/destination have the potential to be too generic. Let's see if there are any other common terms shared by other backup apps that we can leverage.

Sep 05 '17 16:09 robbat2

how about repository/depot?

Sep 05 '17 18:09 gilbertchen

No, I think we should get away from repository entirely, since it doesn't say where it is in the Git/Hg sense.

Sep 05 '17 19:09 robbat2

I agree with moving away from the term repository. It was confusing to me also. I disagree with @gilbertchen that source (or even source data) is an inferior term to repository. I find the latter confusing, as, it seems, do at least some others.

I'm also hopeful that one day in the not-too-distant future duplicacy will support backing up data from stdin (e.g. backing up the output of tar). This would argue further for a generic term like source.

So I think the proposed terms in @robbat2's OP are good.

Sep 05 '17 21:09 level323

@robbat2 Thank you for beginning to take us down this path; starting this discussion has been on my to-do list for months, but I've just never found the time to do it.

I also think that there is value to creating a graphic that depicts all of the terms, since a picture is often worth a thousand words, so hopefully that can be one of the changes made as a result of this issue.

@gilbertchen I disagree that it's better to use terms that are misleading/confusing than to use terms that are overly general. Neither is ideal, and I fully support trying to find terms that everyone agrees are both precise and accurate (and taking time to search for them rather than leaping immediately at the first proposed solution as soon as it's set forth, because I'm sure there will be a non-trivial effort to update the code to make it match the terms chosen), but if we have to choose between the two, I choose accuracy over precision.

If we're trying to draw parallels with Git's lexicon, the terms we'd be using would be local and remote repositories. But I don't think that we necessarily should be trying to use Git's vocabulary as a conscious design goal, because 1) Git has a notion of all repositories being peers/equivalent with bi-directional updates between them whereas backup software has a clear notion of the original and the one or more derived copies where updates (not restores) are unidirectional, and 2) we're not using Git's verbs such as push and pull, for good reason. We'd be better off consulting the vocabulary of other backup software (CrashPlan, Duplicity, etc.) for inspiration rather than simply mirroring Git's.

That was all pretty abstract, but here are my suggestions for a set of vocabulary terms to use:

Backup Set: The content source for the backup
Storage Location: A destination target for the backup
Storage Location Type: The type of the destination (such as local-file, S3, SFTP), possibly with parameters.
Snapshot: Same as current definition: A set of files that collectively represent the current state of the backup set at a defined point in time.
Each backup set has a unique backup-set-id.
Each storage location has a unique storage-location-id.
Each snapshot has a unique snapshot-id, which is the same no matter how many storage locations the snapshot is written to.
Backups are the act of creating snapshots of the backup set and writing them to the one or more storage locations.
The term "revision" is struck from the vocabulary.

Those terms are more in line with the terms I've seen used by other backup software products rather than version control products. And the only one that might not be intuitive to someone who's never looked at a backup product before is "backup set," but it's the term used by other products (e.g. CrashPlan) and it makes sense once it's explained (so users might be confused the first time they see it but will remember what it means thereafter).

Sep 10 '17 03:09 tbain98

As a suggestion, so we don't pollute this issue too much: why not make a Google spreadsheet listing each current term, with possible suggestions in the following columns, and maybe a column for votes on each alternative and a comments column?

That way we can check the suggestions and discuss them much more easily.

(Off-topic: sorry the gdrive retry implementation is so late. I was out of the country, and I'm doing 10 hours at work, so I'm not quite in the mood to work on anything after coming home :-s )

Sep 10 '17 11:09 TheBestPessimist

How about "Subject" for the thing that is to be backed up? Or "Data collection"?

"Source" is OK for me, too, as it's where the data comes from: it's the source of the data handled by the backup process.

I find "Backup set" confusing, as it's not clear whether this is a set of things to be backed up, or the set of things that form the backup.

I like Storage and Snapshot, they make sense.

[By the way, I'm another Crashplan refugee. I'm very pleased to have discovered Duplicacy, and to find that it works so well! The terminology is a minor issue, it doesn't take long to work out how to back up and restore, and what the terms mean. The excellent de-duplication is worth a fortune!]

Sep 24 '17 20:09 Fonant

Why is there a need to invent new terminology here? Duplicacy is a backup program and it should really adopt language very similar to that used by other backup programs; this will improve the user experience. Take some widely used backup applications, analyze what terms they use for the different objects they work on, and apply them here. Duplicacy will differentiate itself with technology, not terminology.

Sep 25 '17 01:09 dgcom

I started a Google spreadsheet here to collect existing terminology: https://docs.google.com/spreadsheets/d/14-ZJyWGgw5jt163Jh_3Wkq8jRY25R-TfPYUfxAg-x1U/edit?usp=sharing

Sep 25 '17 23:09 robbat2

Glad I found this thread.

Just finished reading the documentation (June '19) and had exactly the same reaction as @robbat2.

Any reason why documentation cannot be updated to include the industry-standard terminology suggested by @tbain98?

Jun 06 '19 05:06 gtusr

Same here as all the others - looking at prune documentation, I started doubting everything I knew about duplicacy, backups, and the world. https://forum.duplicacy.com/t/prune-command-details/1005

OPTIONS:
   -id <snapshot id>            delete snapshots with the specified id instead of the default one
   -all, -a                     match against all snapshot IDs
   -r <revision> [+]            delete snapshots with the specified revisions
   -t <tag> [+]                 delete snapshots with the specified tags
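
For anyone else puzzling over these options, here is how I read them, as a shell sketch against a mock directory tree rather than a real storage. It assumes a `snapshots/<snapshot id>/<revision>` layout and treats `-id mywork -r 1` as selecting a single revision file. The directory names, the ids, and the `select_revision` helper are all hypothetical, and real pruning does more than this (the sketch ignores chunks entirely).

```shell
#!/bin/sh
# Mock tree only; no duplicacy command is run here.
storage=$(mktemp -d)
mkdir -p "$storage/snapshots/mywork" "$storage/snapshots/laptop"
touch "$storage/snapshots/mywork/1" "$storage/snapshots/mywork/2" "$storage/snapshots/laptop/1"

# Hypothetical helper: print the file that "-id <snapshot id> -r <revision>" would refer to.
select_revision() {
    printf '%s/snapshots/%s/%s\n' "$storage" "$1" "$2"
}

# "-id mywork -r 1": the id picks the repository's directory, the revision picks the snapshot.
target=$(select_revision mywork 1)
rm "$target"

# Count the revision files that survive across all ids.
remaining=$(find "$storage/snapshots" -type f | wc -l | tr -d ' ')
echo "revision files left: $remaining"
```

So "delete snapshots with the specified id" really means "delete snapshots belonging to the specified repository", and the thing that identifies an individual snapshot is the revision.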

Aug 06 '19 21:08 archon810

I was led here to this issue thread as I went searching Google with the exact same confusion around the prune command. Three years on, this one needs to be resolved.

Mar 13 '20 12:03 ilium007

I'm new to duplicacy and still learning the bits, but, assuming my understanding is correct, there's a relationship that's missing in this discussion. A "snapshot-id" is a unique identifier for a repository for a given storage. If you think about it in those terms it makes more sense to call it "snapshot-id".

"repository-id" as described in the OP would indicate that it is globally unique, but this is not the case. You can have different "snapshot-ids" for the same repository on different storages / targets. You can probably also back up the same repository as two different "snapshot-ids" to the same storage / target. I don't think there's much benefit in that, so we'll write that off (it is nevertheless possible).

From what I understand, the terminology from other backup programs doesn't map very well to duplicacy, because you can have multiple "snapshot-ids" pointing to a given storage / target.

Another point of confusion is that the CLI takes the term "storage name" as "the collection of storage target, repository, and snapshot-id". So rather than being a name of a target, which I think is what the GUI does, since there is a global configuration for "storages", it's a unique index on a set of three columns. From the CLI perspective this makes sense (although maybe not the name) as it has already been mentioned that setting two different snapshot-ids within a single storage target for the same repository is a ridiculous use case.
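
That "unique index on three columns" reading can be sketched in shell. The rows below are made-up examples (the targets, repository paths, and ids are hypothetical), and the check is only my interpretation of the CLI behaviour, not something taken from the duplicacy source.

```shell
#!/bin/sh
# Each row: storage target | repository path | snapshot-id (all values hypothetical).
rows='sftp://host/storage|/home/me/work|bk_1
sftp://host/storage|/home/me/photos|bk_2
s3://bucket/storage|/home/me/work|bk_1'

# Duplicate rows would violate the three-column unique index; count how many there are.
duplicate_rows=$(printf '%s\n' "$rows" | sort | uniq -d | wc -l | tr -d ' ')

# A snapshot-id on its own is not unique: list the ids that appear in more than one row.
repeated_ids=$(printf '%s\n' "$rows" | cut -d'|' -f3 | sort | uniq -d)

echo "duplicate (target, repository, snapshot-id) rows: $duplicate_rows"
echo "snapshot-ids used more than once: $repeated_ids"
```

The full triple is unique, while the snapshot-id alone repeats across targets, which is the relationship the name "repository-id" would hide.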

Honestly, duplicacy works similarly to git in some ways. Effectively, the snapshot id "bk_1" is a git branch. The steps below can be performed against a different repository in order to create additional duplicacy "branches".

# While there are more steps in git what seems to be happening is relatively the same.
# duplicacy init bk_1 sftp://[email protected]/path/to/storage
git init
git remote add origin http://[email protected]/path/to/storage.git
git checkout --orphan bk_1
# Not valid due to empty branch.. But whatever..
git push --set-upstream origin bk_1

Performing a backup is the same as a commit.

# duplicacy backup
git add -A
git commit -m 'backup'
git tag revision-01
git push origin

It could be argued that this might apply to other backup solutions as well. I disagree. Duplicacy is unique in that a single storage / target can have multiple "snapshot-ids" which are effectively git branches. And they all benefit from the shared storage / target (dedupe, etc).

Jun 03 '20 16:06 Sxderp

@gilbertchen Is there any plan to improve the terminology?

Jul 31 '24 11:07 riobard
