Overhaul CLI Ids
Current state
The but CLI defines and makes use of a type CliId which is used when:
- Displaying relevant objects in the terminal
- Referring to relevant objects in command arguments
The canonical examples here are but status and but rub <source> <destination>
Some of the types of objects that currently receive a CLI id are:
- Commits
- Branches
- Uncommitted files
- Files that are part of a commit
Ergonomics
The main purpose of the CLI ids is to offer an ergonomic way of referring to relevant items. For that reason, currently the CLI ids are made up of just 2 lowercase alpha characters (ease of typing). In the case of commit shas, the first 2 chars are taken directly, whereas for files and branch names, a 2 character id is generated on the fly.
Interop
It is also intended that the CLI IDs can be replaced with the full identifier - e.g. you can use either 7d or 7d72456, either ve (generated ID) or Readme.md, and either tw or my-branch.
Collision handling
Currently it is possible for the system to generate more than one object with the same ID. When a CLI ID is passed as an argument and it is ambiguous, the application prompts the user to select which object they were referring to. Unfortunately, this is not applied consistently across all sub-commands.
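To make the interop and collision behaviour described above concrete, here is a minimal sketch of argument resolution. It is not the actual CliId implementation: the `Item` type, its fields, and the `resolve` function are hypothetical names chosen for illustration, and the colliding `ve` id is made up to show the ambiguous case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Item:
    """Hypothetical stand-in for an object that receives a CLI id."""
    kind: str       # "commit", "branch", "uncommitted_file", "committed_file"
    cli_id: str     # short id shown in the terminal
    full_name: str  # commit sha, branch name, or file path


def resolve(arg: str, items: list[Item]) -> list[Item]:
    """Return every item that `arg` could refer to.

    An argument matches on the short CLI id, on the full identifier,
    or (for commits only) on a sha prefix.
    """
    return [
        it for it in items
        if it.cli_id == arg
        or it.full_name == arg
        or (it.kind == "commit" and it.full_name.startswith(arg))
    ]


items = [
    Item("commit", "7d", "7d72456"),
    Item("uncommitted_file", "ve", "Readme.md"),
    Item("branch", "tw", "my-branch"),
    Item("committed_file", "ve", "src/lib.rs"),  # deliberately colliding id
]

# More than one candidate means the caller has to ask the user to pick one.
candidates = resolve("ve", items)
needs_prompt = len(candidates) > 1
```

If every sub-command went through a single function of this shape, the disambiguation prompt would only need to be implemented once.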
Issues with the current implementation
- Identifiers change / don't feel predictable: If an uncommitted file is amended into a commit (or vice versa), it gets a new ID. This also happens if the file is assigned to a different stack.
- There are bugs in handling the "full" identifiers (e.g. branch names) where a generated id like br collides with a branch named my-branch (due to the br in branch)
- The implementation for disambiguating CLI ids is not used uniformly for all relevant commands.
- Inefficiency - the current implementation loads the full workspace state to "discover" all CLI IDs in order to match a CLI id to an internal object - perhaps this can be implemented in a better way. (Note by @Byron: This is fast if expensive bits of information are skipped, which should be the default here)
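The br / my-branch collision listed above reads like a substring match on branch names. That is an assumption on my part, since the issue does not spell out the mechanism. Under that assumption, the difference between a loose and a strict match looks like this (both function names are hypothetical):

```python
def matches_loose(arg: str, branch_name: str) -> bool:
    """Substring match: any id that occurs inside the name is a hit."""
    return arg in branch_name


def matches_strict(arg: str, branch_name: str) -> bool:
    """Exact match: only the full branch name is a hit."""
    return arg == branch_name


# The generated id "br" occurs inside "my-branch", so the loose match
# reports a collision that the strict match does not.
loose_hit = matches_loose("br", "my-branch")
strict_hit = matches_strict("br", "my-branch")
```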
New CLI ID goals
It would be good to take a step back and re-consider how the CLI ids function and how they are implemented. The key goals would be to:
- Address known existing bugs / limitations (addressing the sub-issues)
- Improve code quality & maintainability
- Improve ergonomics for users (eg. can we do better than we do now?)
- Make this scale for doing operations on Hunks as well
- @Byron's notes: can we have user journeys that are significantly simplified through CliIDs? How 'stable' would these have to be then?
- We can consider using the commit Change IDs when deriving CLI ids for commits, thus making them stable
- Consider having a "long-form" identifier that is unique in the workspace, i.e. using more characters, and using an abbreviated form in the console output. This long-form identifier could later be used as part of the CLI json responses to make interactions more ergonomic
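One way the long-form / abbreviated idea from the last goal could work is to display the shortest prefix of each long-form ID that is still unique in the workspace. The sketch below assumes the long-form IDs are distinct strings; the function name, the minimum length of 2, and the sample IDs are all made up for illustration.

```python
def abbreviate(long_ids: list[str], min_len: int = 2) -> dict[str, str]:
    """Map each long-form id to its shortest prefix that no other id shares."""
    result = {}
    for lid in long_ids:
        n = min_len
        # Grow the prefix until no *other* id starts with it.
        while n < len(lid) and any(
            other != lid and other.startswith(lid[:n]) for other in long_ids
        ):
            n += 1
        result[lid] = lid[:n]
    return result


short = abbreviate(["kqzmwxyz", "kqabcdef", "tvrnsopl"])
```

Here the two IDs that share the prefix kq are shown with three characters, while the third keeps the two-character form.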
@Byron what do you think of this as a problem statement?
Thanks for summoning me :). I think it will help with the improvement of the current implementation. I added two notes of my own, with the main one being to know some user journeys that are significantly better thanks to CliIds. Maybe other solutions would also improve these journeys - particularly the relationship between CliIds and #11259.
got it! thanks for the edits :)
I added an item at the bottom for consideration:
Consider having a "long-form" identifier that is unique in the workspace, i.e. using more characters, and using an abbreviated form in the console output. This long-form identifier could later be used as part of the CLI json responses to make interactions more ergonomic
If I think of ways to identify something, then we have the (I like the term) canonical identifier, like the full branch names, the full file name (absolute on disk), or full commit hashes. Some of these combine, so we can name a change in a commit, or a path in a Git tree at a commit. Syntax for the latter already exists, it's git rev-parse <hex-hash>:path/to/file.
These will always be good enough to identify the resource, without ambiguity, even though they can get unwieldy so a human won't want to use most of them.
This is where CliIDs come into play. I'd argue these are not for programmatic use, but for humans. So I have my doubts these make sense in JSON at all, because that would imply that humans consume JSON. I also don't think we'd want programs to use them, as they have more failure modes. For instance, an edit to the workspace could change the CliIDs, so whatever the program thought it could use, it can't use anymore. Or, through bad luck, the CliIDs still exist but now point to a different thing in the workspace, so the operation does something unexpected.
After writing this, I really have to double down on the notion that these CliIDs are only for humans, and they are so short-lived that we can assume that, in the worst case, any mutation will invalidate them. And if that's the case, then we could even hammer this home by printing some status information with the possibly changed CliIds after each mutation, so the user doesn't get the idea to scroll further up to copy older ones.
And if I remember correctly, that's exactly what jj does.
I think a case can be made that we don't actually have good canonical identifiers that are robust. In the context of a "but status" we may want to identify and distinguish between
- a branch named "my-stuff"
- uncommitted changes in file "my-stuff"
- the changes to the file "my-stuff" within a commit etc.
Of course, we can choose to invent a language/encoding for this with prefixes and combinations - but at this point we are inventing things anyways.
I'm thinking - even in the JSON/scripting context, the reason for invoking but status would be to get perhaps an item and then manipulate it with but rub - and it feels like we could do this by providing an 8 character version of the CLI id
I think a case can be made that we don't actually have good canonical identifiers that are robust. In the context of a "but status" we may want to identify and distinguish between
- a branch named "my-stuff"
- uncommitted changes in file "my-stuff"
- the changes to the file "my-stuff" within a commit etc.
Of course, we can choose to invent a language/encoding for this with prefixes and combinations - but at this point we are inventing things anyways.
Sorry, with "canonical" I do mean those maximally unique identifiers that are very, very hard to be ambiguous. And if they are anyway, we typically know their expected type due to their argument position in anything that isn't but rub.
Short names are inherently ambiguous, but we allow them with all the issues that entails. CliIDs as I envision them aren't actually ambiguous when used right after they were seen.
I'm thinking - even in the JSON/scripting context, the reason for invoking but status would be to get perhaps an item and then manipulate it with but rub - and it feels like we could do this by providing an 8 character version of the CLI id
This is where I am not following, mainly because JSON isn't anything I'd use as a human. And if a program uses it, it can always grab a field named "id" and use it, without caring how it actually looks or how long it is.
Maybe I am missing how all this will be used later, and you have to differentiate between
- the canonical identifier
- a short name of a branch
- a repo-relative path name
- a CWD-relative path name
- CliIDs
It's a but rub problem, and it has to spend the time to figure this out. But that doesn't mean that programs wouldn't… you know, I think programs know what they want when rubbing, so either they'd use a dedicated command, or they'd want to be able to control the operation that they perform by giving a "method" name of sorts. And then the type of the input is clear which further simplifies the work to be done there.
But let's figure out where I am not on the same page, because I couldn't change my mind yet and to me it seems clear what to do.
I do agree that the IDs that we put on the JSON objects should be "full" in the sense that they are unique. I think I am mainly trying to think through whether it is better for the json API to either:
A) Use generated ids for anything that we wish to be identifiable (e.g. uncommitted files, branches, etc)
B) Use part of (or the entire) natural name of the object as identifiers (e.g. commitsha:filepath to identify a change in a commit)
My hunch is that option B) doesn't scale for the next use case we are considering - being able to perform operations for hunks as well.
Conversely, to make the case in the opposite direction: if the "ID" conformed to some format or schema, we may be able to resolve an ID to the object it identifies more quickly and efficiently (e.g. if we had a special char to indicate that an ID refers to an uncommitted file, we don't need to search commits etc.)
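As a rough sketch of that last point, a kind prefix lets the resolver look only at the relevant set of objects. The prefix letters, the ":" separator, and the `parse_id` function below are hypothetical and not an existing but format; they only show that the three "my-stuff" cases from earlier in the thread stay distinct.

```python
# Hypothetical one-letter kind prefixes.
KINDS = {
    "b": "branch",
    "c": "commit",
    "u": "uncommitted_file",
    "f": "committed_file",
}


def parse_id(raw: str) -> tuple[str, str]:
    """Split a kind-prefixed id into (kind, remainder).

    Only the first ':' separates the prefix, so the remainder may
    itself contain ':' (e.g. '<sha>:<path>' for a file in a commit).
    """
    prefix, sep, rest = raw.partition(":")
    if not sep or prefix not in KINDS or not rest:
        raise ValueError(f"not a kind-prefixed id: {raw!r}")
    return KINDS[prefix], rest


branch = parse_id("b:my-stuff")
uncommitted = parse_id("u:my-stuff")
in_commit = parse_id("f:7d72456:my-stuff")
```

With the kind known up front, an ID for an uncommitted file never needs to be checked against commits or branches. Whether a schema like this extends cleanly to hunks is still the open question from option B.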