dvc
dvc copied to clipboard
artifacts: Review naming restrictions
The current name restriction seems too limiting (and the reasons for the current restrictions are not clear cc @aguschin ) :
https://github.com/iterative/dvc/blob/c71b07b99622ab37138d6d63f611a532c1dda142/dvc/repo/artifacts.py#L20
It has come up in different places:
https://iterativeai.slack.com/archives/CB41NAL8H/p1691411376466919 https://github.com/iterative/studio/issues/6897#issuecomment-1670034865
I think we should first clarify the reasons for the restrictions and then decide on allowing additional cases
Edit: go to https://github.com/iterative/dvc/issues/9821#issuecomment-1747205082
@aguschin @omesser Do you have any more background on why we choose these limitations for artifact names? I recall trying to be compatible with git tags, but we are being more restrictive now than what git tags allow (no uppercase letters, _, ., etc.)
Couldn't find specific reasons for _ for . actually.
I'm guessing it's a restriction in the artifacts section in dvc.yaml (where these might be keys š ).
@aguschin - had a hard time digging for the reason in GTO issue history, can you take a look please and tell us what prevented underscores, dots and capital letters?
That limitation was chosen just for simplicity: better to have it limited and lift it later than to allow everything and decide that we need to remove some characters.
So technically I think we can allow extra characters that don't interfere with gto special signs (@ and # and !) and with git tags limitations. The only other reason I can think of that worth considering is readability.
@aguschin Are there any restrictions within gto now (besides git tag limitations), or is it only enforced at the dvc level?
@dberenbaum it's restricted on the GTO side AFAIK.
We can change things on GTO side as well IMO if there's no deep reason not to (cc @aguschin)
Waiting on #9770 so we can use gto as a dep here and not duplicate the regex and other logic in dvc
Next steps (Updated after further discussions in following comments):
- Immediate fixes:
- [x] Allow uppercase letters in model names in GTO
- [x] Merge PR Add A Model: Validate automated model name suggestion. It addresses the immediate bug.
- [x] Allow uppercase letters in model names in Studio BE
- [x] Allow uppercase letters in model names in Studio FE
- Later:
- [ ] Remove all naming restrictions (except for those that result in an invalid yaml). Do this in all products (DVC, Studio, GTO). Update validation rules and error messages accordingly.
- [ ] In GTO, before creating Git tag for an action, check if a Git tag with the same name (in any casing) exists. If yes, fail the action.
- [ ] In DVC, disallow having two artifacts in the same
dvc.yamlwith names that are only different in casing. - [ ] Make it clear in the UI and docs that models/dirs/stages with the same name but different casing cannot exist in the repo.
- [ ] In GTO, when creating Git tags for model actions, escape characters that arenāt allowed in Git refs. When returning the tags to the user/Studio, unescape them. Decide the escape characters and method.
- [ ] In GTO, for all model actions, run the check on a tag name before creating the tag.
Context
Discussions related to this issue are spread out in multiple GH issues and Slack threads. Iāve tried to summarize them below. Do the next steps make sense?
-
Currently, the restriction on model (artifact) name is quite strict. We are discussing how much we should relax it.
-
Issue reports and discussions:
- Reported by @mnrozhkov
- PACCAR issue
- Reported by @tibor-mach
- @dberenbaum recently added support for
_ - In the above PR, we couldnāt agree on how to handle upper case letters, even though we agreed that we should allow uppercase letters in the model names.
- Pending task from the PACCAR issue: fail fast in DVC (now it allows parsing even bad names)
- In Add A Model: Validate automated model name suggestion, @jellebouwman is adding validation for model names entered in Studio.
- Discussion to Allow uppercase letters
- Discussion to artifacts: Review naming restrictions
- Issue report: Model Registry: Misleading errors for disallowed model names
-
Questions:
- One item suggested by @shcheklein is āmake DVC fail - so that people canāt create (annotate) model with names that we donāt support in GTOā. Do we need this if we remove all restrictions? - UPDATE: based on the following comments - this isn't needed.
- Allow uppercase letters in model names in Studio and GTO
There is a PR for this here
- In GTO, when creating/working with Git tags for model actions, convert uppercase letters to lowercase
The above PR doesn't handle this. If we want it, we should probably add it in the same release so we don't have to make breaking changes.
- One item suggested by @shcheklein is āmake DVC fail - so that people canāt create (annotate) model with names that we donāt support in GTOā. Do we need this if we remove all restrictions?
No, I think we agreed here that it won't be needed in that case.
The above PR doesn't handle this. If we want it, we should probably add it in the same release so we don't have to make breaking changes.
I think if we agree that we escape symbols, I would not do any other transformations. For upper case symbols -if GH accepts is - let's push it. WDYT?
I see that GH accepts it (see gto-tag and GTO-tag in https://github.com/iterative/lstm_seq2seq/tags), but would have to check BB/GL. More importantly, I'm finding issues in git cli. Sometimes, it seems to accept a new tag with different casing, other times I get an error that the tag already exists:
$ git tag gto-tag
$ git tag GTO-tag
fatal: tag 'GTO-tag' already exists
$ git tag -l
gto-tag
I'm not sure why it's happening, but feels like this is too fragile to depend on.
I'm not sure why it's happening, but feels like this is too fragile to depend on.
It's because you are on a platform with case-insensitive filesystems (windows or macos). Git ref names are technically case sensitive, and this would work as expected on linux (where the filesystems are case sensitive). But on windows/mac, this may or may not fail, depending on the state of your git repo.
(It fails if git tries to create the files .git/refs/tags/gto-tag, .git/refs/tags/GTO-tag, since on windows/mac both of these filenames point to the same file. However, it can actually succeed on windows/mac in the event that one of those tags is stored in a packed ref instead of in the .git/refs/tags/... file)
This is only an issue when creating a tag that may overlap with an existing one though. There's not really any reason to prevent users from using capitalized letters in the tag name.
This is only an issue when creating a tag that may overlap with an existing one though. There's not really any reason to prevent users from using capitalized letters in the tag name.
My concern would be what happens when users do create an overlapping tag since gto will see them as separate artifacts but may fail to create the tag, right? It is pretty unlikely, so maybe just warning about this or catching the exception is enough.
Let's say I have a tag like My-Tag. What happens if I create the tag on a case-insensitive filesystem and then push it? Will I lose the initial casing? If I pull on linux and try to checkout My-Tag, will it fail?
My concern would be what happens when users do create an overlapping tag since gto will see them as separate artifacts but may fail to create the tag, right?
Yes, this is correct (note that "may fail" is significant since it depends on whether or not the existing tag ref is packed)
Let's say I have a tag like
My-Tag. What happens if I create the tag on a case-insensitive filesystem and then push it? Will I lose the initial casing? If I pull on linux and try to checkoutMy-Tag, will it fail?
No, git is case sensitive. Git can tell the difference between a tag named My-Tag and my-tag, and considers them to be two different tags. On a linux machine, the git client will always behave as expected.
The issue is specifically with what happens on windows/mac. In the event that you have an unpacked/loose ref file named .git/tags/My-Tag, you cannot create a new tag named .git/tags/my-tag (since windows/mac filesystems will report that the file already exists). Git tag lookups are still case sensitive though, so in the case where you have both a set of packed refs containing the tag My-Tag, and a loose ref file named .git/tags/my-tag, Git will checkout the correct one if you do git checkout My-Tag.
Because of this weirdness, simplified best-practice for handling git refs in a cross-platform project is to just always use lowercase for branch/tag names.
However, since GTO tags can include actual directory paths, if you wanted to enforce a lowercase-only rule for GTO tags, you would have to extend the rule to require that users only store artifacts in directories with lowercase paths (unless the user only cares about windows/mac).
Given a GTO tag like path/to:artifact#v1.0.0, and an artifact entry stored in Path/To/dvc.yaml:
- This will work on windows/mac since
open('path/to/dvc.yaml')will succeed (path/to/andPath/To/are the same dir) - This will not work on linux since
open('path/to/dvc.yaml')will fail saying the file does not exist (path/to/andPath/To/are not the same dir)
Also keep in mind that this gets even worse taking into consideration that directory paths can contain non-ascii characters (i.e. latin characters with accents, Cyrillic, CJK, etc) , and how git stores those characters is also dependent on platform + system encoding settings.
i.e. given something like cafƩ/dvc.yaml, on windows/mac/linux git may or may not encode cafƩ differently in all 3 cases.
Considering the above comments, would this be better?
- do not change the casing
- before creating Git tag for an action,
- check if a Git tag with the same name (in any casing) exists
- if yes, fail the action (coz the Git tag creation might fail depending on the user's platform) This is basically a restriction that models/dirs/stages with the same name but different casing cannot not exist in the repo. We should make this restriction clear in the UI and docs.
I think that makes sense @tapadipti. Can we try to fail early and disallow having two artifacts in the same dvc.yaml with names that are only different in casing?
Edit: also, in case it's unclear, this seems like an unlikely scenario, so I don't think it should block us moving forward with allowing uppercase.
Discussed this with @dberenbaum a short while ago.
Can we try to fail early and disallow having two artifacts in the same dvc.yaml with names that are only different in casing?
- [ ] We should implement this change in dvc.
- [ ] It's good to still add the check in gto also. (For the rare case when 2
dvc.yamlfiles have the same path but different casing, and contain models with the same name.)
I don't think it should block us moving forward with allowing uppercase.
Allowing uppercase is already done.
I've updated the next steps in this comment. ptal.
Let's close this one and track there.
- In GTO, when creating Git tags for model actions, escape characters that arenāt allowed in Git refs. When returning the tags to the user/Studio, unescape them. Decide the escape characters and method.
Can we do this at the same time we remove all naming restrictions? Otherwise, I think it could introduce new errors and be a breaking change when we implement escape characters.
Also, we only need to escape characters disallowed in Git refs, but it's possible it will be simpler to take some existing escaping library and escape all characters that it handles even if many of them would be valid Git refs.
Let's close this one and track there.
The comment with tasks is in this issue itself. So I'm re-opening the issue.
In GTO, when creating Git tags for model actions, escape characters that arenāt allowed in Git refs. When returning the tags to the user/Studio, unescape them. Decide the escape characters and method.
Can we do this at the same time we remove all naming restrictions?
š
it's possible it will be simpler to take some existing escaping library and escape all characters that it handles even if many of them would be valid Git ref
Agreed. I'll search for what library to use. If you have any suggestions, pls share.
@shcheklein mentioned URL escaping, and we could use urllib.parse.quote for that.
@shcheklein You mentioned that GTO "created and pushed the tag before actually running a check on the tag name". I'm looking into the code and I can see that the assert_fullname_is_valid() function is called on the full model name before create_tag() is called. Can you pls share which check you were referring to?
UPDATE: Discussed this with @shcheklein. I tried the register action. The issue could be in some other actions (eg, deprecate).
Hello! GTO got upgraded, DVC is using the latest GTO version, and Studio BE depends on the latest DVC version. https://github.com/iterative/studio/issues/7903 incorporates the new validation rules when adding models on the Studio FE. FE changes will be released today.
@tapadipti Do you plan to start on the "Later" items in https://github.com/iterative/dvc/issues/9821#issuecomment-1747205082? Is there anything else needed before that?
Do you plan to start on the "Later" items in https://github.com/iterative/dvc/issues/9821#issuecomment-1747205082? Is there anything else needed before that?
I started working on one of the items in GTO (here), and will gradually pick the others. But probably not the DVC/Studio tasks.
Is this still p1? @tapadipti is it still in your plans?
Lowering the priority of the remaining items