dvc icon indicating copy to clipboard operation
dvc copied to clipboard

artifacts: Review naming restrictions

Open daavoo opened this issue 2 years ago • 30 comments

The current name restriction seems too limiting (and the reasons for the current restrictions are not clear cc @aguschin ) :

https://github.com/iterative/dvc/blob/c71b07b99622ab37138d6d63f611a532c1dda142/dvc/repo/artifacts.py#L20

It has come up in different places:

https://iterativeai.slack.com/archives/CB41NAL8H/p1691411376466919 https://github.com/iterative/studio/issues/6897#issuecomment-1670034865

I think we should first clarify the reasons for the restrictions and then decide on allowing additional cases


Edit: go to https://github.com/iterative/dvc/issues/9821#issuecomment-1747205082

daavoo avatar Aug 08 '23 17:08 daavoo

@aguschin @omesser Do you have any more background on why we choose these limitations for artifact names? I recall trying to be compatible with git tags, but we are being more restrictive now than what git tags allow (no uppercase letters, _, ., etc.)

dberenbaum avatar Aug 08 '23 17:08 dberenbaum

Couldn't find specific reasons for _ for . actually. I'm guessing it's a restriction in the artifacts section in dvc.yaml (where these might be keys šŸ˜„ ). @aguschin - had a hard time digging for the reason in GTO issue history, can you take a look please and tell us what prevented underscores, dots and capital letters?

omesser avatar Aug 08 '23 22:08 omesser

That limitation was chosen just for simplicity: better to have it limited and lift it later than to allow everything and decide that we need to remove some characters.

So technically I think we can allow extra characters that don't interfere with gto special signs (@ and # and !) and with git tags limitations. The only other reason I can think of that worth considering is readability.

aguschin avatar Aug 09 '23 04:08 aguschin

@aguschin Are there any restrictions within gto now (besides git tag limitations), or is it only enforced at the dvc level?

dberenbaum avatar Aug 10 '23 18:08 dberenbaum

@dberenbaum it's restricted on the GTO side AFAIK.

shcheklein avatar Aug 10 '23 18:08 shcheklein

We can change things on GTO side as well IMO if there's no deep reason not to (cc @aguschin)

omesser avatar Aug 11 '23 14:08 omesser

Waiting on #9770 so we can use gto as a dep here and not duplicate the regex and other logic in dvc

dberenbaum avatar Sep 08 '23 20:09 dberenbaum

Next steps (Updated after further discussions in following comments):

  • Immediate fixes:
  • Later:
    • [ ] Remove all naming restrictions (except for those that result in an invalid yaml). Do this in all products (DVC, Studio, GTO). Update validation rules and error messages accordingly.
    • [ ] In GTO, before creating Git tag for an action, check if a Git tag with the same name (in any casing) exists. If yes, fail the action.
    • [ ] In DVC, disallow having two artifacts in the same dvc.yaml with names that are only different in casing.
    • [ ] Make it clear in the UI and docs that models/dirs/stages with the same name but different casing cannot exist in the repo.
    • [ ] In GTO, when creating Git tags for model actions, escape characters that aren’t allowed in Git refs. When returning the tags to the user/Studio, unescape them. Decide the escape characters and method.
    • [ ] In GTO, for all model actions, run the check on a tag name before creating the tag.
Context

Discussions related to this issue are spread out in multiple GH issues and Slack threads. I’ve tried to summarize them below. Do the next steps make sense?

tapadipti avatar Oct 04 '23 15:10 tapadipti

  • Allow uppercase letters in model names in Studio and GTO

There is a PR for this here

  • In GTO, when creating/working with Git tags for model actions, convert uppercase letters to lowercase

The above PR doesn't handle this. If we want it, we should probably add it in the same release so we don't have to make breaking changes.

  • One item suggested by @shcheklein is ā€œmake DVC fail - so that people can’t create (annotate) model with names that we don’t support in GTOā€. Do we need this if we remove all restrictions?

No, I think we agreed here that it won't be needed in that case.

dberenbaum avatar Oct 04 '23 18:10 dberenbaum

The above PR doesn't handle this. If we want it, we should probably add it in the same release so we don't have to make breaking changes.

I think if we agree that we escape symbols, I would not do any other transformations. For upper case symbols -if GH accepts is - let's push it. WDYT?

shcheklein avatar Oct 04 '23 18:10 shcheklein

I see that GH accepts it (see gto-tag and GTO-tag in https://github.com/iterative/lstm_seq2seq/tags), but would have to check BB/GL. More importantly, I'm finding issues in git cli. Sometimes, it seems to accept a new tag with different casing, other times I get an error that the tag already exists:

$ git tag gto-tag
$ git tag GTO-tag
fatal: tag 'GTO-tag' already exists
$ git tag -l
gto-tag

I'm not sure why it's happening, but feels like this is too fragile to depend on.

dberenbaum avatar Oct 04 '23 19:10 dberenbaum

I'm not sure why it's happening, but feels like this is too fragile to depend on.

It's because you are on a platform with case-insensitive filesystems (windows or macos). Git ref names are technically case sensitive, and this would work as expected on linux (where the filesystems are case sensitive). But on windows/mac, this may or may not fail, depending on the state of your git repo.

(It fails if git tries to create the files .git/refs/tags/gto-tag, .git/refs/tags/GTO-tag, since on windows/mac both of these filenames point to the same file. However, it can actually succeed on windows/mac in the event that one of those tags is stored in a packed ref instead of in the .git/refs/tags/... file)


This is only an issue when creating a tag that may overlap with an existing one though. There's not really any reason to prevent users from using capitalized letters in the tag name.

pmrowla avatar Oct 05 '23 00:10 pmrowla

This is only an issue when creating a tag that may overlap with an existing one though. There's not really any reason to prevent users from using capitalized letters in the tag name.

My concern would be what happens when users do create an overlapping tag since gto will see them as separate artifacts but may fail to create the tag, right? It is pretty unlikely, so maybe just warning about this or catching the exception is enough.

dberenbaum avatar Oct 05 '23 01:10 dberenbaum

Let's say I have a tag like My-Tag. What happens if I create the tag on a case-insensitive filesystem and then push it? Will I lose the initial casing? If I pull on linux and try to checkout My-Tag, will it fail?

dberenbaum avatar Oct 05 '23 01:10 dberenbaum

My concern would be what happens when users do create an overlapping tag since gto will see them as separate artifacts but may fail to create the tag, right?

Yes, this is correct (note that "may fail" is significant since it depends on whether or not the existing tag ref is packed)

Let's say I have a tag like My-Tag. What happens if I create the tag on a case-insensitive filesystem and then push it? Will I lose the initial casing? If I pull on linux and try to checkout My-Tag, will it fail?

No, git is case sensitive. Git can tell the difference between a tag named My-Tag and my-tag, and considers them to be two different tags. On a linux machine, the git client will always behave as expected.

The issue is specifically with what happens on windows/mac. In the event that you have an unpacked/loose ref file named .git/tags/My-Tag, you cannot create a new tag named .git/tags/my-tag (since windows/mac filesystems will report that the file already exists). Git tag lookups are still case sensitive though, so in the case where you have both a set of packed refs containing the tag My-Tag, and a loose ref file named .git/tags/my-tag, Git will checkout the correct one if you do git checkout My-Tag.

pmrowla avatar Oct 05 '23 01:10 pmrowla

Because of this weirdness, simplified best-practice for handling git refs in a cross-platform project is to just always use lowercase for branch/tag names.

However, since GTO tags can include actual directory paths, if you wanted to enforce a lowercase-only rule for GTO tags, you would have to extend the rule to require that users only store artifacts in directories with lowercase paths (unless the user only cares about windows/mac).

Given a GTO tag like path/to:artifact#v1.0.0, and an artifact entry stored in Path/To/dvc.yaml:

  • This will work on windows/mac since open('path/to/dvc.yaml') will succeed (path/to/ and Path/To/ are the same dir)
  • This will not work on linux since open('path/to/dvc.yaml') will fail saying the file does not exist (path/to/ and Path/To/ are not the same dir)

Also keep in mind that this gets even worse taking into consideration that directory paths can contain non-ascii characters (i.e. latin characters with accents, Cyrillic, CJK, etc) , and how git stores those characters is also dependent on platform + system encoding settings.

i.e. given something like cafƩ/dvc.yaml, on windows/mac/linux git may or may not encode cafƩ differently in all 3 cases.

pmrowla avatar Oct 05 '23 01:10 pmrowla

Considering the above comments, would this be better?

  • do not change the casing
  • before creating Git tag for an action,
    • check if a Git tag with the same name (in any casing) exists
    • if yes, fail the action (coz the Git tag creation might fail depending on the user's platform) This is basically a restriction that models/dirs/stages with the same name but different casing cannot not exist in the repo. We should make this restriction clear in the UI and docs.

tapadipti avatar Oct 05 '23 03:10 tapadipti

I think that makes sense @tapadipti. Can we try to fail early and disallow having two artifacts in the same dvc.yaml with names that are only different in casing?

Edit: also, in case it's unclear, this seems like an unlikely scenario, so I don't think it should block us moving forward with allowing uppercase.

dberenbaum avatar Oct 05 '23 11:10 dberenbaum

Discussed this with @dberenbaum a short while ago.

Can we try to fail early and disallow having two artifacts in the same dvc.yaml with names that are only different in casing?

  • [ ] We should implement this change in dvc.
  • [ ] It's good to still add the check in gto also. (For the rare case when 2 dvc.yaml files have the same path but different casing, and contain models with the same name.)

I don't think it should block us moving forward with allowing uppercase.

Allowing uppercase is already done.

tapadipti avatar Oct 05 '23 14:10 tapadipti

I've updated the next steps in this comment. ptal.

tapadipti avatar Oct 05 '23 15:10 tapadipti

Let's close this one and track there.

dberenbaum avatar Oct 05 '23 18:10 dberenbaum

Can we do this at the same time we remove all naming restrictions? Otherwise, I think it could introduce new errors and be a breaking change when we implement escape characters.

Also, we only need to escape characters disallowed in Git refs, but it's possible it will be simpler to take some existing escaping library and escape all characters that it handles even if many of them would be valid Git refs.

dberenbaum avatar Oct 05 '23 18:10 dberenbaum

Let's close this one and track there.

The comment with tasks is in this issue itself. So I'm re-opening the issue.

In GTO, when creating Git tags for model actions, escape characters that aren’t allowed in Git refs. When returning the tags to the user/Studio, unescape them. Decide the escape characters and method.

Can we do this at the same time we remove all naming restrictions?

šŸ‘

it's possible it will be simpler to take some existing escaping library and escape all characters that it handles even if many of them would be valid Git ref

Agreed. I'll search for what library to use. If you have any suggestions, pls share.

tapadipti avatar Oct 06 '23 03:10 tapadipti

@shcheklein mentioned URL escaping, and we could use urllib.parse.quote for that.

dberenbaum avatar Oct 06 '23 12:10 dberenbaum

@shcheklein You mentioned that GTO "created and pushed the tag before actually running a check on the tag name". I'm looking into the code and I can see that the assert_fullname_is_valid() function is called on the full model name before create_tag() is called. Can you pls share which check you were referring to?

UPDATE: Discussed this with @shcheklein. I tried the register action. The issue could be in some other actions (eg, deprecate).

tapadipti avatar Oct 10 '23 11:10 tapadipti

Hello! GTO got upgraded, DVC is using the latest GTO version, and Studio BE depends on the latest DVC version. https://github.com/iterative/studio/issues/7903 incorporates the new validation rules when adding models on the Studio FE. FE changes will be released today.

jellebouwman avatar Oct 12 '23 12:10 jellebouwman

@tapadipti Do you plan to start on the "Later" items in https://github.com/iterative/dvc/issues/9821#issuecomment-1747205082? Is there anything else needed before that?

dberenbaum avatar Oct 12 '23 12:10 dberenbaum

Do you plan to start on the "Later" items in https://github.com/iterative/dvc/issues/9821#issuecomment-1747205082? Is there anything else needed before that?

I started working on one of the items in GTO (here), and will gradually pick the others. But probably not the DVC/Studio tasks.

tapadipti avatar Oct 12 '23 13:10 tapadipti

Is this still p1? @tapadipti is it still in your plans?

dberenbaum avatar Nov 27 '23 14:11 dberenbaum

Lowering the priority of the remaining items

dberenbaum avatar Dec 12 '23 12:12 dberenbaum