rfcs icon indicating copy to clipboard operation
rfcs copied to clipboard

Replace `resolved` field by `hash`

Open doxiaodong opened this issue 8 years ago • 24 comments

https://github.com/yarnpkg/yarn/issues/3330

doxiaodong avatar May 09 '17 05:05 doxiaodong

We may want to split the discussion into several RFCs. Feel free to spawn your own version

bestander avatar May 11 '17 17:05 bestander

In my company we had a bit different issue. It's a bit complicated ;)

We've got our own registry that stores our own private scoped packages and also acts as a proxy to the public npm/yarn registry for public packages. The problem is that developers could configure yarn in 2 ways:

  1. use our own registry for all packages
  2. use public npm/yarn registry for public packages and our private registry for our private scoped packages (this setting is useful if someone publishes public packages to public npm registry)

Then we noticed that this causes a mess in yarn.lock files, because after upgrading public packages, some developers had resolved urls pointing to the public registry and some to our private one. The second setting is better for our developers, but we'd prefer that our CI always used our own registry... With urls in yarn.lock file, it's not really possible.

szimek avatar May 12 '17 05:05 szimek

I think storing hash instead of the whole url is good for keeping yarn's logic simple & robust. Caching url may gain some performance benefit, but that forces yarn to deal with the situation that different registries exists in one project because that may happen.

Supporting multiple registries in one project is a huge challenge, considering different use cases in different team. I don't think yarn is ready to face it when yarn includes the field resolved in yarn.lock file.

IMP, we should abandon this feature (storing url including registry for each package) first. Then reconsider the use cases & support it in a more reasonable way. Of course keeping compatibility will be the hardest part. Maybe there have to be a breaking change.

Just for more discussion:)

nighca avatar May 12 '17 08:05 nighca

In my company we had a bit different issue. It's a bit complicated ;)

@szimek we have a similar setup at our company.

We have two internal registries, once that is purely a mirror of the public npm registry, and a second registry that contains just our company's scope, @mycompany (as an example).

We then have developers setup the following in their .npmrc file:

registry=http://artifactory.example.com/artifactory/api/npm/npm
@mycompany:registry=http://artifactory.example.com/artifactory/api/npm/npm-mycompany/

We decided against a virtual registry built on top of the other two (We are not sure whether we will ever create the single virtual registry).

because after upgrading public packages, some developers had resolved urls pointing to the public registry and some to our private one.

This happens at our company as well. We don't force developers to use the registries I listed above (except for the scoped registry, but that's the only place those are published).

but we'd prefer that our CI always used our own registry

Eventually we plan on taking the Facebook approach and isolating our CI environment from the internet, which means that any external references will fail those CI jobs.

ghost avatar May 12 '17 16:05 ghost

In terms of complexity I would not sign up on this drastic change yet. There are many hidden drawbacks not listed in this RFC and code complexity and compatibility need to be considered.

I suggest focusing on specific use cases and find a way to address them with the existing structure instead of starting a significant revamp.

bestander avatar May 18 '17 16:05 bestander

Nay.

Changing resolved to hash will break backwards-compatibility.

@doxiaodong As to your issue - registry mirror config may be overridden by resolved in yarn.lock, it be addressed and improved by yarn implementation. I think changes to RFC may not be necessary.

OpenGG avatar Jun 02 '17 05:06 OpenGG

registry mirror config may be overridden by resolved in yarn.lock

@OpenGG are you saying that we should manually replace registry URL paths in yarn.lock?

ghost avatar Jun 02 '17 16:06 ghost

@destroyerofbuilds No. That's a temporary hack with current yarn implementation.

NPM@5 introduces interesting changes related to this issue, I suggest we all give it a try before jumping to any conclusion.

If you generated your package lock against registry A, and you switch to registry B, npm will now try to install the packages from registry B, instead of A. If you want to use different registries for different packages, use scope-specific registries (npm config set @myscope:registry=https://myownregist.ry/packages/). Different registries for different unscoped packages are not supported anymore. http://blog.npmjs.org/post/161081169345/v500

OpenGG avatar Jun 02 '17 17:06 OpenGG

Yep, that is what we need to do

bestander avatar Jun 02 '17 17:06 bestander

If you generated your package lock against registry A, and you switch to registry B, npm will now try to install the packages from registry B, instead of A.

Should that be pulled out into a separate RFC?

Lastly, how does the npm approach address item 2 in this RFC - the current approach leads to developers leaking their internal artifact repository sites to the public internet via yarn.lock if they have their company's artifact repository configured in a .npmrc or .yarnrc file.

ghost avatar Jun 02 '17 17:06 ghost

  1. I would go ahead and start a new RFC.

  2. How about still using public URLs in yarn.lock but if you have a private repository in the config the actual requests will have a URL replaced during runtime?

On 2 June 2017 at 18:12, Hutson Betts [email protected] wrote:

If you generated your package lock against registry A, and you switch to registry B, npm will now try to install the packages from registry B, instead of A.

Should that be pulled out into a separate RFC?

Lastly, how does the npm approach address item 2 in the RFC - the current approach leads to developers leaking their internal artifact repository sites to the public internet via yarn.lock if they have their company's artifact repository configured in a .npmrc or .yarnrc file.

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/yarnpkg/rfcs/pull/64#issuecomment-305854749, or mute the thread https://github.com/notifications/unsubscribe-auth/ACBdWMsAtLbSYcEla7kuxUYXHX58RA9uks5sAEKUgaJpZM4NU2W3 .

bestander avatar Jun 02 '17 17:06 bestander

:1. I would go ahead and start a new RFC.

:+1: I like the idea of a small, targeted, RFC that helps to bring Yarn in-line with npm behavior that, honestly, seems quite reasonable.

How about still using public URLs in yarn.lock but if you have a private repository in the config the actual requests will have a URL replaced during runtime?

Please bare with me as I verbalize this a little to see if we're on the same page.

I have registry=http://artifactory.example.com/artifactory/api/npm/npm in my .yarnrc configuration file that points to my company's super secret internal artifact repository.

However, when Yarn generates the yarn.lock file, Yarn uses the default yarn registry URL?

Later, when I run yarn, Yarn extracts the registry configuration option from .yarnrc, and re-writes the registry URL using my configuration override?

So technically the fetches happen against the registry URL I've specified, even though the yarn.lock specifies other URLs?

That behavior, if I'm understanding it correctly, lines up with the behavior of npm mentioned by @OpenGG.

That seems intuitive, since npm@5 does most of that already, and the aforementioned new RFC would bring Yarn inline with that behavior already.

The only divergence is that npm@5 writes out the override from .npmrc to the package-lock.json file.

Therefore, the only change would be that the override in .yarnrc would never be applied to the default when writing out unscoped packages to yarn.lock. I assume, too, that would apply to scopes?

ghost avatar Jun 02 '17 17:06 ghost

Yes, that seems correct, thanks a lot for helping out!

On 2 June 2017 at 18:29, Hutson Betts [email protected] wrote:

:1. I would go ahead and start a new RFC.

👍 I like the idea of a small, targeted, RFC that helps to bring Yarn in-line with npm behavior that, honestly, seems quite reasonable.

How about still using public URLs in yarn.lock but if you have a private repository in the config the actual requests will have a URL replaced during runtime?

Please bare with me as I verbalize this a little to see if we're on the same page.

I have registry=http://artifactory.example.com/artifactory/api/npm/npm in my .yarnrc configuration file that points to my company's super secret internal artifact repository.

However, when Yarn generates the yarn.lock file, Yarn uses the default yarn registry URL?

Later, when I run yarn, Yarn extracts the registry configuration option from .yarnrc, and re-writes the registry URL using my configuration override?

So technically the fetches happen against the registry URL I've specified, even though the yarn.lock specifies other URLs?

That behavior, if I'm understanding it correctly, lines up with the behavior of npm mentioned by @OpenGG https://github.com/opengg.

That seems intuitive, since npm@5 does most of that already, and the aforementioned new RFC would bring Yarn inline with that behavior already.

The only divergence is that npm@5 writes out the override from .npmrc to the package-lock.json file.

Therefore, the only change would be that the override in .yarnrc would never be applied to the default when writing out unscoped packages to yarn.lock. I assume, too, that would apply to scopes?

— You are receiving this because you were mentioned. Reply to this email directly, view it on GitHub https://github.com/yarnpkg/rfcs/pull/64#issuecomment-305858794, or mute the thread https://github.com/notifications/unsubscribe-auth/ACBdWNV4iD7Q_pUeqmq3LdXDTbvSSv0eks5sAEaIgaJpZM4NU2W3 .

bestander avatar Jun 02 '17 17:06 bestander

And thus how do we support the use of multiple registry then? For example for packages deployed on a tertiary public repository?

One of the important point of the RFC was that we want the yarn.lock to contain coherent information, for example for community users of a company developed application. Example:

  • if a package from the public yarn/npm registry is installed normally, its resolved field would contain the yarnpkg url (which could be overridden at runtime as discussed by @destroyerofbuilds just above)
  • but if a package is available only from a tertiary repository, you don't want to have the yarnpkg url in the yarn.lock, it doesn't make any sense, since the package does not exists on npm/yarnpkg.

So how do you differentiate them? I see some ideas:

  • using scoped registry as with npm (@myscope:registry=https://myownregist.ry/packages/) will make the registry be written in the yarn.lock, but then it would be not so coherent with the way the registry setting, as detailed by @destroyerofbuilds above, would work. So it's not a good solution IMHO.
  • using a parameter to the add command and/or in the version field of the package.json (something like version@url for example, in a way it is a bit similar to using urls directly or github url, but with yarn resolution involved in the process)
  • having another configuration in the user settings file to map some package and/or scope to a specific repository, to be used as-is in the lock file.

For the record, this is how maven tackles this situation (and maven has a lot of this kind of use with multiple repository containing different packages):

  • you only record the name and version of a package to identify it (and there is no lock file because there is only fixed exact versions used)
  • you have a set of repository configured, and they are all queried for the package until one is found (so it means there is not hash or whatever to ensure the downloaded package is actually the same as before, this is often a problem with these use case with maven!)
  • you also have a mechanism to specify mirrors: you map a repository (by its id, because repositories have id, but it would be the same to refer to it via its url) to a mirror, and maven will replace at runtime the url of the first by the second (a bit like the currently discussed mechanism by @destroyerofbuilds above)

If we were to make a parallel to the present discussion:

  • we have repository urls written in the lock file, so we know for each package what is its original canonical repository as well as its hash
  • we would need a way to map a repository (expected to be present in a lock file) to another one: the use case is mirrors
  • we would need a way to specific a repository for a package or a set of packages (scope): the use case is tertiary repositories

This is just some food for thought for whoever write another RFC on the matter, not a fully thought solution :)

victornoel avatar Jun 04 '17 09:06 victornoel

great points, @victornoel!

bestander avatar Jun 15 '17 02:06 bestander

This discussion is a bit fractured (here, here, here, and here), so I've attempted to summarize my interpretation of it here.

Purpose of the resolved field

As far as I can tell, this field serves two purposes:

  • caching the URL for the package, to avoid having to calculate it again
  • caching the exact version of the package (i.e. the hash) as reported by the registry

We still need the latter functionality of course. It's not clear to me whether we need the URL or not. There was a comment here suggesting we could recalculate the URL at runtime, but keeping the cached URL may have performance advantages.

Problems with the current resolved field

  1. The registry in the lockfile doesn't always match the registry currently in use.
    • This results in changes to the lockfile after an install
    • This forces Yarn to resolve that version all over again, which slows down the installation process and could produce an inconsistent result
  2. It's not clear which registry should be used
    • In most cases, the registry specified by the current user's registry setting should be used, falling back to the default registry
    • In some cases, packages should be downloaded from alternate registries (e.g. private packages)
  3. The lockfile shows which registry was used by the user or system that generated it
    • This is misleading and confusing
    • This could leak information that users would prefer not to share

npm's solution

  • They effectively ignores the registry specified in the resolved field. This solves the 1st problem.
  • They use the registry configured by the current user, falling back to the default registry. They also allow you to use alternate registries for specific scopes. This solves the 2nd problem
  • The 3rd problem remains unaddressed.

Yarn's solution

In the short term (i.e. pre-v2.0), the approach taken by npm seems like the best route to take. PR #84 addresses that, but I think we should adjust it to match npm's solution more closely. I might put up a separate PR to discuss that further.

If we're willing to consider making a breaking change, then we should remove the registry from the resolved field. Removing it would simplify and shorten the lockfile, eliminate the potential for confusion, and and avoid leaking potentially secret information. It would solve the 3rd problem mentioned above.

I can't think of any downsides to doing this, aside from the fact that it would be a breaking change. There are breaking changes to the lockfile being considered anyway (see here). If those changes go ahead, it would be worthwhile making this change as well. If not, then I'm not sure it's worth the effort.

Gudahtt avatar Aug 19 '18 23:08 Gudahtt

After further consideration, adopting the behavior of npm v5 might be a breaking change after all (as mentioned here).

What I had in mind was for Yarn to do as npm v5 does, which is ignore the registry specified in the resolved field. The registry used would be the one specified in .yarnrc, or the default registry if that was not set.

Yarn already supports custom registries and even scoped registries via .yarnrc. If they were used over the yarn.lock entries, our problem would be solved. That would be a lot simpler to use than the proposed solution, which is to add an additional override-registries config setting.

Unfortunately, it is possible that people are relying upon the current behavior. The most common situation in which you'd want to use an alternate registry for a select group of packages is for internal packages, which should be scoped under an organization name. Scoped registries address that use case quite well. But if anybody out there is relying upon alternate registries for non-scoped packages, that behavior would break if we were to make .yarnrc registry override the registry in the resolved field.

That is, Yarn would have to drop support for alternate registries for non-scoped packages, which are currently supported.

This situation is a bit muddy, because I don't know that support for alternate registries for non-scoped packages was intended, or whether it works that way incidentally. I also don't know if anybody would care if that behavior was lost, whereas it is abundantly clear that a resolution to this problem would be appreciated.

Gudahtt avatar Aug 20 '18 04:08 Gudahtt

If we're willing to consider making a breaking change, then we should remove the registry from the resolved field. Removing it would simplify and shorten the lockfile, eliminate the potential for confusion, and and avoid leaking potentially secret information. It would solve the 3rd problem mentioned above.

Yes! That would be great. It's actually already planned for the 2.0, but noone started working on it yet: https://github.com/yarnpkg/yarn/issues/5892

Scoped registries address that use case quite well. But if anybody out there is relying upon alternate registries for non-scoped packages, that behavior would break if we were to make .yarnrc registry override the registry in the resolved field.

For scoped packages, I think the expectation is that those users would just have to share the same configuration. They can easily do this by checking in a yarnrc in the repository that would only contain the scrope registries.

For non-scoped packages, it never was supported in the first place (after all the lockfile has a big "DO NOT EDIT" notice at the very top) and I don't think we should try to support it. Especially since this feature is scheduled to be released with a major version bump.

arcanis avatar Aug 20 '18 05:08 arcanis

For non-scoped packages, it never was supported in the first place (after all the lockfile has a big "DO NOT EDIT" notice at the very top) and I don't think we should try to support it.

Fantastic

Edit: Though... this could be accomplished without editing the lockfile. Installing a package, changing the registry, then installing another package could result in a lockfile with alternate registries for non-scoped packages.

That seems like a bit of an edge case though.

Especially since this feature is scheduled to be released with a major version bump.

Removing the registry from the lockfile is scheduled for then. I was hoping we could let the registry in .yarnrc override the one in the lockfile sooner than that, as a non-breaking change.

Gudahtt avatar Aug 20 '18 11:08 Gudahtt

I was hoping we could let the registry in .yarnrc override the one in the lockfile sooner than that, as a non-breaking change.

We already have kinda-breaking changes in master (the new integrity field which, without being breaking per-se, will cause a bunch of changes into most lockfiles), so I think it would be better for the next release to be the 2.0, ideally.

arcanis avatar Aug 20 '18 13:08 arcanis

it would be better for the next release to be the 2.0, ideally.

Well, that's exciting! That leaves us with little time to update these RFCs. I've just put up a PR to rewrite the existing PR to more closely match npm@5's behavior: #99 .

As for this PR... I did find a downside to removing the registry from the resolved field. Or rather, I looked into why npm chose to preserve it.

They are using the resolved field to cache the tarball URL. The registry is an important part of that cache, because different registries might server the same tarball from different paths. The field is no longer useful as a cache without the registry.

It looks like npm requests metadata from the registry if the registry doesn't match the resolved field, meaning there's an extra round-trip involved in downloading the package. Extra round-trips are not good.

What is the impact to Yarn if we stop using resolved as a cache? Is that information cached somewhere else already perhaps? If not, perhaps we should make sure it is before removing it from the lockfile.

Gudahtt avatar Aug 21 '18 02:08 Gudahtt

Hmm I'm not entirely sure I follow - we currently have these informations: registry configuration + resolved field (registry + url). Assuming the registries from the resolved fields should always match the registry configuration (or be updated to match it), then this information is strictly redundant, and even if we were to use it as a cache key (I don't think we do, iirc the cache key is based on the package name + package version) we would be able to reconstruct it at runtime.

Does that make sense?

arcanis avatar Aug 21 '18 06:08 arcanis

After tracing the install process taken by Yarn, it looks like it does use that field as a cache. I can't see how it would construct it without asking the registry for metadata.

Here is an example of the URL we're discussing: https://registry.yarnpkg.com/@gulp-sourcemaps/map-sources/-/map-sources-1.0.0.tgz. This is stored in the resolved field, along with a hash.

In the resolve step of an installation, Yarn will collect a manifest for each dependency being requested. This manifest always includes a remote field, which itself has a reference field. This is the same URL as the one stored in the resolved field, and is later used by the fetcher during the fetch step.

The resolve step constructs a PackageResolver, which creates a PackageRequest for each dependency requested. This PackageRequest ends up constructing a registry resolver to gather the information needed to download the package from a registry (at least for registry requests). If the lockfile exists, the information stored there is used. Otherwise, a request is sent to the registry to obtain the package metadata, including the URL in question.

That is the round-trip that I am concerned we might be adding to each installation if we remove the field altogether, without caching that information elsewhere.

Edit: I should add that conceptually I still agree that the field should be removed. But assuming I'm right about there being a performance impact, we would have to accept that cost.

We could cache that information somewhere else instead; either as part of this change, or at a later date.

Gudahtt avatar Aug 21 '18 12:08 Gudahtt

I've done a bit more investigation on the option of generating the tarball URL using the standard format (${registry}${packageNameWithScope}/-/${packageNameWithoutScope}-${packageVersion}.tgz).

It looks like this approach was taken by pnpm, but it led to bugs. Apparently, not all tarballs are served from this URL. Some tarballs have non-standard URLs for some reason (e.g. https://registry.npmjs.org/esprima-fb/-/esprima-fb-3001.0001.0000-dev-harmony-fb.tgz), and others are served from different paths altogether (as demonstrated here). Apparently private registries such as npm Enterprise use different URL path structures.

The solution taken by the pnpm team was to only cache the tarball URL if it differs from the standard format (details here). That solution works fine for most cases, if you don't switch registries. If you do, then you could end up trying to use a generated tarball URL to download from a registry that uses an alternate tarball path.

Ultimately, I don't see any straightforward way to remove the tarball URL from the lockfile without hurting performance. It serves to make the installation process significantly faster if your registry matches the registry used in the cached (which it does for most users, most of the time).

At best, we could omit the tarball URL in certain known cases (i.e. known registries like registry.yarnpkg.com or registry.npmjs.org) can calculate them instead, but only if we had a strategy for dealing with the anomalous cases. That would mean hard-coding behaviors for specific registries, and perhaps letting the user opt-in to using auto-generated URLs (e.g. an artifactory user could ask for URLs to be generated just like they are for npm Enterprise). None of those seem like great options.

Gudahtt avatar Aug 22 '18 01:08 Gudahtt