arcade icon indicating copy to clipboard operation
arcade copied to clipboard

Configure repository cloaking

Open premun opened this issue 3 years ago • 11 comments

Context

As part of the tarball generation process, we are removing files such as binaries or non-required files under a license. We need to either cloak these files via source-mappings.json or bake this step into the VMR assembly process so that these files are in place for source build to work.

Goal

For this step of the tarball generation process, we need to find a counterpart in the VMR and decide when it will happen, then implement it.

premun avatar Aug 03 '22 13:08 premun

Question

If we remove things such as .dll from the code like we already do for the tarball (https://github.com/dotnet/installer/blob/8f639696e6d57fb09e03e89c6397d913de1231ed/src/SourceBuild/Arcade/tools/SourceBuildArcadeTarball.targets#L246-L265), we might break repo's ability to run tests in the VMR. Should we then really do it or just keep this step?

premun avatar Aug 08 '22 08:08 premun

I tried to set up cloaking for the VMR, adhering to the current removal rules from current tarball generation which goes as follows:

  1. Prepare a list of files marked for removal by listing all .dll, .Dll, .exe, .pdb, .mdb, .zip, .nupkg files
  2. From the list, remove all under these paths:
    • runtime\src\coreclr\.nuget\
    • runtime*\src\installer\pkg

This effectively preserves two PDB files, e.g. runtime\src\coreclr\.nuget\_.pdb.

Problem

Currently, we cannot do this through the cloaking system which has include and exclude lists. Git takes the include ones and then substracts the exclude ones, whereas here we want another layer - "exclude files from the exclude filter".

Proposed solution

To solve this, we could use the .gitattributes file and give the final say in what gets included to the hands of the individual repos. We could let the repo assign vmr-keep and vmr-ignore attributes which would override the globs from source-mappings.json.

From this, we could always only include files without the ignore attribute and then "not exclude" any with the keep attribute. For instance, we could resolve current problem like this:

dotnet/runtime > .gitattributes

src/coreclr/**/*.pdb       vmr-keep
src/installer/**/*.pdb     vmr-keep

VMR > source-mappings.json

{
    "defaults": {
        "defaultRef": "main",
        "exclude": [
            "**/*.dll",
            "**/*.Dll",
            "**/*.exe",
            "**/*.pdb",
            "**/*.mdb",
            "**/*.zip",
            "**/*.nupkg"
        ]
    },
    "mappings": [
        {
            "name": "runtime",
            "defaultRemote": "https://github.com/dotnet/runtime"
        }
    ]
}

This would give us a list of filters such as this:

  • :(glob,attr:!vmr-ignore)**/*
  • :(exclude,glob,attr:!vmr-keep)**/*.dll
  • :(exclude,glob,attr:!vmr-keep)**/*.Dll
  • :(exclude,glob,attr:!vmr-keep)**/*.exe
  • :(exclude,glob,attr:!vmr-keep)**/*.pdb
  • :(exclude,glob,attr:!vmr-keep)**/*.mdb
  • :(exclude,glob,attr:!vmr-keep)**/*.zip
  • :(exclude,glob,attr:!vmr-keep)**/*.nupkg

premun avatar Aug 08 '22 16:08 premun

@mmitche thoughts on the proposal above? Is this feasible?

premun avatar Aug 08 '22 16:08 premun

How does the cloaking relate to git as it's currently implemented today? Is git used to do the cloaking or are we implementing it ourselves?

The gitattributes idea is interesting. It does look like .gitattributes allows for arbitrary attributes to be applied to a file. It is a little non-intuitive that the inclusion/exclusion metadata would be specified across multiple files, though I think well placed documentation and comments could avoid confusion there.

The other option might be to say that inclusions always filter off the full possible file set, while exclusions remove off the current keep set, then use a list of globs to generate the patterns. So the equivalent would be:

[
  {
    "type" : "include",
    "pattern" : "**/*"
  },
  {
    "type" : "exclude",
    "pattern" : "**/*.dll"
  },
  {
    "type" : "exclude",
    "pattern" : "**/*.pdb"
  },
  {
    "type" : "exclude",
    "pattern" : "**/*.Dll"
  },
  {
    "type" : "include",
    "pattern" : "src/coreclr/**/*.pdb"
  }
]

mmitche avatar Aug 08 '22 18:08 mmitche

Question

If we remove things such as .dll from the code like we already do for the tarball (https://github.com/dotnet/installer/blob/8f639696e6d57fb09e03e89c6397d913de1231ed/src/SourceBuild/Arcade/tools/SourceBuildArcadeTarball.targets#L246-L265), we might break repo's ability to run tests in the VMR. Should we then really do it or just keep this step?

Yeah, that might be an issue. It also sounds like another issue we might run into; what if we find files that are required for a build of one platform but are prohibited under another? Basically, required in one scenario but disallowed in another.

In this case, it a dll is required for testing, but disallowed for Linux source build. Maybe this won't happen, maybe it will. Similar cases might be things like:

  • Binary required for testing but needs to be removed for Linux source build. Can it live in the VMR and be filtered later? Maybe instead we just say that X Y and Z tests won't be able to run in the VMR?
  • Source file required for Windows build has a license that does not allow its inclusion in Linux source build.

@MichaelSimons Do you think these scenarios may show up? Is it required that the repository that we hand to our partners already be filtered?

/cc @richlander

mmitche avatar Aug 08 '22 18:08 mmitche

I'm likely missing something basic. I was thinking that cloaking would be oriented on source not binaries. Can you describe the flow, both leading up to and including the cloaking?

richlander avatar Aug 08 '22 23:08 richlander

@richlander looking at how we generate the tarball today, it seems like there are 2 "fake" PDB files in dotnet/runtime that have to make it to the VMR for runtime to build. Example: https://github.com/dotnet/runtime/blob/main/src/coreclr/.nuget/_.pdb (just an empty file). Using MSBuild, it's easy to list all PDBs and then handpick these two out of there, but not as much with the git setup we have. It's only two files and their empty, so unlikely to change and we could just copy them to the VMR manually but I feel like we should still be able to have this kind of flexibility built-in in the VMR sync as more cases can come easily and we could avoid problems.

The other question about breaking repo tests without checked-in binaries is connected to things like unit tests that rely on some mock artifacts, e.g. Arcade SDK has targets for manipulating nupkgs and so the unit test projects have some nupkgs checked in which serve as test subjects. Example: https://github.com/dotnet/arcade/blob/main/src/Microsoft.DotNet.Build.Tasks.Feed.Tests/TestInputs/Nupkgs/test-package-a.1.0.0.nupkg

@mmitche How does the cloaking relate to git as it's currently implemented today? Is git used to do the cloaking or are we implementing it ourselves?

At the moment, we are actually super lucky, things are as simple as they can possibly be. We just build one git diff command and we call one git apply command and that's it. We are able to take the usual globs and create filters: https://github.com/dotnet/arcade-services/blob/fbaae548a629b404baef3c8c9683a595dce9241f/src/Microsoft.DotNet.Darc/src/DarcLib/VirtualMonoRepo/VmrManager.cs#L356-L376

Example command would be:

git -C "./runtime/.git" diff --patch --binary --output "./runtime.patch" \
4b825dc642cb6eb9a060e54bf8d69288fbee4904..7320d777078a9e73874308f30800eacbeb459368 -- \
    ':(glob,attr:!vmr-ignore)**/*'             \ # add all files (there could be several of these)
    ':(exclude,glob,attr:!vmr-keep)**/*.dll'   \ # substract using the exclude magic keyword
    ':(exclude,glob,attr:!vmr-keep)**/*.Dll'   \ # this example already includes the .gitattributes idea from above
    ':(exclude,glob,attr:!vmr-keep)**/*.exe'   \
    ':(exclude,glob,attr:!vmr-keep)**/*.pdb'   \
    ':(exclude,glob,attr:!vmr-keep)**/*.mdb'   \
    ':(exclude,glob,attr:!vmr-keep)**/*.zip'   \
    ':(exclude,glob,attr:!vmr-keep)**/*.nupkg'

What git does is it adds all pathspecs into a set, then substracts all exclude pathspecs from that set. So no layering there. It creates a pure regular diff patch that we then apply to the right path.

I like the proposal with an explicit ordered list of include/exclude operations but I don't see how to apply it on the git operations. Please note that this way git handles automatically files that were for example created in an excluded folder and after N commits were moved into an included one (and vice versa).

To be honest, I kind of like the idea of having the final say in the repo as the repo understands this concern well but it also leaks the knowledge about the existence of the VMR into the repo (though I believe we will not completely avoid that in the future anyway).

The above seems to work though I need a bit more time as I have some problems with git and applying this to a large set of files (it works fine on smaller sets, hopefully this is not a git bug).

premun avatar Aug 09 '22 08:08 premun

To be honest, I kind of like the idea of having the final say in the repo as the repo understands this concern well but it also leaks the knowledge about the existence of the VMR into the repo (though I believe we will not completely avoid that in the future anyway).

I think we're likely to have some info about "last synced sha" or other such info in the individual repos.

The above seems to work though I need a bit more time as I have some problems with git and applying this to a large set of files (it works fine on smaller sets, hopefully this is not a git bug).

Alright, given that this is simple, native git functionality, I think using gitattributes is a reasonable way to go until we know otherwise. If you start getting into other corner cases, it may be worth looking in a different direction.

mmitche avatar Aug 09 '22 14:08 mmitche

Alright, I made it work natively, without corner cases, seems to do just what we need.

What about the attribute names? We need two attributes - positive and negative, e.g. "ignore" and "preserve". They don't necessarily need to be "vmr" but can be something in the sense of "part of product".

~~It can also be one attribute and have different values (which might be better to make it clear only one is allowed). Example:~~

src/coreclr/**/*.pdb                              vmr=preserve
src/native/external/llvm-libunwind/LICENSE.TXT    vmr=ignore

EDIT: The diff filter when creating patches cannot do "attribute is not equal" so we cannot do !vmr=preserve or !vmr!=preserve but would have to do !vmr-preserve only unfortunately.

premun avatar Aug 10 '22 10:08 premun

@mmitche I wonder if this can eventually be a solution for the platform-specific files?

E.g.

platform-specific=osx,linux

It could then be quite easy to synchronize these files between the actual VMRs / branches so that for example downstream RedHat would have a simple way to pull upstream minus win/osx files? (just a quick idea)

It's also possible to create filters (sets of attributes) and assign those:

[filter "win"]
	vmr-preserve
	platform-specific-win

*.cs	filter=win

premun avatar Aug 10 '22 10:08 premun

@mmitche I wonder if this can eventually be a solution for the platform-specific files?

E.g.

platform-specific=osx,linux

It could then be quite easy to synchronize these files between the actual VMRs / branches so that for example downstream RedHat would have a simple way to pull upstream minus win/osx files? (just a quick idea)

It's also possible to create filters (sets of attributes) and assign those:

[filter "win"]
	vmr-preserve
	platform-specific-win

*.cs	filter=win

That's a possibility, but I I'd like to avoid different VMR content per platform if possible, until we absolutely need to.

mmitche avatar Aug 10 '22 14:08 mmitche

We have reach file parity between the tarball and the VMR after https://github.com/dotnet/installer/pull/14706 so closing this

premun avatar Oct 12 '22 13:10 premun