
Report files changed to limit testing scope

donny-dont opened this issue 9 years ago • 84 comments

Currently we have a long-running test suite for a product. The repository gets enough commits that running the full suite on every one would back up the build server. To limit that, it would be nice to get back which files were changed.

As a contrived example, assume the app has 3 sections (A, B, and C), each with a corresponding test suite. If any files within A's space change, then only the A test suite needs to run.

This will allow for quicker turnaround time for any functional/integration tests.

pipeline:
  frontend:
    image: node
    commands:
      - cd app
      - npm run test
+   when:
+     changeset:
+     includes: [ "**.js", "**.css", "**.html" ]
  backend:
    image: golang
    commands:
      - go build
      - go test -v
+   when:
+     changeset:
+     includes: [ "**.go" ]

+changeset:
+  includes: [ "**.go" ]

donny-dont avatar May 18 '15 18:05 donny-dont

The same idea from another perspective: filtering based on directories that have changed, rather than on file extensions.

tonglil avatar Apr 24 '17 23:04 tonglil

I think there are two interesting use cases here, from what I can tell:

  1. the ability to support mono repos, where a single repository consists of multiple loosely coupled projects. The team effectively wants to treat each sub-project separately, with its very own drone.yml file or, at a minimum, its own set of steps.
  2. the ability to limit test execution to reduce build time for a standard repository, where the term standard is used to mean "not a mono repository". I think, in general, we can probably agree here that the goal is reducing test execution time, and limiting test execution is perhaps one solution. Another solution might be running all tests, but with parallelization to reduce execution time.

I think perhaps we need to step back and fully discuss the problems we are trying to solve. The solution (described in this issue) may very well be the correct solution. But it would help to fully document the different problems / use cases we are trying to solve before we settle on the solution(s).

bradrydzewski avatar Apr 25 '17 11:04 bradrydzewski

As it might help someone in the meantime, this is what I use as a workaround right now:

- if [[ "${DRONE_PULL_REQUEST}" ]]; then CHANGEDPATHS=$(git --no-pager diff --name-only FETCH_HEAD FETCH_HEAD~1); else CHANGEDPATHS=$(git --no-pager diff --name-only HEAD~1); fi

and then filter on the CHANGEDPATHS variable.
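To then act on the variable in a later command of the same step, a minimal sketch (the app/ prefix and test command are illustrative, not from the original comment):

- if echo "$CHANGEDPATHS" | grep -q '^app/'; then cd app && npm run test; fi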

mrueg avatar Apr 25 '17 12:04 mrueg

I guess I can toss in two example problems that we have:

  • A mono repo with various package manifests in it. I haven't bothered setting it up with Drone since it'd need to build every AMI unless I do some of the path stuff like @mrueg is talking about.
  • Ditto for a Dockerfile repo of related Dockerfiles.

This can already be worked around, but it'd be good to either document how to do so or to figure something better out. I don't have strong preferences for either approach.

gtaylor avatar Apr 25 '17 15:04 gtaylor

So do we need the ability to limit steps by files or folders changed? Or to load different yamls based on file or folder changes? Or both?

A mono repo with various package manifests in it. I haven't bothered setting it up with Drone since it'd need to build every AMI unless I do some of the path stuff like @mrueg is talking about.

I get the sense that each sub-project in a mono-repo might want to manage its own yaml file and possibly its own secrets, but I'm not entirely sure.

bradrydzewski avatar Apr 25 '17 16:04 bradrydzewski

For our use case, I'd be pretty happy limiting steps by files/folders changed. We could load different yamls based on changes, but that could become a confusing situation pretty quickly as far as variable/secret surface area goes.

gtaylor avatar Apr 25 '17 16:04 gtaylor

What @mrueg suggested works for build steps, but not for plugins, since they don't execute bash.

If conditions can be plugin-fied, that would enable whatever condition you want to compare on: filename extensions, changeset directory, regex on git tag names, obtaining a lock, making a request out to some external API, etc., in addition to the conditions that already exist today. The plugin returns 0 to continue, or 1 (anything non-zero) to skip.
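For what it's worth, a condition plugin under that imagined contract could be as small as a shell script. This is a sketch only, since no such contract exists in Drone today; PLUGIN_INCLUDES is a hypothetical setting (Drone passes plugin settings as PLUGIN_* environment variables):

#!/bin/sh
# Imagined contract: exit 0 = run the gated steps, non-zero = skip them.
# DRONE_COMMIT_SHA / DRONE_PREV_COMMIT_SHA are the usual build metadata vars.
git --no-pager diff --name-only "$DRONE_PREV_COMMIT_SHA" "$DRONE_COMMIT_SHA" \
  | grep -qE "$PLUGIN_INCLUDES"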

tonglil avatar Apr 25 '17 21:04 tonglil

Written up my thoughts and ideas on handling monorepos.

Proposal for how Drone could handle monorepos

Summary

By a monorepo I mean a single repository that contains multiple projects. As the number of projects in the repo grows, or if the projects have very expensive builds, it becomes very cumbersome to rebuild all projects on every commit.

So, at the very least, for a monorepo you would want to somehow filter which builds / build steps are triggered by a commit.

Approaches

Keep current .drone.yml but provide a changeset filter for pipeline steps

As suggested in https://github.com/drone/drone/issues/1021#issue-77762241, add include/exclude conditions for pipeline steps.

Pros

  • Small amount of changes to current Drone architecture
  • Doesn't add new concepts to Drone and is a very simple idea to understand
  • Also covers the non-monorepo case where you don't want to run all build steps all the time, e.g. for expensive test suites
  • If the projects in the repo share build steps you can easily reuse them (provided the ordering is the same)

Cons

  • Could quickly end up with unwieldy .drone.yml files
  • Harder to read .drone.yml with lots of conditional logic to think about
  • Possibly more merge conflicts, as people working on different projects could be editing .drone.yml concurrently

Allow for multiple .drone.ymls per repo

In this case Drone needs to know where the sub .drone.yml files are, and when to run builds for those sub ymls. And if we want to handle renaming / moving of projects within the repo, we also need a way to track the sub ymls, i.e. an id that never changes and refers to the sub .drone.yml.

Root .drone.yml contains location of sub configs (MY PREFERENCE)

Store the metadata for sub .drone.ymls in a root .drone.yml (perhaps with a different name, e.g. .drone.root.yml).

Data per sub yml

  • An id for the sub yml
  • A display name
  • The location in the repo of the sub yml
  • An include array of paths that should trigger the sub yml
  • An exclude array of paths that should not trigger the sub yml (see the sketch below)
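A sketch of what such a root config could contain; every field name here is illustrative rather than an agreed syntax:

# .drone.root.yml (hypothetical)
projects:
  - id: frontend-v1          # stable id that survives renames/moves
    name: Frontend
    path: client/.drone.yml  # location of the sub yml in the repo
    include: [ "client/**" ]
    exclude: [ "client/docs/**" ]
  - id: backend-v1
    name: Backend
    path: server/.drone.yml
    include: [ "server/**" ]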

Pros

  • The drone server does not need to hold any new configuration that may well change over time / across branches
  • Simple to always find a repo's builds, just look at the root yml
  • Easily tell which files each project is concerned about
  • Each sub yml should remain identical to current .drone.yml

Cons

  • ~~if you have a lot of sub ymls then perhaps the parsing on each commit could get slow?~~
    • Did a test with grep against 867,374 paths (all my system's paths) with the pattern '.*/.*\.js' returning 27,857 paths, the time taken was only 60ms
  • Easier to get merge conflicts as people working on different projects need to edit one file?
  • Might use the same id on two different branches, and Drone would treat them as the same project?
    • Don't think there is any way around this type of problem; even if you use the path to the sub yml you can have a similar issue (although less likely)
  • Introduces a new concept, this new "root" config file

Store sub yml metadata on drone server

Data per sub yml

  • An id for the sub yml
  • The relative path of the sub yml
  • A display name for the sub yml?
  • Include/exclude paths?

Pros

  • Since data is on the server it should always be fast and easy to look up

Cons

  • What happens when a yml location moves? The server needs to be aware and update the db record
  • What happens if the yml has moved on one branch but not on another?
  • Puts state / config about what / when to build into the Drone server; now it is not so simple to look at a repo and know what is being built

Implicit scoping of sub yml files to their children directories

All .drone.yml files in a repo are picked up and builds will be run for them. Only run a build if a file in the same directory or in a descendant directory has changed in the commit.
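For illustration, a repo under this scheme might look like the following (directory names invented):

repo/
  serviceA/
    .drone.yml    # built only when something under serviceA/ changes
  serviceB/
    .drone.yml    # built only when something under serviceB/ changes
  shared/         # no yml here, so changes trigger nothing on their own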

Pros

  • No extra config or state in Drone server
  • Simple concept

Cons

  • Lacks flexibility: you might want to put a .drone.yml in a base directory to cover 2 of its sub-directories, but it would implicitly cover all the sub-directories, perhaps quite a lot of them
  • If a sub yml's location is changed how do we track build history?
    • Could have an id in the sub yml just like for root yml approach
  • Can't have an unused .drone.yml in the repo; it would always be picked up

Scan for and read all sub yml files on each commit

As above but without the implicit scoping

Pros

  • No extra config or state in Drone server
  • Simple concept in the simple case

Cons

  • Now it is complex to find what might trigger a build: you need to check every yml in the repo to see if its includes/excludes cover a file/directory
  • If a sub yml's location is changed how do we track build history?
    • Could have an id in the sub yml just like for root yml approach
  • Can't have an unused .drone.yml in the repo; it would always be picked up

jamesk avatar Apr 26 '17 10:04 jamesk

Good ideas @jamesk; I like the explicit subdir-yml line of thinking, as it forces people to be aware of and maintain the yml files, keeping them tidy and very declarative for Drone. Implicit reading of yml files can become confusing and turns into hidden, undocumented knowledge.

  1. What would reading a dir-specific yml file mean for the working dir, though? Should the working dir be the subdir the yml file is in, or still the root dir of the repo?

  2. Any reason for not using a plugin container to control conditions?

tonglil avatar Apr 26 '17 17:04 tonglil

@tonglil Yeah I prefer the sub yml approach as well.

In Response:

What would reading a dir-specific yml file mean for the working dir, though? Should the working dir be the subdir the yml file is in, or still the root dir of the repo?

In the usual case I think you would want the working dir to be where the sub yml is. Perhaps, though, an extra parameter could be added to the workspace block to specify the default working dir separately from the clone location.
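A rough sketch of that idea, where workdir is the imagined extra parameter (base and path are the workspace keys that exist today):

workspace:
  base: /build
  path: .            # clone location, as today
  workdir: client    # hypothetical: default working dir for the steps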

Any reason for not using a plugin container to control conditions?

I was thinking that each "pipeline" i.e. sub yml would get its own build history in Drone, probably by adding an extra field to the unique key for builds i.e. build_number, build_repo_id AND pipeline_id.

I thought it would be good to behave like branch filtering: not even creating a build if the commit is not relevant. GitHub, for instance, has no way of deleting commit statuses, so if you created a build first, I think you'd have to create X pending build statuses, where X is the number of projects in the monorepo, then later mark them each as successful (even though you ended up not running anything in them).

For this reason I was thinking that the main Drone server should be able to easily and quickly make any decision about whether to trigger a build, like it does for branch filtering, i.e. based purely on the contents of the yaml config and info about the commit. My understanding is that if it were left to a plugin, we would need to spin up the plugin's docker image, which would usually only be done on the drone agents as part of a build?

jamesk avatar Apr 26 '17 20:04 jamesk

Perhaps instead of sub-yaml we define sub-projects in Drone? It is conceptually similar, but perhaps more general purpose and might support more use cases. For example:

  1. drone/drone has the default sub-project which builds, tests and publishes Drone
  2. drone/drone has a deployment sub-project with a completely different yaml and set of secrets concerned with production deployment

So basically, instead of repository -> build, it would be repository -> project -> build.

Every repository would have a default project, with the ability to create additional projects. So the visual / user experience would not change for single-project repositories (you would see just the default). The complexity of multiple projects would only be exposed to those that need it.

bradrydzewski avatar May 05 '17 12:05 bradrydzewski

This sounds potentially confusing to explain to the users. Repeating ourselves in .drone.yml isn't sexy, but I'd almost prefer verbosity and explicitness there rather than starting down this more complicated path. At least at this stage in Drone's development (pre-1.0).

gtaylor avatar May 06 '17 00:05 gtaylor

This is definitely more of a 2.0 change. I think there is some precedent here now that GitLab supports subgroups, which are conceptually similar: https://docs.gitlab.com/ce/user/group/subgroups/

Reference request to support subgroups in drone https://github.com/drone/drone/issues/2009

I share the concerns about the UX but think this is something that could be overcome. The UI hides the fact that technically every build is a matrix build, for example. But I agree that we should limit the initial implementation to changeset analysis in the when clause. It provides an immediate (albeit more verbose) solution, which we can always build upon. And it prevents us from making a more complex change that cannot be undone once implemented ...

I think we can perhaps split these into a 1.0 proposal and then a 2.0 design document that evolves over time as we observe the 1.0 behavior. I see this being similar to the golang proposal process which is definitely more long term focused https://github.com/golang/proposal/tree/master/design

bradrydzewski avatar May 06 '17 00:05 bradrydzewski

I think a 2.0 change is totally fair.

The verbosity and explicitness are becoming an issue with larger projects/repositories, where it becomes very cumbersome to parse through a .drone.yml approaching 1k lines to figure out what is happening when, and how to configure and modify builds. What subgroups would help with is allowing independent subtrees to define their own workflow in a way that separates concerns for monorepo-like repositories.

tonglil avatar May 08 '17 23:05 tonglil

I think you're overcomplicating it. The git plugin just needs to do the same thing as Concourse's git-resource: emit a truthy value when new commits contain files matching specified globs.

If a repo has many different projects, then sure, specify their own drone.yml files... but that's really a different topic (installing a directory subset from a repo as a Drone project).

airtonix avatar Oct 25 '17 05:10 airtonix

Perhaps the two systems are architected a bit differently. In Drone, the yaml file is fetched using the GitHub API and then pre-compiled to an IR and executed. This happens before any code is cloned, which creates a bit of a chicken-and-egg issue. Under different design constraints, a simpler approach like this would work.

bradrydzewski avatar Oct 25 '17 05:10 bradrydzewski

@bradrydzewski it's the same in Concourse: you install the pipeline before any code is cloned. Concourse does this via a cli client.

But the point I'm making is that this ticket has turned from "we need the pipeline to continue or not based on the presence of files in the commit" into "let's change how a project is set up".

That second topic... it's a whole other feature request.

airtonix avatar Oct 25 '17 05:10 airtonix

Also, in Concourse I can have as many or as few pipeline.yml files as I like all over my repo, but some tasks in a pipeline are always going to need file-glob-based conditionals regardless.

airtonix avatar Oct 25 '17 05:10 airtonix

Also, in Concourse I can have as many or as few pipeline.yml files as I like all over my repo, but some tasks in a pipeline are always going to need file-glob-based conditionals regardless.

The current version of Drone does not have guaranteed access to the file system of a running build. It is possible for Drone to launch builds on a remote machine using the Docker remote API or the Kubernetes API. This is a design constraint that needs to be taken into consideration.

But the point I'm making is that this ticket has turned from "we need the pipeline to continue or not based on the presence of files in the commit" into "let's change how a project is set up"

I think the goal is to understand the different use cases and developer workflows that we need to support. This issue (not criticizing the OP) proposes a solution; however, we cannot be sure it is the best or optimal solution until we gather such information. For example, one solution to the OP's issue (slow tests) could be parallelism, and a solution to mono-repositories could be multiple yaml files, neither of which requires diffs or file globs. Or we could come to the conclusion that the OP's original proposal is the preferred solution. So far I think this thread is generating some good discussion.

I think you're overcomplicating it.

This is also a possibility. There is certainly a benefit to a fresh pair of eyes taking a fresh approach. Have you considered creating a patch to demonstrate how this could be solved in a more simple way?

bradrydzewski avatar Oct 25 '17 05:10 bradrydzewski

Also, since we are gathering use cases and information, it would be great if you could provide more context. What underlying problems or use cases are you hoping to solve with globs?

bradrydzewski avatar Oct 25 '17 06:10 bradrydzewski

I'm in the midst of working on a pair of repos that could benefit from steps that are conditionally run on changes to certain file patterns. Here's what it looks like for both repos:

  1. The repo contains a number of sub-directories, each being separate but related.
  2. One of the directories is a re-usable library.
  3. The other 5+ are things that use the library.
  4. In addition to the other 5 directories using the library, we build and push a compiled object to an S3 repo whenever the single re-usable library dir is modified.
  5. The other 5 directories are compiled, published to S3, and also get pulled into a Docker build+publish.

As such, it'd be great if we could only build one of the 5 sub-dirs if they get touched.

It is true that there are other ways around this, such as breaking the larger repo up into more numerous repos. However, the team in question already has established workflows and quite a few non-pure-engineer contributors who value the simplicity. It can be significantly more difficult (given their backgrounds and skillsets) to manage six separate repos, each with its own build/deploy flow, instead of one.

If I was able to do something like:

when:
  pathsChanged: ['some_dir/*']

We'd be pretty stoked and would have an easy way for our teams that use bulkier repos to proceed.

gtaylor avatar Oct 25 '17 15:10 gtaylor

I've tried to nail it this way:

pipeline:
  sometest:
    image: debian:buster
    commands:
      - echo "Found changes in:" | tee gitchanges.txt
      - echo $(git show --pretty="format:" --name-only $(git rev-parse HEAD) | cut -d '/' -f1 | sort | uniq | xargs) | tee -a gitchanges.txt
      - some_script_with_logic gitchanges.txt

Not yet working as I'd like: sometimes if I delete folder/a.file and add folder/b.file, git returns R100 instead of A and D. That stuff is strange.

But it was a fast try; gonna rtfm git again.
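For what it's worth, the R100 is git's rename detection pairing the delete with the add; detection can be switched off so they show up as separate A and D entries again. A sketch of the tweaked command, assuming raw file lists are the goal:

git show --pretty="format:" --name-only --no-renames $(git rev-parse HEAD)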

yellowmegaman avatar Oct 25 '17 17:10 yellowmegaman

BTW, got exactly the same problem: the team uses a monorepo, one folder is the base, the others are projects. All the software is written in Scala, so building modules separately is best for us, since the sbt build tool is really slow.

And I also have a repo with a bunch (100+) of ansible roles, which I'd like to be able to test on docker/VMs; one change triggering them all would be a killer.

yellowmegaman avatar Oct 25 '17 17:10 yellowmegaman

@bradrydzewski The use case is that you have a repo whose parts are not built as one thing; different parts of the repo may be built in different ways. Let's look at a typical repo I deal with at Fusion:

$ ls -al ./
./repo/
  client/
    src/
      styles/
        main.scss
  server/
    something.sln
    ccnet.build
    foo.ps1
  yarn-cache/
  Dockerfile
  docker-compose--build.yml
  docker-compose--build.yml.tmpl
  docker-compose.yml
  docker-compose.yml.tmpl
  package.json
  .eslintrc
  .yarnrc

There are three major things going on here:

  1. frontend builds that run in a container using ./client/**/* as input and output into ./server/static/theme
  2. server builds that are only triggered by changes in ./server/**/* and run on a Windows server under CruiseControl, then on to Octopus Deploy
  3. docker image creation that works off a collection of glob patterns, similar to the frontend

All three of these job pipelines should only be triggered when their respective files change. So, for example, we don't want a CruiseControl build running when none of the files under ./server change.

Luckily, CruiseControl has directives around file-exclusion triggers. We've solved that.

At the moment I'm using Concourse, where resource changes (or a manual UI button click) trigger pipeline runs. In our pipeline, I describe the various parts of the repo as git resources, each having a list of one or more optional glob patterns that refine when the git resource emits a new version, which in turn triggers a linked pipeline run.

resource-types:
  - name: git-branches
    type: docker-image
    source:
      repository: vito/git-branch-heads-resource
      tag: latest

resources:
  - name: frontend-source
    type: git-branches
    source:
      uri: "{{project_repo_uri}}"
      private_key: {{bitbucket_ci_private_key}}
      branch: develop
      paths:
        - client/**/*
        - package.json
        - yarn.lock
        - .yarnrc
  - name: docker-source
    type: git-branches
    source:
      uri: "{{project_repo_uri}}"
      private_key: {{bitbucket_ci_private_key}}
      branches: 
      - master
      paths:
        - docker/**/*
        - yarn-cache/**/*
        - package.json
        - yarn.lock
        - .yarnrc

Above you can see two resources, frontend-source and docker-source. I've left out my third scenario because, as a concept, it's just more of the same...

All Concourse git resources see new commits, but they only emit new versions, and thus trigger, when the commit contains paths that match the globs described.

Further on in the pipeline, I wire that up like so:


jobs:
  - name: build-image
    public: true
    serial: true
    plan:
      - do:
        - get: docker-source
          trigger: true
          params: {depth: 1}
        - task: get version
          config:
            platform: linux
            image_resource:
              type: docker-image
              source: {repository: realguess/jq}
            inputs:
            - name: docker-source
            outputs:
            - name: version
            run:
              path: sh
              args:
              - -exc
              - |
                jq '.version' --raw-output < ./docker-source/package.json > version/tag
                echo "Sniffing Version $(cat version/tag)"
        - put: image-builder
          params:
            build: docker-source
            dockerfile: docker-source/docker/Dockerfile
            tag: version/tag
            tag_as_latest: true

This job pipeline only runs when the docker-source resource emits new versions which only happens when new commits contain whitelisted files.

Now the important part: my pipeline file is called concourse.yml... but I can call it whatever I want, and it can be wherever I want (Google Drive, the repo, stored in Gmail, who cares), because Concourse only creates pipelines as a result of pushing the yaml file through its cli tool, fly. So I could technically describe my three scenarios above as three separate files (but then I'd miss out on being able to link them together).

Now, the reason I'm here annoying you is that Concourse is awesome... like, seriously awesome stuff, but you really have to be a lover of the command line to get anything done with it. It doesn't have the nice UI that Drone has. I started making a UI for this... but at some point I want to have a life, and the rest of the team here at Fusion isn't going to understand how to manipulate the command line to get new projects into Concourse. I'm trying to decide if Drone is the tool that will save me from writing a custom UI for Concourse. Without path filters, though... it doesn't seem likely.

Drone is really nice and the UI is great, so even if you implement a feature where installing a project detects nested drone.yml files, tasks within each of those files will still need to be restricted to changes to certain files, while still being able to join output back into other tasks.

airtonix avatar Oct 27 '17 04:10 airtonix

Just to add another use case, pretend this is the pipeline:

- install deps
- compile
- build + push docker image -> latest
- deploy latest

If that team has issues w/ their drone build pipeline, they are going to be editing .drone.yml lots of times. However, the app itself isn't changing.

In this instance, the first 3 steps of the pipeline can be completely skipped by matching on a src/* file-change condition, probably saving 80% of the time it takes for the pipeline to run, and the corresponding compute resources.

This also applies if the repo contains docs that get published somewhere, or assets that are uploaded. There's no need to build src/ if only docs/ or assets/ change, and vice versa.
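Expressed in the syntax proposed at the top of this issue (still hypothetical at this point; the step name and image are placeholders), that skip might look like:

pipeline:
  compile:
    image: node
    commands:
      - npm run build
    when:
      changeset:
        includes: [ "src/**" ]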

tonglil avatar Nov 02 '17 16:11 tonglil

I tried to read through the Concourse definition @airtonix posted, but it's too complex to parse IMO. However, I agree with his statement above that there are 2 separate use cases here:

  1. conditional glob matching
  2. subdirectory pipeline definitions

tonglil avatar Nov 02 '17 16:11 tonglil

I want to add my use case as well. It's basically the same as @gtaylor's:

when:
  pathsChanged: ['some_dir/*']

because then I could apply it to any plugin.

Right now, I'm doing a trick in a normal image (golang:1.9.2): in the commands, I use a version of @mrueg's script. Basically, it takes the latest commit sha, diffs it, and gets the files changed. Then I grep over the list and bail out early (with exit 0) if the file isn't in it.

pipeline:
  build-base-image:
    image: golang:1.9.2
    commands:
      - git diff-tree --name-only ${DRONE_PREV_COMMIT_SHA} | grep -q Gopkg.lock || exit 0
      - … steps that should execute if the Gopkg.lock was modified…

This allows me to update the dependencies if they're not up to date yet.

But what I really want to do is bake those dependencies into an image, rebuilding it only when the Gopkg.lock changes. And I cannot use that trick on a plugins/docker image because, from what I gathered, I cannot add commands to it.
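A sketch of the desired configuration; the when/changeset block is the hypothetical piece (borrowing the syntax proposed in this issue), and the repo/dockerfile values are illustrative:

build-deps-image:
  image: plugins/docker
  repo: myorg/deps-image       # illustrative image name
  dockerfile: Dockerfile.deps  # illustrative
  when:
    changeset:
      includes: [ "Gopkg.lock" ]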

dolanor avatar Jan 03 '18 14:01 dolanor

This would make us definitely switch to Drone CI. Most CI systems have this, and it makes it easy to work with monorepos and trigger builds/deployments based on changed projects.

gaui avatar Mar 26 '18 19:03 gaui

Most CI systems have this

Which CI systems do support this?

tboerger avatar Mar 28 '18 19:03 tboerger

@tboerger TeamCity, as an example. This issue is in fact the only reason why we still stick to TeamCity and are not using Drone.

IharKrasnik avatar Apr 04 '18 10:04 IharKrasnik