cartographer icon indicating copy to clipboard operation
cartographer copied to clipboard

Introduce RFC 18 Workload Report Artifact Provenance

Open waciumawanjohi opened this issue 3 years ago • 21 comments


name: RFC 18 about: Workload Report Artifact Provenance Readable


Changes proposed by this PR

Opens RFC 18

Release Note

PR Checklist

Note: Please do not remove items. Mark items as done [x] or use ~strikethrough~ if you believe they are not relevant

~- [ ] Linked to a relevant issue. Eg: Fixes #123 or Updates #123~ ~- [ ] Removed non-atomic or wip commits~ ~- [ ] Filled in the Release Note section above~ ~- [ ] Modified the docs to match changes~

waciumawanjohi avatar Jan 12 '22 18:01 waciumawanjohi

✔️ Deploy Preview for elated-stonebraker-105904 canceled.

🔨 Explore the source changes: 2dd4ee0d3518d727a7acbf2a9e843677c24fcb20

🔍 Inspect the deploy log: https://app.netlify.com/sites/elated-stonebraker-105904/deploys/61fb0b5bc32ed7000746e187

netlify[bot] avatar Jan 12 '22 18:01 netlify[bot]

Copying over my comments from RFC14 that are still relevant:

In the office hours this week (Jan 10th, 2022), we talked about how we could extend this type of artifact tracing to track artifacts across clusters.

Assuming you started with a SupplyChain that looked something like the following (from git-writer example):

[source-provider] -> [image-builder] -> [config] -> [git-writer]

But you made a small change and wrapped the git-writer in the SupplyChain with a ClusterSourceTemplate, it would let you emit the revision/url of the committed config. By doing this you would end up with an artifact representing the commit to the config-repo as the last item in the status of your workload.

So your workload status might end up looking like this:

artifacts:
  - source:
      id: 000
      url: https://github.com/my-org/my-source
      revision: abc123
      passed:
      - resource-name: source-provider
        apiVersion: source.toolkit.fluxcd.io/v1beta1
        kind: GitRepository
  - image:
      id: 111
      image: gcri.io/asdfsdfasdf
      passed:
      - resource-name: image-builder
        apiVersion: kpack.io/v1alpha2
        kind: Image
      from:
      - id: 000
  - config:
      id: 222
      config: shasum(of-some-config)
      passed:
      - resource-name: my-config
        apiVersion: v1
        kind: ConfigMap
      from:
      - id: 111
  - source:
      id: 333
      url: https://github.com/my-org/my-config
      revision: xyz789
      passed:
      - resource-name: config-writer
        apiVersion: carto.run/v1alpha1
        kind: Runnable
      from:
      - id: 222

Now lets assume you are following the basic-delivery example on another cluster.

You would end up with a Deliverable with the following status:

artifacts:
  - source:
      id: 333
      url: https://github.com/my-org/my-config
      revision: xyz789
      passed:
      - resource-name: config-provider
        apiVersion: source.toolkit.fluxcd.io/v1beta1
        kind: GitRepository
 - source: some other-artifact
   ...

You can see how we could correlate these two completely independent resources, based solely on their artifacts.

jwntrs avatar Jan 12 '22 19:01 jwntrs

Copying over my comments from RFC14 that are still relevant:

@rawlingsj made a really insightful comment in another unrelated thread that we should consider here:

Customers will likely not be willing to give access from a dev cluster (and developer users) to others such as staging / production. Instead it will probably be easier to send events out of those locked down clusters to a centralized place for users to get visibility into their environments.

My comment here definitely assumes read access to all clusters. So maybe we should start thinking about writing out artifacts as k8s events at the same time as writing them to the workload/deliverable status. That way if someone is already emitting k8s events from their prod clusters, we can integrate into their existing tooling.

jwntrs avatar Jan 12 '22 19:01 jwntrs

But you made a small change and wrapped the git-writer in the SupplyChain with a ClusterSourceTemplate

@jwntrs I think perhaps this thinking is solid, and should be enforced, that all steps, no matter what, produce an artifact. Even in the case of a kpack deploy, we should have a kpack deploy ID or something of that nature, as an artifact. That way, every Supply Chain Resource has representation in the artifacts list.

squeedee avatar Jan 13 '22 17:01 squeedee

@scothis

I'm in favor of capturing a reference to the resources created for a Workload on the status, but I'm skeptical that capturing the output values will give users the information they think they are getting. There's little point in merely duplicating state that could instead be looked up on the child resources directly unless the aggregation provides some value. It's not clear what questions you're tying to answer by providing this metadata.

The problem that we are trying to address is make it easy for the user to be able to identify the artifact graph that gets produced by a supply chain (commit-123 produces image-abc). If I was building a tool on top of Cartographer to show this information, and all I had was the k8s object graph, I would need to reimplement a lot of logic that already exists in Cartographer. For instance, you would need to cross reference each template to know what fields to show from each resource. Cartographer already has this information, so we want to write it out somewhere so it can be consumed more easily.

jwntrs avatar Jan 13 '22 21:01 jwntrs

The problem that we are trying to address is make it easy for the user to be able to identify the artifact graph that gets produced by a supply chain (commit-123 produces image-abc)

You'll never actually know that "commit-123 produces image-abc" unless you also have a deep understanding of how the resource being templated behaves and also have knowledge of the how the template is implemented. At best you can know that a new commit (commit-123) triggered a process that emitted an image (image-abc).

scothis avatar Jan 18 '22 15:01 scothis

Some open questions that I've heard raised:

  • Should workload artifacts be wiped out when spec changes? If yes, can we fast forward through the supply chain matching known inputs to outputs and rebuild the artifact tree if nothing has changed?
  • Should we write pending artifacts (e.g. waiting on observedGeneration: 5) and move on? If we don't we might starve the supply chain if we're blocking until a specific observedGeneration appears on the status.
  • Should we write an empty artifact (only has an id) when using ClusterTemplate

jwntrs avatar Jan 18 '22 18:01 jwntrs

Should workload artifacts be wiped out when spec changes? If yes, can we fast forward through the supply chain matching known inputs to outputs and rebuild the artifact tree if nothing has changed?

No. A resource should reflect the current known state of the world on its status to the best of its ability. There will inherently be latency that's ok and expected. The status should indicate the recency of the status' content. (Generation/ObservedGeneration, LastTransitionTime, etc)

Should we write pending artifacts (e.g. waiting on observedGeneration: 5) and move on? If we don't we might starve the supply chain if we're blocking until a specific observedGeneration appears on the status.

A controller should almost never wait for an async condition to become true. The general pattern is to reflect the current state and watch "interesting" resources for changes. When one of the watched resources changes, reprocess the whole control loop and pickup the latest "current" values.

scothis avatar Jan 18 '22 19:01 scothis

I think perhaps this thinking is solid, and should be enforced, that all steps, no matter what, produce an artifact.

I would support such an RFC. @squeedee Do you think this should be a separate RFC, or folded into this one.


I'm going to fold it into this one!

waciumawanjohi avatar Jan 26 '22 19:01 waciumawanjohi

The problem that we are trying to address is make it easy for the user to be able to identify the artifact graph that gets produced by a supply chain (commit-123 produces image-abc)

You'll never actually know that "commit-123 produces image-abc" unless you also have a deep understanding of how the resource being templated behaves and also have knowledge of the how the template is implemented. At best you can know that a new commit (commit-123) triggered a process that emitted an image (image-abc).

@scothis @jwntrs This is an excellent discussion to move to RFC 20, our attempt to ensure that we can tie output N to input M.

waciumawanjohi avatar Jan 26 '22 19:01 waciumawanjohi

@waciumawanjohi any thoughts on this point?

Should we write an empty artifact (only has an id) when using ClusterTemplate

jwntrs avatar Jan 26 '22 19:01 jwntrs

Should workload artifacts be wiped out when spec changes? If yes, can we fast forward through the supply chain matching known inputs to outputs and rebuild the artifact tree if nothing has changed?

No. A resource should reflect the current known state of the world on its status to the best of its ability. There will inherently be latency that's ok and expected. The status should indicate the recency of the status' content. (Generation/ObservedGeneration, LastTransitionTime, etc)

Cartographer's current behavior is to continue promoting artifacts that were stamped with a previous workload spec. As long as that is the behavior, we should not wipe out artifacts from the workload status simply because the workload spec changes.

What would be the implication of no longer promoting artifacts built from previous workloads? Let's consider an example: a workload paired with a simple source (flux gitrepo) -> image (kpack) -> deploy (knative) supply chain. The workload has been doing work and there are valid artifacts at each stage. The user then updates the .spec.build.env field on the workload. What are the implications if we don't want old artifacts to be promoted?

  1. The source step is unaffected. Cartographer creates a proposed stamp of the object, sees that the spec is the same, and there's a no-op. Cartographer notes the output of the clusterSourceTemplate and moves to the next resource.
  2. For the imageTemplate, there is a change. Cartographer creates the proposed stamp, sees that the spec is different from what is on the cluster, and submits the new object definition. But this is a kpack image. It continues to expose latestImage. But "we don't want old artifacts to be promoted". Cartographer is unaware that this is an "old" artifact, or that it is connected to a previous workload spec! It would pick up this output and promote it to Knative.

To handle this scenario we would need to do one of two things.

  1. When the workload spec updates, before we update objects, delete them.
  2. Wait for objects to be in a successful state before promoting them. This begins to touch on RFC 20.

What would we gain by adopting one of these strategies? Well one thing that isn't gained is determinism about what is deployed to prod. The old workload spec would have delivered an arbitrary number of revisions through the supply chain. What we gain is clarity about which of those revisions were driven by the old workload spec and which were driven by the new workload spec.

There is a simpler way to achieve that goal: consider the workload version as a field to be captured in the from field of each artifact. As we have control over the fact that workload respects generation and observedGeneration (and we will presumably control the persistence layer when we take on historical record keeping), we should leverage those fields. This will protect us from the changes to the workload's resourceVersion that will occur every time we reconcile the workload.

@jwntrs @scothis

waciumawanjohi avatar Jan 26 '22 20:01 waciumawanjohi

Should we write an empty artifact (only has an id) when using ClusterTemplate

Yes. As long as there is a ClusterTemplate we can capture information about what objects were created on the cluster and what inputs led to them. I'll add that to the RFC.

waciumawanjohi avatar Jan 26 '22 20:01 waciumawanjohi

Should we write an empty artifact (only has an id) when using ClusterTemplate

Yes. As long as there is a ClusterTemplate we can capture information about what objects were created on the cluster and what inputs led to them. I'll add that to the RFC.

The rfc specifies that artifacts are oneOf(source,image,config) and it's not clear what to call the things created by ClusterTemplate. I'm going with object, and I'm very open to better suggestions.

waciumawanjohi avatar Jan 26 '22 20:01 waciumawanjohi

To handle this scenario we would need to do one of two things... Wait for objects to be in a successful state before promoting them. This begins to touch on RFC 20.

RFC 20 now proposes exactly this. Assuming RFC 20 is accepted, we could then either wipe the artifact field for a workload when the spec is updated or note the generation of the workload in the from field of each artifact. From an information point of view they would be equivalent. There may be implications for persistence, but without a clearer picture of what we may do there I don't have a preference between these strategies.

waciumawanjohi avatar Jan 29 '22 08:01 waciumawanjohi

Borrowing from the intent behind this RFC and 20, I'd like to propose an alternate status shape for Workloads. Not all of the fields are expected to be populatable immediately, but it gives a direction to grow towards while leaving room for additions over time. Other changes to the cartographer model will be required to be able to surface some of these fields.

Key differences:

  • focus on resources rather than artifacts. Resources are the nodes of the graph, the artifacts are an output of a node along an edge to another node.
  • normalize object type, they are so similar the previous oneOf adds more complexity than it adds clarity. Remember that users are not writing these types, they are reading them.
  • keeps extraneous fields out of object references, separates references to template and stamped resources
  • resources are linked by resource names rather than by values that happen to be the same
...
status:
  observedGeneration: # int64
  conditions: # []metav1.Condition
  supplyChainRef:
    apiVersion: # string
    kind: # string
    namespace: # string (empty for cluster scope)
    name: # string
  resources:
  - name: # string (resource name as defined by the supply chain)
    ref: # corev1.ObjectReference (stamped resource)
      apiVersion: # string
      kind: # string
      namespace: # string (empty for cluster scope)
      name: # string
    templateRef: # corev1.ObjectReference (template resource)
      apiVersion: # string
      kind: # string
      namespace: # string (empty for cluster scope)
      name: # string
    inputs: # (I'm not sure how practical this is today given Cartographer's loose relationship between templates)
    - name: # string (resource name)
      value: # optional json.RawExtension
    outputs:
    - name: # string
      value: # json.RawExtension (may be too large to reasonably include)
      lastTransitionTime: # metav1.Time

    # these fields are modeled after metav1.Condition
    type: # enum 'Source', 'Image', or 'Config' (consider dropping if redundant with templateRef.kind)
    status: # enum 'True', 'False', or 'Unknown' (based on metav1.ConditionStatus)
    observedGeneration: # int64 (alternate to resourceVersion)
    lastTransitionTime: # metav1.Time
    reason: # string
    message: # string

scothis avatar Feb 02 '22 14:02 scothis

interesting @scothis, trying to make some sense of it making a little more concrete with an example supplychain (source-provider -> image-builder):

status:
  resources:
    - name: source-provider
      ref:
        kind: GitRepository ...
      templateRef:
        kind: ClusterSourceTemplate ...
      inputs: []
      outputs:
        - name: ?
          value:  {url: git.io/foo/bar, ref: b4db33f}
          lastTransitionTime: 1
        - name: ?
          value:  {url: git.io/foo/bar, ref: c001b33f}
          lastTransitionTime: 2
      type: Source
      status: True
      lastTransitionTime: 2

    - name: image-builder
      ref:
        kind: Image
      templateRef:
        kind: ClusterImageTemplate
      inputs:
        - name: source-provider
          value: {url: git.io/foo/bar, ref: b4db33f
        - name: source-provider
          value: {url: git.io/foo/bar, ref: c001b33f
      outputs:
        - name: ?                                             # (!!)
          value:  {image: registry.io/foo@sha256:010101}
          lastTransitionTime: 3
      type: Image
      status: True
      lastTransitionTime: 3

in the example above, just looking at status.resources.image-builder.outputs one wouldn't be able to tell if that output came from the first input (b4db33f or the second c001b33f as the outputs are not linked to the inputs) - am I thinking about this right?

thx!

cirocosta avatar Feb 02 '22 16:02 cirocosta

@cirocosta inputs/outputs were not intended to be a historical record of how a value changed over time, but to capture the "current" output/input for a resource. The definition of current will likely evolve based on other proposals.

So what you sketched as:

      outputs:
        - name: ?
          value:  {url: git.io/foo/bar, ref: b4db33f}
          lastTransitionTime: 1
        - name: ?
          value:  {url: git.io/foo/bar, ref: c001b33f}
          lastTransitionTime: 2

would actually be

      outputs:
        - name: url
          value:  git.io/foo/bar
          lastTransitionTime: 1
        - name: ref
          value:  c001b33f
          lastTransitionTime: 2

Note: the transition times are different because the url didn't change, but the ref did in your example.

scothis avatar Feb 02 '22 16:02 scothis

@scothis What you've proposed is achievable. But I wonder what is the motivation for status that is proposed here. The original user motivation (written originally in RFC 14) stated: "it is important to know when both inputs have fully traversed the supply chain." This proposal doesn't provide information necessary to answer that question from the workload status.

waciumawanjohi avatar Feb 02 '22 21:02 waciumawanjohi

Yes. As long as there is a ClusterTemplate we can capture information about what objects were created on the cluster and what inputs led to them. I'll add that to the RFC.

Why do we want this to be typed? Aren't these two stipulations enough?

  1. A ClusterTemplate's outputRef can refer to any node of any type (string, array, deeply nested object)
  2. A ClusterTemplate's output is never a valid input. For valid inputs, you must use one of the typed Template kinds.

squeedee avatar Feb 02 '22 23:02 squeedee

Why do we want this to be typed?

In the workshopped format, the type is the top level field. Shall we change that to:

status:
  artifacts:
    - # the sha256 of the ordered JSON of all other non-from fields
      id: <SHA:string>
      # optional: oneOf(source,image,config)
      source:
        # exposed fields - in this case url and revision
        uri: <:string>
        revision: <:string>
      # the object which produced this artifact
      resource:
        # name of the resource in the supply chain
        resource-name: <:string>
        # GVK of the resource
        kind: <:string>
        apiVersion: <:string>
        # name of the resource on the cluster
        name: <:string>
        # namespace of the resource
        namespace: <:string>
        # resource version of the object
        resourceVersion: <:string>
      from:
        - # id of any artifact(s) that were inputs to the template
          id: <SHA:string>

waciumawanjohi avatar Feb 07 '22 23:02 waciumawanjohi