Introduce RFC 18 Workload Report Artifact Provenance
Changes proposed by this PR
Opens RFC 18
Release Note
PR Checklist
Note: Please do not remove items. Mark items as done [x] or use ~strikethrough~ if you believe they are not relevant.

~- [ ] Linked to a relevant issue. Eg: Fixes #123 or Updates #123~
~- [ ] Removed non-atomic or wip commits~
~- [ ] Filled in the Release Note section above~
~- [ ] Modified the docs to match changes~
Copying over my comments from RFC14 that are still relevant:
In the office hours this week (Jan 10th, 2022), we talked about how we could extend this type of artifact tracing to track artifacts across clusters.
Assuming you started with a SupplyChain that looked something like the following (from the git-writer example):
[source-provider] -> [image-builder] -> [config] -> [git-writer]
If you then made a small change and wrapped the git-writer in the SupplyChain with a ClusterSourceTemplate, it would let you emit the revision/url of the committed config. By doing this you would end up with an artifact representing the commit to the config-repo as the last item in the status of your workload.
So your workload status might end up looking like this:
```yaml
artifacts:
- source:
    id: 000
    url: https://github.com/my-org/my-source
    revision: abc123
  passed:
  - resource-name: source-provider
    apiVersion: source.toolkit.fluxcd.io/v1beta1
    kind: GitRepository
- image:
    id: 111
    image: gcri.io/asdfsdfasdf
  passed:
  - resource-name: image-builder
    apiVersion: kpack.io/v1alpha2
    kind: Image
  from:
  - id: 000
- config:
    id: 222
    config: shasum(of-some-config)
  passed:
  - resource-name: my-config
    apiVersion: v1
    kind: ConfigMap
  from:
  - id: 111
- source:
    id: 333
    url: https://github.com/my-org/my-config
    revision: xyz789
  passed:
  - resource-name: config-writer
    apiVersion: carto.run/v1alpha1
    kind: Runnable
  from:
  - id: 222
```
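The from links above form a chain from the original commit down to the written config. As a minimal sketch (field names follow the example above; everything else is hypothetical, not Cartographer code), recovering an artifact's lineage could look like:

```python
def lineage(artifacts, artifact_id):
    """Return artifact ids from the earliest ancestor down to artifact_id."""
    # index each artifact by the id nested under its oneOf(source,image,config) body
    by_id = {}
    for a in artifacts:
        body = next(v for k, v in a.items() if k in ("source", "image", "config"))
        by_id[body["id"]] = a
    # walk the `from` links back to the root
    chain = [artifact_id]
    node = by_id[artifact_id]
    while node.get("from"):
        parent = node["from"][0]["id"]  # assumes one parent per step
        chain.append(parent)
        node = by_id[parent]
    return list(reversed(chain))
```

For the workload status above, lineage(artifacts, "333") would walk 333 -> 222 -> 111 -> 000 and return the ids in promotion order.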
Now let's assume you are following the basic-delivery example on another cluster.
You would end up with a Deliverable with the following status:
```yaml
artifacts:
- source:
    id: 333
    url: https://github.com/my-org/my-config
    revision: xyz789
  passed:
  - resource-name: config-provider
    apiVersion: source.toolkit.fluxcd.io/v1beta1
    kind: GitRepository
- source: # some other artifact
  ...
```
You can see how we could correlate these two completely independent resources, based solely on their artifacts.
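As a rough sketch of that correlation, assuming the artifact shapes from the two examples above (nothing here is Cartographer code), matching could be as simple as intersecting the artifact ids reported by each resource:

```python
def correlate(workload_artifacts, deliverable_artifacts):
    """Return artifact ids that appear in both statuses, linking the two resources."""
    def ids(artifacts):
        found = set()
        for a in artifacts:
            for key in ("source", "image", "config"):
                if key in a:
                    found.add(a[key]["id"])
        return found
    return ids(workload_artifacts) & ids(deliverable_artifacts)
```

In the examples above, the workload and the deliverable would share artifact id 333, tying the config commit on the dev cluster to the source consumed by the delivery cluster.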
Copying over my comments from RFC14 that are still relevant:
@rawlingsj made a really insightful comment in another unrelated thread that we should consider here:
Customers will likely not be willing to give access from a dev cluster (and developer users) to others such as staging / production. Instead it will probably be easier to send events out of those locked down clusters to a centralized place for users to get visibility into their environments.
My comment here definitely assumes read access to all clusters. So maybe we should start thinking about writing out artifacts as k8s events at the same time as writing them to the workload/deliverable status. That way if someone is already emitting k8s events from their prod clusters, we can integrate into their existing tooling.
But you made a small change and wrapped the git-writer in the SupplyChain with a ClusterSourceTemplate
@jwntrs I think this thinking is solid and should be enforced: all steps, no matter what, produce an artifact. Even in the case of a kpack deploy, we should have a kpack deploy ID or something of that nature as an artifact. That way, every Supply Chain Resource has representation in the artifacts list.
@scothis
I'm in favor of capturing a reference to the resources created for a Workload on the status, but I'm skeptical that capturing the output values will give users the information they think they are getting. There's little point in merely duplicating state that could instead be looked up on the child resources directly unless the aggregation provides some value. It's not clear what questions you're trying to answer by providing this metadata.
The problem that we are trying to address is to make it easy for the user to identify the artifact graph that gets produced by a supply chain (commit-123 produces image-abc). If I was building a tool on top of Cartographer to show this information, and all I had was the k8s object graph, I would need to reimplement a lot of logic that already exists in Cartographer. For instance, you would need to cross-reference each template to know what fields to show from each resource. Cartographer already has this information, so we want to write it out somewhere so it can be consumed more easily.
The problem that we are trying to address is to make it easy for the user to identify the artifact graph that gets produced by a supply chain (commit-123 produces image-abc)
You'll never actually know that "commit-123 produces image-abc" unless you also have a deep understanding of how the resource being templated behaves and also have knowledge of how the template is implemented. At best you can know that a new commit (commit-123) triggered a process that emitted an image (image-abc).
Some open questions that I've heard raised:
- Should workload artifacts be wiped out when spec changes? If yes, can we fast forward through the supply chain matching known inputs to outputs and rebuild the artifact tree if nothing has changed?
- Should we write pending artifacts (e.g. waiting on observedGeneration: 5) and move on? If we don't we might starve the supply chain if we're blocking until a specific observedGeneration appears on the status.
- Should we write an empty artifact (only has an id) when using ClusterTemplate?
Should workload artifacts be wiped out when spec changes? If yes, can we fast forward through the supply chain matching known inputs to outputs and rebuild the artifact tree if nothing has changed?
No. A resource should reflect the current known state of the world on its status to the best of its ability. There will inherently be latency that's ok and expected. The status should indicate the recency of the status' content. (Generation/ObservedGeneration, LastTransitionTime, etc)
Should we write pending artifacts (e.g. waiting on observedGeneration: 5) and move on? If we don't we might starve the supply chain if we're blocking until a specific observedGeneration appears on the status.
A controller should almost never wait for an async condition to become true. The general pattern is to reflect the current state and watch "interesting" resources for changes. When one of the watched resources changes, reprocess the whole control loop and pick up the latest "current" values.
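A tiny illustrative sketch of that level-triggered pattern (all names are hypothetical, not Cartographer APIs): each reconcile pass records whatever is currently observable and returns, trusting a watch to re-queue the resource when anything changes, rather than blocking until a desired observedGeneration appears:

```python
def reconcile(resource):
    """One non-blocking pass of the control loop."""
    status = {"observedGeneration": resource["generation"]}
    if resource.get("statusGeneration") == resource["generation"]:
        # the stamped resource's status has caught up: its output is current
        status["artifact"] = resource.get("output")
    else:
        # status is still catching up: record the pending state and move on;
        # a watch event will trigger the next pass when the status updates
        status["artifact"] = None
    return status
```

The point of the sketch is the absence of any polling or sleeping: the pending branch writes what it knows and exits.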
I think perhaps this thinking is solid, and should be enforced, that all steps, no matter what, produce an artifact.
I would support such an RFC. @squeedee Do you think this should be a separate RFC, or folded into this one?
I'm going to fold it into this one!
The problem that we are trying to address is make it easy for the user to be able to identify the artifact graph that gets produced by a supply chain (commit-123 produces image-abc)
You'll never actually know that "commit-123 produces image-abc" unless you also have a deep understanding of how the resource being templated behaves and also have knowledge of the how the template is implemented. At best you can know that a new commit (commit-123) triggered a process that emitted an image (image-abc).
@scothis @jwntrs This is an excellent discussion to move to RFC 20, our attempt to ensure that we can tie output N to input M.
@waciumawanjohi any thoughts on this point?
Should we write an empty artifact (only has an id) when using ClusterTemplate?
Should workload artifacts be wiped out when spec changes? If yes, can we fast forward through the supply chain matching known inputs to outputs and rebuild the artifact tree if nothing has changed?
No. A resource should reflect the current known state of the world on its status to the best of its ability. There will inherently be latency that's ok and expected. The status should indicate the recency of the status' content. (Generation/ObservedGeneration, LastTransitionTime, etc)
Cartographer's current behavior is to continue promoting artifacts that were stamped with a previous workload spec. As long as that is the behavior, we should not wipe out artifacts from the workload status simply because the workload spec changes.
What would be the implication of no longer promoting artifacts built from previous workloads? Let's consider an example: a workload paired with a simple source (flux gitrepo) -> image (kpack) -> deploy (knative) supply chain. The workload has been doing work and there are valid artifacts at each stage. The user then updates the .spec.build.env field on the workload. What are the implications if we don't want old artifacts to be promoted?
- The source step is unaffected. Cartographer creates a proposed stamp of the object, sees that the spec is the same, and there's a no-op. Cartographer notes the output of the clusterSourceTemplate and moves to the next resource.
- For the imageTemplate, there is a change. Cartographer creates the proposed stamp, sees that the spec is different from what is on the cluster, and submits the new object definition. But this is a kpack image. It continues to expose latestImage. But "we don't want old artifacts to be promoted". Cartographer is unaware that this is an "old" artifact, or that it is connected to a previous workload spec! It would pick up this output and promote it to Knative.
To handle this scenario we would need to do one of two things.
- When the workload spec updates, delete the stamped objects before updating them.
- Wait for objects to be in a successful state before promoting them. This begins to touch on RFC 20.
What would we gain by adopting one of these strategies? Well, one thing that isn't gained is determinism about what is deployed to prod. The old workload spec would have delivered an arbitrary number of revisions through the supply chain. What we gain is clarity about which of those revisions were driven by the old workload spec and which were driven by the new workload spec.
There is a simpler way to achieve that goal: capture the workload version as a field in the from field of each artifact. Since the workload respects generation and observedGeneration (and we will presumably control the persistence layer when we take on historical record keeping), we should leverage those fields. This will protect us from the changes to the workload's resourceVersion that occur every time we reconcile the workload.
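For illustration, here is one shape a from entry could take if it also captured the workload generation; the workloadGeneration field name is hypothetical, not part of the RFC:

```yaml
from:
- # id of the input artifact
  id: <SHA:string>
  # hypothetical field: generation of the Workload spec in effect
  # when this input was promoted
  workloadGeneration: 4
```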
@jwntrs @scothis
Should we write an empty artifact (only has an id) when using ClusterTemplate
Yes. As long as there is a ClusterTemplate we can capture information about what objects were created on the cluster and what inputs led to them. I'll add that to the RFC.
Should we write an empty artifact (only has an id) when using ClusterTemplate
Yes. As long as there is a ClusterTemplate we can capture information about what objects were created on the cluster and what inputs led to them. I'll add that to the RFC.
The RFC specifies that artifacts are oneOf(source,image,config) and it's not clear what to call the things created by ClusterTemplate. I'm going with object, and I'm very open to better suggestions.
To handle this scenario we would need to do one of two things... Wait for objects to be in a successful state before promoting them. This begins to touch on RFC 20.
RFC 20 now proposes exactly this. Assuming RFC 20 is accepted, we could then either wipe the artifact field for a workload when the spec is updated or note the generation of the workload in the from field of each artifact. From an information point of view they would be equivalent. There may be implications for persistence, but without a clearer picture of what we may do there I don't have a preference between these strategies.
Borrowing from the intent behind this RFC and 20, I'd like to propose an alternate status shape for Workloads. Not all of the fields are expected to be populated immediately, but it gives a direction to grow towards while leaving room for additions over time. Other changes to the Cartographer model will be required to surface some of these fields.
Key differences:
- focus on resources rather than artifacts. Resources are the nodes of the graph, the artifacts are an output of a node along an edge to another node.
- normalize the object type; the variants are so similar that the previous oneOf adds more complexity than clarity. Remember that users are not writing these types, they are reading them.
- keeps extraneous fields out of object references, separates references to template and stamped resources
- resources are linked by resource names rather than by values that happen to be the same
```yaml
...
status:
  observedGeneration: # int64
  conditions: # []metav1.Condition
  supplyChainRef:
    apiVersion: # string
    kind: # string
    namespace: # string (empty for cluster scope)
    name: # string
  resources:
  - name: # string (resource name as defined by the supply chain)
    ref: # corev1.ObjectReference (stamped resource)
      apiVersion: # string
      kind: # string
      namespace: # string (empty for cluster scope)
      name: # string
    templateRef: # corev1.ObjectReference (template resource)
      apiVersion: # string
      kind: # string
      namespace: # string (empty for cluster scope)
      name: # string
    inputs: # (I'm not sure how practical this is today given Cartographer's loose relationship between templates)
    - name: # string (resource name)
      value: # optional json.RawExtension
    outputs:
    - name: # string
      value: # json.RawExtension (may be too large to reasonably include)
      lastTransitionTime: # metav1.Time
    # these fields are modeled after metav1.Condition
    type: # enum 'Source', 'Image', or 'Config' (consider dropping if redundant with templateRef.kind)
    status: # enum 'True', 'False', or 'Unknown' (based on metav1.ConditionStatus)
    observedGeneration: # int64 (alternate to resourceVersion)
    lastTransitionTime: # metav1.Time
    reason: # string
    message: # string
```
Interesting @scothis, trying to make some sense of it by making it a little more concrete with an example supply chain (source-provider -> image-builder):
```yaml
status:
  resources:
  - name: source-provider
    ref:
      kind: GitRepository # ...
    templateRef:
      kind: ClusterSourceTemplate # ...
    inputs: []
    outputs:
    - name: ?
      value: {url: git.io/foo/bar, ref: b4db33f}
      lastTransitionTime: 1
    - name: ?
      value: {url: git.io/foo/bar, ref: c001b33f}
      lastTransitionTime: 2
    type: Source
    status: True
    lastTransitionTime: 2
  - name: image-builder
    ref:
      kind: Image
    templateRef:
      kind: ClusterImageTemplate
    inputs:
    - name: source-provider
      value: {url: git.io/foo/bar, ref: b4db33f}
    - name: source-provider
      value: {url: git.io/foo/bar, ref: c001b33f}
    outputs:
    - name: ? # (!!)
      value: {image: registry.io/foo@sha256:010101}
      lastTransitionTime: 3
    type: Image
    status: True
    lastTransitionTime: 3
```
In the example above, just looking at status.resources.image-builder.outputs one wouldn't be able to tell if that output came from the first input (b4db33f) or the second (c001b33f), as the outputs are not linked to the inputs - am I thinking about this right?
thx!
@cirocosta inputs/outputs were not intended to be a historical record of how a value changed over time, but to capture the "current" output/input for a resource. The definition of current will likely evolve based on other proposals.
So what you sketched as:
```yaml
outputs:
- name: ?
  value: {url: git.io/foo/bar, ref: b4db33f}
  lastTransitionTime: 1
- name: ?
  value: {url: git.io/foo/bar, ref: c001b33f}
  lastTransitionTime: 2
```
would actually be
```yaml
outputs:
- name: url
  value: git.io/foo/bar
  lastTransitionTime: 1
- name: ref
  value: c001b33f
  lastTransitionTime: 2
```
Note: the transition times are different because the url didn't change, but the ref did in your example.
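A sketch of the per-output bookkeeping that note implies (illustrative names, not Cartographer's implementation): when merging newly observed values into the outputs list, an output keeps its old lastTransitionTime unless its value actually changed:

```python
def update_outputs(previous, observed, now):
    """Merge observed output values into the previous list of outputs."""
    prev_by_name = {o["name"]: o for o in previous}
    merged = []
    for name, value in observed.items():
        old = prev_by_name.get(name)
        if old is not None and old["value"] == value:
            merged.append(old)  # unchanged value: keep the old transition time
        else:
            merged.append({"name": name, "value": value, "lastTransitionTime": now})
    return merged
```

Replaying the example above, the url output would retain transition time 1 while the ref output would advance to time 2.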
@scothis What you've proposed is achievable. But I wonder what the motivation is for the status proposed here. The original user motivation (written originally in RFC 14) stated: "it is important to know when both inputs have fully traversed the supply chain." This proposal doesn't provide the information necessary to answer that question from the workload status.
Yes. As long as there is a ClusterTemplate we can capture information about what objects were created on the cluster and what inputs led to them. I'll add that to the RFC.
Why do we want this to be typed? Aren't these two stipulations enough?
- A ClusterTemplate's outputRef can refer to any node of any type (string, array, deeply nested object)
- A ClusterTemplate's output is never a valid input. For valid inputs, you must use one of the typed Template kinds.
Why do we want this to be typed?
In the workshopped format, the type is the top level field. Shall we change that to:
```yaml
status:
  artifacts:
  - # the sha256 of the ordered JSON of all other non-from fields
    id: <SHA:string>
    # optional: oneOf(source,image,config)
    source:
      # exposed fields - in this case url and revision
      uri: <:string>
      revision: <:string>
    # the object which produced this artifact
    resource:
      # name of the resource in the supply chain
      resource-name: <:string>
      # GVK of the resource
      kind: <:string>
      apiVersion: <:string>
      # name of the resource on the cluster
      name: <:string>
      # namespace of the resource
      namespace: <:string>
      # resource version of the object
      resourceVersion: <:string>
    from:
    - # id of any artifact(s) that were inputs to the template
      id: <SHA:string>
```
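One possible reading of the id rule above, sketched in Python (this is an interpretation of "the sha256 of the ordered JSON of all other non-from fields", not Cartographer's actual implementation): serialize every field except from as key-sorted JSON and hash it.

```python
import hashlib
import json

def artifact_id(artifact):
    """sha256 over the key-sorted JSON of every field except `from`."""
    subject = {k: v for k, v in artifact.items() if k != "from"}
    # sort_keys gives a deterministic "ordered JSON" regardless of insertion order
    canonical = json.dumps(subject, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Under this reading, two artifacts with the same exposed fields get the same id even if their from lists or field ordering differ, which is what makes the id usable for cross-resource correlation.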