Apply component workflow fails for a component of schematic type terraform
We have a custom component of schematic type terraform which creates a DynamoDB table. We create tables by applying Applications that use this component, with the apply-component workflow step.
After a few seconds the workflow fails with the following error:
run step(provider=oam,do=component-apply): CollectHealthStatus: app=test, comp=<redacted>, check health error: no matches for kind "Configuration" in version "terraform.core.oam.dev/v1beta2"
This error is misleading, however, since the workflow is creating the Configuration object, running the apply job successfully, and creating the DynamoDB table as expected. The Configuration object also reaches the ready state after some time.
My guess is that the workflow step waits for a while to see if the Configuration object is in a ready state and fails if it is not. Since the apply job takes a while to complete, the Configuration does not become ready within the duration expected by the workflow timeout.
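If the timeout hypothesis holds, one workaround worth trying is a longer step-level timeout so the health check has time for the apply job to finish. A sketch of the idea, assuming the step-level `timeout` field is honored by `apply-component` on this version (the `15m` value is illustrative):

```yaml
workflow:
  steps:
    - name: create-dynamodb
      type: apply-component
      timeout: 15m   # assumption: gives the Terraform apply job time to finish
      properties:
        component: test
```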
KubeVela version: 1.9.6
Component definition:
```yaml
apiVersion: core.oam.dev/v1beta1
kind: ComponentDefinition
metadata:
  annotations:
    definition.oam.dev/description: Terraform module which creates DynamoDB table on AWS
  creationTimestamp: null
  labels:
    type: terraform-aws
  name: tf-aws-dynamodb-table
  namespace: vela-system
spec:
  schematic:
    terraform:
      configuration: https://github.com/Guidewire/terraform-aws-dynamodb-table.git
      providerRef:
        name: aws
        namespace: default
      type: remote
  workload:
    definition:
      apiVersion: terraform.core.oam.dev/v1beta1
      kind: Configuration
status: {}
```
Sample application:
```yaml
apiVersion: core.oam.dev/v1beta1
kind: Application
metadata:
  name: test
spec:
  components:
    - name: test
      type: tf-aws-dynamodb-table
      properties:
        name: "test"
        hash_key: "id"
        ttl_enabled: true
        ttl_attribute_name: "ts"
        autoscaling_enabled: true
        stream_enabled: true
        stream_view_type: "NEW_AND_OLD_IMAGES"
        attributes:
          - name: "id"
            type: "N"
        replica_regions:
          - region_name: us-east-1
          - region_name: us-west-1
        tags:
          Key: "Val"
  policies:
    - name: apply-once
      type: apply-once
      properties:
        enable: true
  workflow:
    steps:
      - name: create-dynamodb
        type: apply-component
        properties:
          component: test
```
@shreyasHpandya: can you look at the logs of the KubeVela pods in the vela-system namespace? Specifically any jobs that are spun up to apply the Terraform?
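For reference, the logs and jobs can be pulled along these lines (the deployment and resource names are illustrative and may differ per install):

```
kubectl -n vela-system logs deploy/kubevela-vela-core --tail=200
kubectl -n vela-system get jobs        # any Terraform apply jobs
kubectl get configurations.terraform.core.oam.dev -A
```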
My initial investigation leads me to believe the issue is in the apply-component workflow step.
As noted in the issue, the error is misleading: the Configuration object is created, the apply job runs successfully, and the DynamoDB table is created as expected; the Configuration object also reaches the ready state after some time. My guess is that the workflow step waits to see if the Configuration object is ready and fails when it is not, because the apply job takes longer than the duration the workflow expects.
I will look into the KubeVela code. @chivalryq, can you point me to where in the code I should look for this issue?
It's uncommon to see `no matches for kind "Configuration" in version "terraform.core.oam.dev/v1beta2"` reported, because this GVK has been registered with the underlying client in vela-core. I'm sure the GVK `terraform.core.oam.dev/v1beta2.Configuration` has been registered.
The error is reported in this function:
https://github.com/kubevela/kubevela/blob/6cbc12f9bb4d1059f1dd439ebbc29dafe7190da1/pkg/controller/core.oam.dev/v1beta1/application/apply.go#L262
Now this client is set here: https://github.com/kubevela/kubevela/blob/6cbc12f9bb4d1059f1dd439ebbc29dafe7190da1/pkg/controller/core.oam.dev/v1beta1/application/apply.go#L80
It's the reconciler's client, and the reconciler's client comes from the manager. The manager's client initialization process:
https://github.com/kubevela/kubevela/blob/6cbc12f9bb4d1059f1dd439ebbc29dafe7190da1/cmd/core/app/server.go#L135-L156
The Scheme comes from common.Scheme. Here it is: we can see terraform v1beta2 has been registered.
https://github.com/kubevela/kubevela/blob/6cbc12f9bb4d1059f1dd439ebbc29dafe7190da1/pkg/utils/common/common.go#L73-L97
@shreyasHpandya You can start from the trace above to check why the client can't recognize the GVK even though it has been registered. The client or Scheme could be replaced somewhere, or the client re-assigned.
Hello @chivalryq ,
Thanks for the tip. It was very helpful. Debug setup:
- Create a local cluster with `k3d cluster create`.
- Run `make core-install` and `make def-install` from the KubeVela repo.
- Start the dlv debugger at `cmd/core/main.go`.
My understanding of the controller-runtime client Scheme: the `runtime.Scheme` maintains an in-memory mapping of GVKs to Go types and vice versa, as well as maps of known GVs and any unstructured types. The client consults the RESTMapper to figure out the kube API endpoints, and at a high level the Scheme acts as an intermediary between the client and the kube-apiserver by maintaining the current mapping of all available GVKs. Sort of. I am still unclear on who is responsible for actually creating new CRDs from imported controllers in Kubernetes, and when this should happen. For example, in https://github.com/kubevela/kubevela/blob/6cbc12f9bb4d1059f1dd439ebbc29dafe7190da1/pkg/utils/common/common.go#L73-L97, when the terraform APIs are imported and the terraform-controller `init()` is called, should we expect the GVK to be visible via kubectl? If not, when? Time delay doesn't look to be a factor.
https://github.com/kubevela/terraform-controller/blob/966471af19a07ffe94159e231899a0983a71c188/api/v1beta2/configuration_types.go#L191-L193
Observations:
- Once vela-core is executed, controllers for all built-in GVKs come up. The manager's Scheme has registered both of the terraform GVs, `v1beta1` and `v1beta2`. The Configuration CRD is not yet applied and is not visible via kubectl.
- Apply the ComponentDefinition and Application mentioned above in the original report. The `Reconciler`'s client and Scheme are both the same as the manager's Scheme. The Scheme includes:
  - Both `terraform.core.oam.dev/v1beta1` and `terraform.core.oam.dev/v1beta2` in `observedVersions`.
  - At least the `terraform.core.oam.dev/v1beta1` `Configuration` kind mapping in the `gvkToType` and `typeToGVK` maps. The `terraform.core.oam.dev/v1beta2` GV also has some kinds listed. Both `gvkToType` and `typeToGVK` have hundreds of entries, so I might have missed the `Configuration` for `terraform.core.oam.dev/v1beta2`.
- The reconciler seems to parse and properly generate the appfile. `NoKindMatchError` for `Configuration` is thrown at https://github.com/kubevela/kubevela/blob/1a001e5b29da766f8272a5f3f99b215c3fb13a7a/pkg/utils/apply/apply.go#L286-L312
- The workflow fails with the error `run step(provider=oam,do=component-apply): Dispatch: pre-dispatch dryrun failed: Found 1 errors. [(cannot get object: no matches for kind "Configuration" in version "terraform.core.oam.dev/v1beta1")]`. The error is slightly different if pre-dispatch checks are disabled.
- In my local setup, the flow never seems to reach https://github.com/kubevela/kubevela/blob/6cbc12f9bb4d1059f1dd439ebbc29dafe7190da1/pkg/controller/core.oam.dev/v1beta1/application/apply.go#L262
- As far as I can see, the Scheme seems to be consistent throughout; we don't seem to be interfering with its state anywhere. Not sure why this is intermittent in our Prod deployments.

Any advice would be great. Thanks in advance.
On further investigation, the `*runtime.Scheme` doesn't seem to have anything to do with installing the terraform-controller CRDs. We will try ensuring enough of a time delay between installing the CRDs and applying a terraform-schematic Application.
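If the registration-ordering theory is right, an alternative to a fixed delay is to make sure the CRDs exist before vela-core starts, or to restart vela-core once they do. A sketch of the ordering workaround, with illustrative paths and names (the CRD manifests ship with the terraform-controller chart in its repo, but the exact layout may differ per version):

```
# assumption: CRD manifests are under the terraform-controller chart; paths illustrative
kubectl apply -f terraform-controller/chart/crds/
# restarting vela-core rebuilds its view of the API once the CRDs exist
kubectl -n vela-system rollout restart deploy/kubevela-vela-core
```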
@chivalryq : do you have any ideas on this?
We were able to test and tentatively resolve this by ensuring that the terraform-controller CRDs are installed before vela-core boots up. Our best guess is that this is a controller-runtime issue. There are existing issues around the controller-runtime cache when CRDs are not installed:
https://github.com/kubernetes-sigs/controller-runtime/issues/2456
https://github.com/kubernetes-sigs/controller-runtime/issues/2589
https://github.com/kubernetes-sigs/controller-runtime/issues/1759
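The suspected failure mode can be sketched with a minimal conceptual model. These are NOT the real controller-runtime types; the sketch only illustrates why a type registered in the Scheme can still fail resolution when the CRD is installed after the discovery data is cached:

```go
package main

import "fmt"

// Conceptual model: the Scheme is an in-memory registry of Go types, filled at
// process start by AddToScheme calls, while the RESTMapper caches what the API
// server's discovery endpoint reported when vela-core booted. If the
// Configuration CRD is installed *after* that snapshot, the Scheme knows the
// GVK but the stale discovery cache does not, producing "no matches for kind".

type gvk struct{ group, version, kind string }

type scheme map[gvk]bool     // compiled-in type registrations (AddToScheme)
type restMapper map[gvk]bool // snapshot of server-side discovery data

func lookup(s scheme, m restMapper, k gvk) string {
	if !s[k] {
		return "kind not registered in scheme"
	}
	if !m[k] {
		return fmt.Sprintf("no matches for kind %q in version %s/%s",
			k.kind, k.group, k.version)
	}
	return "ok"
}

func main() {
	cfg := gvk{"terraform.core.oam.dev", "v1beta2", "Configuration"}
	s := scheme{cfg: true} // AddToScheme ran at init(): the type is known

	// Discovery snapshot taken before the CRD was applied to the cluster.
	fmt.Println(lookup(s, restMapper{}, cfg))
	// After the CRD is installed and discovery data is refreshed.
	fmt.Println(lookup(s, restMapper{cfg: true}, cfg))
}
```

In this model, restarting vela-core after the CRDs exist refreshes the discovery snapshot, which is consistent with the install-ordering workaround above.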
Additional log trace for anyone else running into this:
```
apply.go:412] "[Finished]: i-lrlc854n.apply-policies(finish apply policies)" application="default/test-bin-repo-app" controller="application" resource_version="160683" generation=1 publish_version="alpha1" duration="1.63µs" spanID="i-lrlc854n.apply-policies"
I0304 15:41:10.001743 1 generator.go:76] "[Finished]: i-lrlc854n.generate-task-runners(finish generate task runners)" application="default/test-bin-repo-app" controller="application" resource_version="160683" generation=1 publish_version="alpha1" duration="116.981µs" spanID="i-lrlc854n.generate-task-runners"
I0304 15:41:10.012291 1 assemble.go:69] "Successfully assemble a workload" workload="default/test-bin-repo-app" APIVersion="terraform.core.oam.dev/v1beta2" Kind="Configuration"
I0304 15:41:10.020192 1 apply.go:126] "skip update" name="test-bin-repo-app" resource="terraform.core.oam.dev/v1beta2, Kind=Configuration"
I0304 15:41:10.046566 1 apply.go:126] "skip update" name="test-bin-repo-app" resource="terraform.core.oam.dev/v1beta2, Kind=Configuration"
E0304 15:41:10.046801 1 task.go:252] "do steps" err="run step(provider=oam,do=component-apply): CollectHealthStatus: app=test-bin-repo-app, comp=test-bin-repo-app, check health error: no matches for kind \"Configuration\" in version \"terraform.core.oam.dev/v1beta2\"" application="default/test-bin-repo-app" controller="application" resource_version="160683" generation=1 publish_version="alpha1" step_name="test-bin-repo-app" step_type="builtin-apply-component" spanID="i-lrlc854n.execute application workflow.efrta1kpup"
```