metaflow
metaflow copied to clipboard
Long argo-events sensor names can cause sensor deployments to fail
Currently, sensor names are created by replacing . with - in the argo-workflow names.
# Register sensor. Unfortunately, Argo Events Sensor names don't allow for
# dots (sensors run into an error) which rules out self.name :(
# Metaflow will overwrite any existing sensor.
sensor_name = self.name.replace(".", "-")
if self._sensor:
argo_client.register_sensor(sensor_name, self._sensor.to_json())
Subsequently, argo-events creates a deployment for each sensor. This deployment's name is created by adding a suffix to the sensor name.
If the sensor name is <= 63 characters, things seem to work fine. However, if the sensor name > 63 characters, the deployment object fails to get created:
{"level":"error","ts":1697213357.2393486,"logger":"argo-events.eventbus-controller","caller":"sensor/controller.go:74","msg":"reconcile error","namespace":"jobs-default","sensor":"abcreditscorev4d4mx-test-pqrsmnbv44testrun-candidatemodelevaluation","error":"Deployment.apps \"abcreditscorev4d4mx-test-pqrsmnbv44testrun-candidatemodelemmc5m\" is invalid: [metadata.labels: Invalid value: \"nbcreditscorev4d4mx-test-annamnbv44testrun-candidatemodelevaluation\": must be no more than 63 characters, spec.selector.matchLabels: Invalid value: \"abcreditscorev4d4mx-test-pqrsmnbv44testrun-candidatemodelevaluation\": must be no more than 63 characters, spec.selector: Invalid value: v1.LabelSelector{MatchLabels:map[string]string{\"controller\":\"sensor-controller\", \"owner-name\":\"abcreditscorev4d4mx-test-pqrsmnbv44testrun-candidatemodelevaluation\", \"sensor-name\":\"abcreditscorev4d4mx-test-pqrsmnbv44testrun-candidatemodelevaluation\"}, MatchExpressions:[]v1.LabelSelectorRequirement(nil)}: invalid label selector]","stacktrace":"github.com/argoproj/argo-events/controllers/sensor.(*reconciler).Reconcile\n\t/home/runner/work/argo-events/argo-events/controllers/sensor/controller.go:74\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Reconcile\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:114\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).reconcileHandler\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:311\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).processNextWorkItem\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:266\nsigs.k8s.io/controller-runtime/pkg/internal/controller.(*Controller).Start.func2.2\n\t/home/runner/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227"}
This error shows up in the argo-events controller. Metaflow itself doesn't see this error and therefore the user feels like everything is working just fine. So, it's hard to find.
Is there a specific reason for the sensor names to match argo-workflow names? If not, we could simply name the sensor as s-[hash-of-argo-workflow-name] and add the workflow name as a label on the sensor object.
I think this is an efficient way to do this, since adding the workflowname as a label on to the object maintains the naming conventions for the sensor names and adds an additional layer of metadata for organization. Created a PR #1597 for this if you would like to check it out and provide some feedback--I'd appreciate it a lot!