Armada scheduler crashes due to zeroed job run UUID
The Armada scheduler is in a crash loop. A zeroed-out job run UUID appears to be getting created or deserialized somewhere. I haven't looked into the source code yet, but my understanding is that the job run UUID is generated by Armada itself.
```
DEBU[2024-05-22T18:45:37.423Z]wrapper_logrus.go:63 Write data: 13 local_addr="10.14.6.68:59620" remote_addr="pulsar://pulsar-broker-0.pulsar-broker.armada.svc.cluster.local:6650"
panic: uuid: Parse(): invalid UUID length: 0
goroutine 258 [running]:
github.com/google/uuid.MustParse({0x0, 0x0})
/home/runner/go/pkg/mod/github.com/google/[email protected]/uuid.go:169 +0xa5
github.com/armadaproject/armada/internal/scheduler.(*MetricsCollector).updateClusterMetrics(0xc000f96f60, 0xc00051a300?)
/home/runner/work/armada/armada/internal/scheduler/metrics.go:273 +0x1926
github.com/armadaproject/armada/internal/scheduler.(*MetricsCollector).refresh(0xc000f96f60, 0xc00051a300)
/home/runner/work/armada/armada/internal/scheduler/metrics.go:121 +0xb3
github.com/armadaproject/armada/internal/scheduler.(*MetricsCollector).Run(0xc000f96f60, 0xc00051a300)
/home/runner/work/armada/armada/internal/scheduler/metrics.go:89 +0x14a
github.com/armadaproject/armada/internal/scheduler.Run.func9()
/home/runner/work/armada/armada/internal/scheduler/schedulerapp.go:353 +0x25
golang.org/x/sync/errgroup.(*Group).Go.func1()
/home/runner/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
/home/runner/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0xa5
```
Running on Armada 0.4.48.
The Postgres DB shows no runs with a zeroed UUID:
```
postgres=# SELECT * from runs where run_id = '00000000-0000-0000-0000-000000000000';
(0 rows)
```
```
postgres=# SELECT * from runs where run_id = NULL;
(0 rows)
```

(Note that `run_id = NULL` always evaluates to NULL and matches no rows in SQL; `run_id IS NULL` is the correct predicate, though the first query already covers the zeroed-UUID case.)
So at this point I wonder if there is a decompression/deserialization issue with the node data model.
To provide an update: we discovered this issue was caused by running a v3 and a v4 executor in the same cluster. V4 executors assume that job run ID labels will be present on pods, but those labels may be missing from pods started by v3 executors.
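A defensive fix on the executor side is to treat a missing or empty run-ID label as "no run ID" rather than passing `""` on to `uuid.MustParse`. A minimal sketch of that guard, where the label key and helper name are assumptions for illustration, not Armada's actual identifiers:

```go
package main

import "fmt"

// runIDFromLabels extracts a job run ID from pod labels, tolerating pods
// created by older (v3) executors that never set the label.
// The label key below is a placeholder, not Armada's real label key.
func runIDFromLabels(labels map[string]string) (string, bool) {
	id, ok := labels["armada/job-run-id"]
	if !ok || id == "" {
		return "", false // skip this pod instead of parsing ""
	}
	return id, true
}

func main() {
	v4Pod := map[string]string{"armada/job-run-id": "8c2d6d9b-1234-4abc-8def-000000000001"}
	v3Pod := map[string]string{} // v3 executors may not set the label at all

	for _, labels := range []map[string]string{v4Pod, v3Pod} {
		if id, ok := runIDFromLabels(labels); ok {
			fmt.Println("run id:", id)
		} else {
			fmt.Println("no run-id label; skipping pod")
		}
	}
}
```

With a guard like this, mixed v3/v4 clusters degrade to skipping unlabeled pods instead of crash-looping the scheduler.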