
Armada scheduler crashes due to zero'd job run UUID

Open · Sovietaced opened this issue 1 year ago · 1 comment

The Armada scheduler is in a crash loop. There appears to be a zeroed-out job run UUID being created or deserialized somewhere. I haven't looked into the source code yet, but I'm under the impression that the job run UUID is generated by Armada.

```
DEBU[2024-05-22T18:45:37.423Z]wrapper_logrus.go:63 Write data: 13                                local_addr="10.14.6.68:59620" remote_addr="pulsar://pulsar-broker-0.pulsar-broker.armada.svc.cluster.local:6650"
panic: uuid: Parse(): invalid UUID length: 0

goroutine 258 [running]:
github.com/google/uuid.MustParse({0x0, 0x0})
	/home/runner/go/pkg/mod/github.com/google/[email protected]/uuid.go:169 +0xa5
github.com/armadaproject/armada/internal/scheduler.(*MetricsCollector).updateClusterMetrics(0xc000f96f60, 0xc00051a300?)
	/home/runner/work/armada/armada/internal/scheduler/metrics.go:273 +0x1926
github.com/armadaproject/armada/internal/scheduler.(*MetricsCollector).refresh(0xc000f96f60, 0xc00051a300)
	/home/runner/work/armada/armada/internal/scheduler/metrics.go:121 +0xb3
github.com/armadaproject/armada/internal/scheduler.(*MetricsCollector).Run(0xc000f96f60, 0xc00051a300)
	/home/runner/work/armada/armada/internal/scheduler/metrics.go:89 +0x14a
github.com/armadaproject/armada/internal/scheduler.Run.func9()
	/home/runner/work/armada/armada/internal/scheduler/schedulerapp.go:353 +0x25
golang.org/x/sync/errgroup.(*Group).Go.func1()
	/home/runner/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:78 +0x64
created by golang.org/x/sync/errgroup.(*Group).Go
	/home/runner/go/pkg/mod/golang.org/x/[email protected]/errgroup/errgroup.go:75 +0xa5
```

Running on Armada 0.4.48.

The Postgres DB shows zero runs with a zeroed UUID:

```
postgres=# SELECT * from runs where run_id = '00000000-0000-0000-0000-000000000000';
 run_id | job_id | created | job_set | executor | node | ...
--------+--------+---------+---------+----------+------+----
(0 rows)
```

```
postgres=# SELECT * from runs where run_id = NULL;
 run_id | job_id | created | job_set | executor | node | ...
--------+--------+---------+---------+----------+------+----
(0 rows)
```


So at this point I wonder if there is a decompression/deserialization issue with the node data model.

Sovietaced — May 22 '24 20:05

To provide an update: we discovered this issue was caused by running v3 and v4 executors in the same cluster. V4 executors assume that job run ID labels will be present on pods, but those labels may be missing from pods started by v3 executors.

Sovietaced — May 24 '24 20:05