harbor
harbor copied to clipboard
EXECUTION_SWEEP job queue is growing
Hello
We are seeing a problem with the growth of the job queue for the EXECUTION_SWEEP job type.
logs:
2024-04-09T22:29:53Z [INFO] [/pkg/task/sweep_job.go:91]: start to run sweep job
2024-04-09T22:29:54Z [INFO] [/pkg/task/sweep_job.go:150]: [IMAGE_SCAN] start to sweep, retain latest 1 executions
2024-04-10T00:54:09Z [INFO] [/pkg/task/sweep_job.go:180]: [IMAGE_SCAN] end to sweep, 0 executions were deleted in total, elapsed time: 2h24m14.915320345s
2024-04-10T00:54:09Z [INFO] [/pkg/task/sweep_job.go:160]: [IMAGE_SCAN] listed 0 candidate executions for sweep
2024-04-10T00:54:09Z [INFO] [/pkg/task/sweep_job.go:150]: [P2P_PREHEAT] start to sweep, retain latest 50 executions
2024-04-10T00:54:09Z [INFO] [/pkg/task/sweep_job.go:160]: [P2P_PREHEAT] listed 0 candidate executions for sweep
2024-04-10T00:54:09Z [INFO] [/pkg/task/sweep_job.go:180]: [P2P_PREHEAT] end to sweep, 0 executions were deleted in total, elapsed time: 3.373447ms
2024-04-10T00:54:09Z [INFO] [/pkg/task/sweep_job.go:150]: [RETENTION] start to sweep, retain latest 50 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:160]: [RETENTION] listed 90 candidate executions for sweep
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:180]: [RETENTION] end to sweep, 90 executions were deleted in total, elapsed time: 1.280175086s
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:150]: [SCAN_DATA_EXPORT] start to sweep, retain latest 50 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:160]: [SCAN_DATA_EXPORT] listed 0 candidate executions for sweep
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:180]: [SCAN_DATA_EXPORT] end to sweep, 0 executions were deleted in total, elapsed time: 1.892244ms
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:150]: [WEBHOOK] start to sweep, retain latest 50 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:160]: [WEBHOOK] listed 0 candidate executions for sweep
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:180]: [WEBHOOK] end to sweep, 0 executions were deleted in total, elapsed time: 1.654203ms
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:150]: [EXECUTION_SWEEP] start to sweep, retain latest 10 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:160]: [EXECUTION_SWEEP] listed 323 candidate executions for sweep
2024-04-10T00:54:10Z [ERROR] [/pkg/task/sweep_job.go:110]: [EXECUTION_SWEEP] failed to run sweep, error: failed to delete executions: ERROR: update or delete on table "execution" violates foreign key constraint "task_execution_id_fkey" on table "task" (SQLSTATE 23503)
2024-04-10T00:54:10Z [ERROR] [/pkg/task/sweep_job.go:175]: [EXECUTION_SWEEP] failed to batch clean candidates, error: failed to delete executions: ERROR: update or delete on table "execution" violates foreign key constraint "task_execution_id_fkey" on table "task" (SQLSTATE 23503)
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:150]: [GARBAGE_COLLECTION] start to sweep, retain latest 50 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:160]: [GARBAGE_COLLECTION] listed 1 candidate executions for sweep
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:180]: [GARBAGE_COLLECTION] end to sweep, 1 executions were deleted in total, elapsed time: 24.991294ms
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:150]: [PURGE_AUDIT_LOG] start to sweep, retain latest 10 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:160]: [PURGE_AUDIT_LOG] listed 0 candidate executions for sweep
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:180]: [PURGE_AUDIT_LOG] end to sweep, 0 executions were deleted in total, elapsed time: 1.714995ms
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:150]: [REPLICATION] start to sweep, retain latest 50 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:160]: [REPLICATION] listed 0 candidate executions for sweep
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:180]: [REPLICATION] end to sweep, 0 executions were deleted in total, elapsed time: 1.642119ms
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:150]: [SCAN_ALL] start to sweep, retain latest 1 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:180]: [SCAN_ALL] end to sweep, 0 executions were deleted in total, elapsed time: 4.913568ms
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:160]: [SCAN_ALL] listed 0 candidate executions for sweep
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:150]: [SLACK] start to sweep, retain latest 50 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:160]: [SLACK] listed 0 candidate executions for sweep
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:180]: [SLACK] end to sweep, 0 executions were deleted in total, elapsed time: 1.601537ms
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:150]: [SYSTEM_ARTIFACT_CLEANUP] start to sweep, retain latest 50 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:115]: end to run sweep job
The job cannot delete a job for type EXECUTION_SWEEP in the "task" table because there is no entry for the execution_id.
for example:
registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP';
count
-------
341
(1 row)
registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Error';
count
-------
219
(1 row)
registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Stopped';
count
-------
105
(1 row)
registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Success';
count
-------
2
(1 row)
registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Running';
count
-------
15
(1 row)
The "task" table does not have entries for executions in the "Error" status for vendor_type = 'EXECUTION_SWEEP'
Q:
- How fix it?
- Can I remove entries for the EXECUTION_SWEEP job from the "execution" and "task" table myself?
Version v2.9.3-f65d2380
+ 1
@wieldes Hi, could you try to execute the above same SQLs on the task table?
@chlins
@wieldes Hi, could you try to execute the above same SQLs on the task table?
registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP';
count
-------
512
(1 row)
registry=# SELECT COUNT(1) FROM task WHERE vendor_type = 'EXECUTION_SWEEP';
count
-------
128
(1 row)
registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Error';
count
-------
380
(1 row)
registry=# SELECT COUNT(1) FROM task WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Error';
count
-------
1
(1 row)
registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Stopped';
count
-------
105
(1 row)
registry=# SELECT COUNT(1) FROM task WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Stopped';
count
-------
102
(1 row)
registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Success';
count
-------
2
(1 row)
registry=# SELECT COUNT(1) FROM task WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Success';
count
-------
0
(1 row)
registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Running';
count
-------
25
(1 row)
registry=# SELECT COUNT(1) FROM task WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Running';
count
-------
0
(1 row)
I noticed that sweep IMAGE_SCAN executions took long time which caused the high latency for the EXECUTION_SWEEP job.
2024-04-09T22:29:54Z [INFO] [/pkg/task/sweep_job.go:150]: [IMAGE_SCAN] start to sweep, retain latest 1 executions
2024-04-10T00:54:09Z [INFO] [/pkg/task/sweep_job.go:180]: [IMAGE_SCAN] end to sweep, 0 executions were deleted in total, elapsed time: 2h24m14.915320345s
Could you also try to check the size of the of IMAGE_SCAN vendor type includes execution and task table?
@chlins
registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'IMAGE_SCAN';
count
---------
2617000
(1 row)
registry=# SELECT COUNT(1) FROM task WHERE vendor_type = 'IMAGE_SCAN';
count
---------
2617126
(1 row)
A lot of records for IMAGE_SCAN....
Some time ago we disabled the automatic image scan.
Will it be safe to remove IMAGE_SCAN records from task and execution tables?
If you don't need to keep the scan status or report for the artifacts, you can manually delete these records. Even after deletion, you can still re-trigger the image scan to re-analyze vulnerabilities for the artifacts.
In the current situation, the job can't complete in 1 hour, we should adjust the default execution sweep schedule from every hour to every day or editable setting
In the current situation, the job can't complete in 1 hour, we should adjust the default execution sweep schedule from every hour to every day or editable setting
Editable setting, it sounds good) What do you think about concurrency? Like in GC