harbor icon indicating copy to clipboard operation
harbor copied to clipboard

EXECUTION_SWEEP job queue is growing

Open wieldes opened this issue 10 months ago • 8 comments

Hello

We are seeing a problem with the growth of the job queue for the EXECUTION_SWEEP job type.

image

logs:

2024-04-09T22:29:53Z [INFO] [/pkg/task/sweep_job.go:91]: start to run sweep job
2024-04-09T22:29:54Z [INFO] [/pkg/task/sweep_job.go:150]: [IMAGE_SCAN] start to sweep, retain latest 1 executions
2024-04-10T00:54:09Z [INFO] [/pkg/task/sweep_job.go:180]: [IMAGE_SCAN] end to sweep, 0 executions were deleted in total, elapsed time: 2h24m14.915320345s
2024-04-10T00:54:09Z [INFO] [/pkg/task/sweep_job.go:160]: [IMAGE_SCAN] listed 0 candidate executions for sweep
2024-04-10T00:54:09Z [INFO] [/pkg/task/sweep_job.go:150]: [P2P_PREHEAT] start to sweep, retain latest 50 executions
2024-04-10T00:54:09Z [INFO] [/pkg/task/sweep_job.go:160]: [P2P_PREHEAT] listed 0 candidate executions for sweep
2024-04-10T00:54:09Z [INFO] [/pkg/task/sweep_job.go:180]: [P2P_PREHEAT] end to sweep, 0 executions were deleted in total, elapsed time: 3.373447ms
2024-04-10T00:54:09Z [INFO] [/pkg/task/sweep_job.go:150]: [RETENTION] start to sweep, retain latest 50 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:160]: [RETENTION] listed 90 candidate executions for sweep
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:180]: [RETENTION] end to sweep, 90 executions were deleted in total, elapsed time: 1.280175086s
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:150]: [SCAN_DATA_EXPORT] start to sweep, retain latest 50 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:160]: [SCAN_DATA_EXPORT] listed 0 candidate executions for sweep
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:180]: [SCAN_DATA_EXPORT] end to sweep, 0 executions were deleted in total, elapsed time: 1.892244ms
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:150]: [WEBHOOK] start to sweep, retain latest 50 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:160]: [WEBHOOK] listed 0 candidate executions for sweep
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:180]: [WEBHOOK] end to sweep, 0 executions were deleted in total, elapsed time: 1.654203ms
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:150]: [EXECUTION_SWEEP] start to sweep, retain latest 10 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:160]: [EXECUTION_SWEEP] listed 323 candidate executions for sweep
2024-04-10T00:54:10Z [ERROR] [/pkg/task/sweep_job.go:110]: [EXECUTION_SWEEP] failed to run sweep, error: failed to delete executions: ERROR: update or delete on table "execution" violates foreign key constraint "task_execution_id_fkey" on table "task" (SQLSTATE 23503)
2024-04-10T00:54:10Z [ERROR] [/pkg/task/sweep_job.go:175]: [EXECUTION_SWEEP] failed to batch clean candidates, error: failed to delete executions: ERROR: update or delete on table "execution" violates foreign key constraint "task_execution_id_fkey" on table "task" (SQLSTATE 23503)
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:150]: [GARBAGE_COLLECTION] start to sweep, retain latest 50 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:160]: [GARBAGE_COLLECTION] listed 1 candidate executions for sweep
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:180]: [GARBAGE_COLLECTION] end to sweep, 1 executions were deleted in total, elapsed time: 24.991294ms
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:150]: [PURGE_AUDIT_LOG] start to sweep, retain latest 10 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:160]: [PURGE_AUDIT_LOG] listed 0 candidate executions for sweep
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:180]: [PURGE_AUDIT_LOG] end to sweep, 0 executions were deleted in total, elapsed time: 1.714995ms
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:150]: [REPLICATION] start to sweep, retain latest 50 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:160]: [REPLICATION] listed 0 candidate executions for sweep
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:180]: [REPLICATION] end to sweep, 0 executions were deleted in total, elapsed time: 1.642119ms
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:150]: [SCAN_ALL] start to sweep, retain latest 1 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:180]: [SCAN_ALL] end to sweep, 0 executions were deleted in total, elapsed time: 4.913568ms
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:160]: [SCAN_ALL] listed 0 candidate executions for sweep
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:150]: [SLACK] start to sweep, retain latest 50 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:160]: [SLACK] listed 0 candidate executions for sweep
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:180]: [SLACK] end to sweep, 0 executions were deleted in total, elapsed time: 1.601537ms
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:150]: [SYSTEM_ARTIFACT_CLEANUP] start to sweep, retain latest 50 executions
2024-04-10T00:54:10Z [INFO] [/pkg/task/sweep_job.go:115]: end to run sweep job

The job cannot delete a job for type EXECUTION_SWEEP in the "task" table because there is no entry for the execution_id.

for example:

registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP';
 count
-------
   341
(1 row)

registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Error';
 count
-------
   219
(1 row)

registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Stopped';
 count
-------
   105
(1 row)

registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Success';
 count
-------
     2
(1 row)

registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Running';
 count
-------
    15
(1 row)

The "task" table does not have entries for executions in the "Error" status for vendor_type = 'EXECUTION_SWEEP'

Q:

  1. How fix it?
  2. Can I remove entries for the EXECUTION_SWEEP job from the "execution" and "task" table myself?

Version v2.9.3-f65d2380

wieldes avatar Apr 10 '24 08:04 wieldes

+ 1

Kot-o-pes avatar Apr 10 '24 08:04 Kot-o-pes

ktkd avatar Apr 10 '24 08:04 ktkd

MrShadow74 avatar Apr 10 '24 08:04 MrShadow74

@wieldes Hi, could you try to execute the above same SQLs on the task table?

chlins avatar Apr 17 '24 03:04 chlins

@chlins

@wieldes Hi, could you try to execute the above same SQLs on the task table?

registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP';
 count
-------
   512
(1 row)

registry=# SELECT COUNT(1) FROM task WHERE vendor_type = 'EXECUTION_SWEEP';
 count
-------
   128
(1 row)

registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Error';
 count
-------
   380
(1 row)

registry=# SELECT COUNT(1) FROM task WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Error';
 count
-------
     1
(1 row)

registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Stopped';
 count
-------
   105
(1 row)

registry=# SELECT COUNT(1) FROM task WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Stopped';
 count
-------
   102
(1 row)

registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Success';
 count
-------
     2
(1 row)

registry=# SELECT COUNT(1) FROM task WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Success';
 count
-------
     0
(1 row)

registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Running';
 count
-------
    25
(1 row)

registry=# SELECT COUNT(1) FROM task WHERE vendor_type = 'EXECUTION_SWEEP' AND status = 'Running';
 count
-------
     0
(1 row)

wieldes avatar Apr 17 '24 10:04 wieldes

I noticed that sweep IMAGE_SCAN executions took long time which caused the high latency for the EXECUTION_SWEEP job.

2024-04-09T22:29:54Z [INFO] [/pkg/task/sweep_job.go:150]: [IMAGE_SCAN] start to sweep, retain latest 1 executions
2024-04-10T00:54:09Z [INFO] [/pkg/task/sweep_job.go:180]: [IMAGE_SCAN] end to sweep, 0 executions were deleted in total, elapsed time: 2h24m14.915320345s

Could you also try to check the size of the of IMAGE_SCAN vendor type includes execution and task table?

chlins avatar Apr 17 '24 13:04 chlins

@chlins

registry=# SELECT COUNT(1) FROM execution WHERE vendor_type = 'IMAGE_SCAN';
  count
---------
 2617000
(1 row)

registry=# SELECT COUNT(1) FROM task WHERE vendor_type = 'IMAGE_SCAN';
  count
---------
 2617126
(1 row)

A lot of records for IMAGE_SCAN....

Some time ago we disabled the automatic image scan.

Will it be safe to remove IMAGE_SCAN records from task and execution tables?

wieldes avatar Apr 18 '24 09:04 wieldes

If you don't need to keep the scan status or report for the artifacts, you can manually delete these records. Even after deletion, you can still re-trigger the image scan to re-analyze vulnerabilities for the artifacts.

chlins avatar May 06 '24 01:05 chlins

In the current situation, the job can't complete in 1 hour, we should adjust the default execution sweep schedule from every hour to every day or editable setting

stonezdj avatar May 27 '24 07:05 stonezdj

In the current situation, the job can't complete in 1 hour, we should adjust the default execution sweep schedule from every hour to every day or editable setting

Editable setting, it sounds good) What do you think about concurrency? Like in GC

wieldes avatar May 27 '24 07:05 wieldes