Scan performance when task and execution tables in DB have more than 300k rows
What is the problem? Since our last upgrade to Harbor version 2.5.3, the scheduled full scan with Trivy takes a huge amount of time.
With version 2.4.x the full scan completed in 4h to 5h with around 55k images in the registry. With version 2.5.3 the full scan takes 20h to 30h with the same amount of images.
What is the configuration? Harbor is running in a Kubernetes cluster (AKS):
- Trivy has 2 replicas
- Job service has an HPA with 2 to 5 pods based on CPU and RAM
- Redis is a StatefulSet with a single pod (memory limit = 2.4 GB)
- The database is an Azure PostgreSQL Flexible Server with 4 vCPU and 8 GB RAM
- The storage account holds around 9 TB
All the other components are deployed with an HA configuration, some with an HPA.
What we see: during the full scan, the DB runs at 100% vCPU for about 10h from the start of the scheduled job.
Redis memory rises steadily to its 2.4 GB limit, which takes around 5h.
Here is the scan_image queue
Investigation on the database with pg_stat_statements
SELECT LEFT(query, 60) AS query_short,
       SUM(calls) AS tot_calls,
       SUM(total_exec_time) AS tot_time,
       SUM(total_exec_time) / SUM(calls) AS avg_time
FROM pg_stat_statements
WHERE dbid = 24831
GROUP BY query_short
ORDER BY tot_time DESC
LIMIT 3;
query_short | tot_calls | tot_time | avg_time
--------------------------------------------------------------+-----------+------------+----------
select id from task where extra_attrs->>$1 = $2 | 2154 | 2541720 | 1180
SELECT T0."id", T0."cve_id", T0."registration_uuid", T0."pac | 1053 | 595942 | 565
INSERT INTO "report_vulnerability_record" ("report_uuid", "v | 3924 | 147030 | 37
Durations are in milliseconds.
The simple SELECT on the task table (filtering on extra_attrs) takes a long time per call.
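To see where the time goes, an EXPLAIN on that statement should show whether it is a sequential scan over the ~400k task rows (extra_attrs is a JSONB column, and we are not aware of an expression index on the key being filtered). A minimal sketch; 'some_key' and 'some_value' are placeholders for the actual $1 / $2 values behind the parameterised query reported by pg_stat_statements:
-- Replace 'some_key' / 'some_value' with the real parameters observed
-- for the slow statement (e.g. from pg_stat_activity while a scan runs).
EXPLAIN (ANALYZE, BUFFERS)
SELECT id FROM task WHERE extra_attrs->>'some_key' = 'some_value';
If the plan reports a "Seq Scan on task", each of the ~2,154 calls walks the whole table, which would be consistent with the ~1,180 ms average we measured.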
task table
select vendor_type, count(*) from task group by vendor_type;
vendor_type | count
--------------------+--------
GARBAGE_COLLECTION | 51
IMAGE_SCAN | 183808
REPLICATION | 12129
RETENTION | 80421
SCAN_ALL | 223754
SCHEDULER | 436
The oldest task entry has a start_time in May 2021. It seems that the task table is somehow never cleaned up.
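For reference, the age and size of the entries can be checked per vendor_type with a query along these lines (a simple sketch over the task table as we observe it, nothing Harbor-specific assumed):
select vendor_type,
       min(start_time) as oldest_start_time,
       max(start_time) as newest_start_time,
       count(*)        as nb_rows
from task
group by vendor_type
order by oldest_start_time;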
execution table: in the data model, 1 execution is linked to n tasks. When you scan an image from the UI => 1 execution + 1 task are stored in the DB. When you launch a scheduled full scan => 1 execution + 55,000 tasks are stored in the DB (our case).
Once again we do not see any cleanup of these entries.
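The 1-to-n relationship is easy to see by counting tasks per execution; the heaviest executions are the SCAN_ALL runs (a sketch joining the two tables as described above):
select e.vendor_type,
       e.id        as execution_id,
       e.start_time,
       count(t.id) as nb_tasks
from execution as e
left join task as t on t.execution_id = e.id
group by e.vendor_type, e.id, e.start_time
order by nb_tasks desc
limit 10;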
A detailed view of the tasks and executions for the scan vendor_types:
WITH task_error as (
select b.execution_id,count(b.id) as cnt from task as b where b.status='Error' group by b.vendor_type, b.execution_id
), task_success as (
select b.execution_id,count(b.id) as cnt from task as b where b.status='Success' group by b.vendor_type, b.execution_id
), task_stopped as (
select b.execution_id,count(b.id) as cnt from task as b where b.status='Stopped' group by b.vendor_type, b.execution_id
)
select a.vendor_type, to_char(a.start_time,'YYYY-MM-DD'), a.status,
count(distinct a.id) as nb_execution,
SUM(task_error.cnt) as task_error_cnt,
SUM(task_success.cnt) as task_success_cnt,
SUM(task_stopped.cnt) as task_stopped_cnt
from execution as a
left Join task_error on task_error.execution_id=a.id
left Join task_success on task_success.execution_id=a.id
left Join task_stopped on task_stopped.execution_id=a.id
Where vendor_type like '%SCAN%'
group by a.vendor_type, to_char(a.start_time,'YYYY-MM-DD'), a.status
order by to_char(a.start_time,'YYYY-MM-DD');
vendor_type | to_char | status | nb_execution | task_error_cnt | task_success_cnt | task_stopped_cnt
-------------+------------+---------+--------------+----------------+------------------+------------------
IMAGE_SCAN | 2022-08-01 | Success | 778 | | 778 |
IMAGE_SCAN | 2022-08-02 | Error | 1 | 1 | |
IMAGE_SCAN | 2022-08-02 | Success | 725 | | 725 |
IMAGE_SCAN | 2022-08-03 | Success | 737 | | 737 |
IMAGE_SCAN | 2022-08-04 | Error | 2 | 2 | |
IMAGE_SCAN | 2022-08-04 | Success | 745 | | 745 |
IMAGE_SCAN | 2022-08-05 | Error | 1 | 1 | |
IMAGE_SCAN | 2022-08-05 | Success | 629 | | 629 |
IMAGE_SCAN | 2022-08-06 | Success | 87 | | 87 |
IMAGE_SCAN | 2022-08-07 | Success | 42 | | 42 |
SCAN_ALL | 2022-08-07 | Error | 1 | 12 | 20679 |
IMAGE_SCAN | 2022-08-08 | Success | 544 | | 544 |
IMAGE_SCAN | 2022-08-09 | Success | 574 | | 574 |
IMAGE_SCAN | 2022-08-10 | Success | 599 | | 599 |
IMAGE_SCAN | 2022-08-11 | Success | 554 | | 554 |
IMAGE_SCAN | 2022-08-12 | Success | 678 | | 678 |
IMAGE_SCAN | 2022-08-13 | Success | 70 | | 70 |
IMAGE_SCAN | 2022-08-14 | Success | 39 | | 39 |
SCAN_ALL | 2022-08-14 | Error | 1 | 14 | 21209 |
IMAGE_SCAN | 2022-08-15 | Success | 35 | | 35 |
IMAGE_SCAN | 2022-08-16 | Error | 1 | 1 | |
IMAGE_SCAN | 2022-08-16 | Success | 452 | | 452 |
IMAGE_SCAN | 2022-08-17 | Success | 548 | | 548 |
IMAGE_SCAN | 2022-08-18 | Success | 620 | | 620 |
IMAGE_SCAN | 2022-08-19 | Error | 2 | 2 | |
IMAGE_SCAN | 2022-08-19 | Success | 701 | | 701 |
IMAGE_SCAN | 2022-08-20 | Success | 74 | | 74 |
IMAGE_SCAN | 2022-08-21 | Error | 28 | 28 | |
IMAGE_SCAN | 2022-08-21 | Success | 45 | | 45 |
SCAN_ALL | 2022-08-21 | Error | 1 | 885 | 7657 | 43895
IMAGE_SCAN | 2022-08-22 | Error | 604 | 604 | |
IMAGE_SCAN | 2022-08-22 | Stopped | 1 | | | 1
IMAGE_SCAN | 2022-08-22 | Success | 129 | | 129 |
IMAGE_SCAN | 2022-08-23 | Error | 11 | 11 | |
IMAGE_SCAN | 2022-08-23 | Success | 917 | | 917 |
IMAGE_SCAN | 2022-08-24 | Success | 748 | | 748 |
IMAGE_SCAN | 2022-08-25 | Success | 654 | | 654 |
IMAGE_SCAN | 2022-08-26 | Success | 535 | | 535 |
IMAGE_SCAN | 2022-08-27 | Success | 42 | | 42 |
IMAGE_SCAN | 2022-08-28 | Error | 10 | 10 | |
IMAGE_SCAN | 2022-08-28 | Success | 39 | | 39 |
SCAN_ALL | 2022-08-28 | Error | 1 | 52 | 53508 |
IMAGE_SCAN | 2022-08-29 | Success | 739 | | 739 |
IMAGE_SCAN | 2022-08-30 | Success | 637 | | 637 |
IMAGE_SCAN | 2022-08-31 | Success | 626 | | 626 |
IMAGE_SCAN | 2022-09-01 | Success | 620 | | 620 |
IMAGE_SCAN | 2022-09-02 | Success | 814 | | 814 |
IMAGE_SCAN | 2022-09-03 | Success | 86 | | 86 |
IMAGE_SCAN | 2022-09-04 | Error | 12 | 12 | |
IMAGE_SCAN | 2022-09-04 | Success | 73 | | 73 |
SCAN_ALL | 2022-09-04 | Error | 1 | 210 | 53937 |
IMAGE_SCAN | 2022-09-05 | Error | 5 | 5 | |
IMAGE_SCAN | 2022-09-05 | Success | 700 | | 700 |
IMAGE_SCAN | 2022-09-06 | Success | 624 | | 624 |
IMAGE_SCAN | 2022-09-07 | Success | 858 | | 858 |
IMAGE_SCAN | 2022-09-08 | Error | 1 | 1 | |
IMAGE_SCAN | 2022-09-08 | Success | 891 | | 891 |
IMAGE_SCAN | 2022-09-09 | Success | 568 | | 568 |
IMAGE_SCAN | 2022-09-10 | Success | 87 | | 87 |
IMAGE_SCAN | 2022-09-11 | Error | 2 | 2 | |
IMAGE_SCAN | 2022-09-11 | Success | 79 | | 79 |
SCAN_ALL | 2022-09-11 | Error | 1 | 257 | 54649 |
IMAGE_SCAN | 2022-09-12 | Success | 396 | | 396 |
What is the issue? When Harbor hosts a large number of images, each full scan adds a lot of entries to the task table. After 8 full scans the task table has around 400k rows in it (our case), resulting in poor query performance.
For now we do not want to increase the CPU/RAM capacity of the DB, because it would double the cost of the PostgreSQL Flexible Server.
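If it turns out that old entries can be removed safely (see the questions below; we have not validated this with the Harbor maintainers), a manual cleanup would presumably look something like the sketch below. The restriction to SCAN_ALL and the 30-day retention window are our assumptions, not Harbor behaviour:
-- Sketch only: run inside a transaction and check the affected row counts
-- before committing.
begin;

delete from task
where vendor_type = 'SCAN_ALL'
  and execution_id in (
      select id
      from execution
      where vendor_type = 'SCAN_ALL'
        and start_time < now() - interval '30 days'
  );

delete from execution
where vendor_type = 'SCAN_ALL'
  and start_time < now() - interval '30 days';

-- inspect the reported row counts, then: commit;  -- or rollback;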
What are the questions?
- It is not clear how the cleanup of the task and execution tables is done. Where can we find this information for the different vendor_types?
- If there is a cleanup mechanism, how can we check that it is working properly? Are there logs somewhere?
- It is not clear what Harbor does with the older entries in the task and execution tables. Is it safe to clear them? Is it safe to clear all the different vendor_types (see the query result above)?
- We did not check with a previous version of Harbor (e.g. 2.4.x) whether there are significant changes in the database queries. Is this related to the bumped version of the Trivy adapter?
Any help appreciated ^^