GC can delete images that are not supposed to be deleted
Expected behavior and actual behavior: We have observed several times that, in some cases, GC deletes images that were not scheduled for deletion. The result is that the information is still present in the Harbor UI and database, but the data is missing from S3.
The GC logs also show that the manifest was deleted.
Steps to reproduce the problem:
In the UI the image is still visible
The image manifest SHA starts with d4f9a6cf.
```sh
# GC log
2024-06-04T04:13:09Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:238]: blob eligible for deletion: sha256:d4f9a6cf78a2482148fd3a429c1d2019bf27a3cee1dc74856344a5e03c521585
2024-06-04T04:14:30Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:366]: [108/1438] delete blob from storage: sha256:d4f9a6cf78a2482148fd3a429c1d2019bf27a3cee1dc74856344a5e03c521585
2024-06-04T04:14:30Z [INFO] [/jobservice/job/impl/gc/garbage_collection.go:395]: [108/1438] delete blob record from database: 5040, sha256:d4f9a6cf78a2482148fd3a429c1d2019bf27a3cee1dc74856344a5e03c521585
```
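To check whether a blob logged by GC is actually gone from the bucket, the digest can be mapped to its object key. This is a sketch assuming the default docker/distribution registry storage layout (`docker/registry/v2/...`) under the bucket's registry root; verify the prefix against your own deployment:

```go
package main

import (
	"fmt"
	"strings"
)

// blobPath returns the default docker/distribution storage path for a digest,
// relative to the registry root in the S3 bucket (layout assumption).
func blobPath(digest string) string {
	hex := strings.TrimPrefix(digest, "sha256:")
	return fmt.Sprintf("docker/registry/v2/blobs/sha256/%s/%s/data", hex[:2], hex)
}

func main() {
	// Digest taken from the GC log above.
	fmt.Println(blobPath("sha256:d4f9a6cf78a2482148fd3a429c1d2019bf27a3cee1dc74856344a5e03c521585"))
	// Existence can then be checked with e.g.:
	//   aws s3 ls "s3://<bucket>/<registry-root>/<path printed above>"
}
```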
Trying to pull this image results in `manifest unknown` instead of the `not found` error that would be returned if the image no longer existed at all.
We observed that other operations were running at the database level during the GC run, indicating that we had run out of DB connections:
```sh
2024-06-04T04:22:39Z [ERROR] [/pkg/notifier/notifier.go:203]: Error occurred when triggering handler *artifact.Handler of topic PUSH_ARTIFACT: failed to connect to `host=harbor-pg-database user=harbor database=harbor`: server error (FATAL: remaining connection slots are reserved for non-replication superuser connections (SQLSTATE 53300))
```
Versions:
- harbor version: 2.7.x, 2.9.x, 2.10.x
Additional context:
- Log files: No other errors in the logs besides DB SQLSTATE 53300
- GC Job completed successfully
Maybe related to https://github.com/beego/beego/issues/5255, resolved by https://github.com/goharbor/harbor/pull/20452.
Similar issue: https://github.com/goharbor/harbor/issues/19401
The issue may be caused by the beego ORM, as it does not propagate errors during data scanning. In some extreme cases, such as when a connection becomes unusable mid-scan, the ORM returns incorrect data, leading to wrong blob deletion candidates. We are working on upgrading Beego in this pull request: https://github.com/goharbor/harbor/pull/20555
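The failure mode described above can be illustrated with a simplified, hypothetical pure-Go sketch (not Harbor's actual code): if the query that loads still-referenced digests fails partway and the error is swallowed, the reference set comes back truncated and blobs that are still in use look like deletion candidates. The safe behavior is to abort GC on any scan error:

```go
package main

import (
	"errors"
	"fmt"
)

// loadReferencedBlobs simulates scanning the digests still referenced by
// artifacts. If the scan fails partway (e.g. connection slots exhausted,
// SQLSTATE 53300), it must surface the error instead of silently returning
// a truncated set.
func loadReferencedBlobs(rows []string, failAt int) (map[string]bool, error) {
	refs := make(map[string]bool)
	for i, d := range rows {
		if i == failAt {
			return nil, errors.New("remaining connection slots are reserved")
		}
		refs[d] = true
	}
	return refs, nil
}

// deletionCandidates marks every blob absent from refs as deletable.
func deletionCandidates(all []string, refs map[string]bool) []string {
	var out []string
	for _, d := range all {
		if !refs[d] {
			out = append(out, d)
		}
	}
	return out
}

func main() {
	all := []string{"sha256:aaa", "sha256:bbb", "sha256:ccc"}

	// Healthy scan: only the truly unreferenced blob is a candidate.
	refs, err := loadReferencedBlobs([]string{"sha256:aaa", "sha256:bbb"}, -1)
	if err != nil {
		panic(err)
	}
	fmt.Println(deletionCandidates(all, refs)) // [sha256:ccc]

	// Truncated scan: if the error were ignored, refs would be missing
	// sha256:bbb and GC would delete a blob that is still in use.
	if _, err := loadReferencedBlobs([]string{"sha256:aaa", "sha256:bbb"}, 1); err != nil {
		fmt.Println("abort GC:", err)
	}
}
```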
To mitigate the issue, you can schedule garbage collection during low-usage time slots.
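For reference, Harbor's custom GC schedule takes a cron expression; as far as I can tell it uses a six-field format with a leading seconds field (verify against your Harbor version's documentation). An expression like the following would run GC at 03:00 every Sunday:

```sh
0 0 3 * * 0
```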
@Vad1mo, does Harbor 2.12 with the fix solve this issue for you? We are experiencing the same issue, and we have lots of containers broken because of missing layers.
@hudymi yes, it's in v2.12. Can you confirm whether the missing layers were removed by GC? If so, could you also confirm that the removed digests did not belong to any artifacts that were in use at the time the GC ran?
@wy65701436 are there any steps I should follow to check this? Our check was simply based on a `docker pull`, followed by checking whether the missing layer was present in S3 (it was not).
@hudymi, did you launch the GC with the "Allow garbage collection on untagged artifacts" option?
@prgss yes
One thing: we are on Harbor 2.11, and I asked whether 2.12 fixes the problem so that we can re-enable GC after the upgrade.
@wy65701436 We upgraded from 2.10.3 to 2.12.2 and GC deleted images that had valid tags.