[Bug]: Significant performance hit due to excessive image store lock

Open vanhtuan0409 opened this issue 10 months ago • 14 comments

zot version

v2.1.2

Describe the bug

We've identified a significant performance bottleneck in the image store layer due to a global lock. This lock appears to be causing excessive contention and dramatically impacting overall system performance, especially during GC periods.

To reproduce

  1. Enable GC and image retention
  2. Create a lot of ephemeral images
  3. Observe requests taking up to 10s of latency

Expected behavior

A single lock per repository instead of one global image store lock.
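
A minimal Go sketch of what per-repository locking could look like (lock striping keyed by repo name). This illustrates the requested behavior only; the type and names (repoLocks, get) are hypothetical and not part of zot's code base.

```go
package storage

import "sync"

// repoLocks hands out one RWMutex per repository so that GC on repo A
// does not block pulls from repo B. Hypothetical illustration only.
type repoLocks struct {
	mu    sync.Mutex
	locks map[string]*sync.RWMutex
}

func newRepoLocks() *repoLocks {
	return &repoLocks{locks: make(map[string]*sync.RWMutex)}
}

// get returns the lock for a repository, creating it on first use.
func (r *repoLocks) get(repo string) *sync.RWMutex {
	r.mu.Lock()
	defer r.mu.Unlock()
	lk, ok := r.locks[repo]
	if !ok {
		lk = &sync.RWMutex{}
		r.locks[repo] = lk
	}
	return lk
}

// Readers would lock only their own repository:
//   r.get("myrepo").RLock() ... RUnlock()
// GC would take the write lock for just the repository it is cleaning:
//   r.get("ephemeral-repo").Lock() ... Unlock()
```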

Screenshots

No response

Additional context

Code reference: https://github.com/project-zot/zot/blob/main/pkg/storage/imagestore/imagestore.go#L45

The image store uses a single mutex object, and it is acquired on nearly every operation.
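
For context, here is a simplified sketch of the pattern being described: one store-wide sync.RWMutex that every read path acquires, and that GC acquires for writing. This is an illustration of the reported shape of the problem, not zot's actual implementation.

```go
package storage

import "sync"

// ImageStoreSketch mirrors the reported problem: one RWMutex guards the
// entire store, so a GC write lock stalls reads for *all* repositories.
// Simplified illustration, not zot's actual code.
type ImageStoreSketch struct {
	lock *sync.RWMutex
	// ... blob/index state elided ...
}

// GetBlob is on the hot path: every pull takes the store-wide read lock.
func (is *ImageStoreSketch) GetBlob(repo, digest string) ([]byte, error) {
	is.lock.RLock()
	defer is.lock.RUnlock()
	// ... read blob for repo/digest ...
	return nil, nil
}

// RunGC holds the store-wide write lock while scanning a repository,
// blocking every concurrent read across all repositories for its duration.
func (is *ImageStoreSketch) RunGC(repo string) {
	is.lock.Lock()
	defer is.lock.Unlock()
	// ... walk manifests, apply retention, delete unreferenced blobs ...
}
```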

vanhtuan0409 avatar Feb 14 '25 03:02 vanhtuan0409

@vanhtuan0409 could you comment on what your scale is?

Yes, we are aware of some of these limitations, and there is work to address/mitigate this.

If you insist on keeping a single-instance zot: https://github.com/project-zot/zot/pull/2600

Otherwise, see https://zotregistry.dev/v2.1.2/articles/scaleout/

rchincha avatar Feb 14 '25 06:02 rchincha

If you insist on keeping a single-instance zot: #2600

That is an experiment I did not get to finish. It was getting stuck, probably a deadlock.

andaaron avatar Feb 14 '25 07:02 andaaron

@vanhtuan0409 is this something you started experiencing in v2.1.2? That lock was part of the implementation for a very long time.

What storage type and cache driver are you using? There's also https://github.com/project-zot/zot/issues/2946.

andaaron avatar Feb 14 '25 07:02 andaaron

@rchincha @andaaron I have a single-node setup where about 80% of the repositories are ephemeral. The storage driver is s3 and the cache driver is dynamodb.

To my understanding, even if we scale out the zot service, each instance will still perform a GC scan through all repositories. Repository cleanup will render the zot service unusable due to the write lock on the image store.
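
To make the effect concrete, here is a small, self-contained Go simulation (hypothetical timings, not measurements from zot): 30 readers share one RWMutex while a GC goroutine holds the write lock for 5 seconds, so every read waits roughly that long, which is consistent with the multi-second latencies reported above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Simulates the reported contention: one store-wide RWMutex shared by
// 30 "clients" while a "GC" goroutine holds the write lock for 5s.
func main() {
	var storeLock sync.RWMutex
	var wg sync.WaitGroup

	// GC: grabs the write lock and holds it, as a long repo scan would.
	go func() {
		time.Sleep(100 * time.Millisecond)
		storeLock.Lock()
		time.Sleep(5 * time.Second) // stand-in for scanning/cleaning repos
		storeLock.Unlock()
	}()

	// 30 concurrent clients doing "reads" (e.g. blob pulls).
	for i := 0; i < 30; i++ {
		wg.Add(1)
		go func(id int) {
			defer wg.Done()
			time.Sleep(200 * time.Millisecond) // start after GC has the lock
			start := time.Now()
			storeLock.RLock()
			storeLock.RUnlock()
			fmt.Printf("client %d waited %v for a read lock\n", id, time.Since(start))
		}(i)
	}
	wg.Wait()
}
```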

vanhtuan0409 avatar Feb 14 '25 08:02 vanhtuan0409

@andaaron I believe this issue has been present in the code base for a very long time. Due to the nature of our workload, our application uses mostly ephemeral repositories, and we can't disable deduplication.

vanhtuan0409 avatar Feb 14 '25 08:02 vanhtuan0409

could you comment on what your scale is?

How many repositories? How many images? How big are those images? How many concurrent clients? How many requests per sec? ...etc

rchincha avatar Feb 14 '25 16:02 rchincha

Approximately 400 repositories, each with 1-5 tags; 90% of those are ephemeral. Around 30 concurrent clients at 1-2 rps. Latency is currently up to 10s.

vanhtuan0409 avatar Feb 17 '25 03:02 vanhtuan0409

https://zotregistry.dev/v2.1.2/articles/pprofiling

If possible, can you generate a profile/flamegraph for us?

Also, please do share the zot config (anonymized).
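
A minimal sketch (in Go, for consistency with the other snippets) of grabbing a CPU profile from a running zot instance. The host, port, and exact pprof path are assumptions; check the linked pprofiling article for the endpoint your release exposes. The resulting file can then be opened with `go tool pprof -http=:8081 cpu.pprof` to view a flamegraph.

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

// Downloads a 30-second CPU profile from a zot pprof endpoint.
// The URL below is an assumption (placeholder host/port; path as described
// in the linked pprofiling article) -- adjust it to match your deployment.
func main() {
	const profileURL = "http://zot.example.com:5000/v2/_zot/pprof/profile?seconds=30"

	resp, err := http.Get(profileURL)
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	log.Println("wrote cpu.pprof; open with: go tool pprof -http=:8081 cpu.pprof")
}
```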

rchincha avatar Feb 17 '25 06:02 rchincha

@vanhtuan0409 zot also exports a lot of OpenTelemetry metrics (Prometheus, etc.).

rchincha avatar Feb 18 '25 05:02 rchincha

@vanhtuan0409 can you please verify whether your situation improves after this PR: https://github.com/project-zot/zot/pull/2968?

rchincha avatar Feb 27 '25 07:02 rchincha

I am in the middle of an ongoing project. Let me take a few days to verify.

vanhtuan0409 avatar Feb 27 '25 07:02 vanhtuan0409

Hey, any progress on this? It seems we're hitting the same issue on 2.1.5.

piontec avatar Aug 05 '25 13:08 piontec

https://github.com/project-zot/zot/pull/3226

^ this should address the main bottlenecks.

rchincha avatar Aug 06 '25 04:08 rchincha

This issue is stale because it has been open 90 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions[bot] avatar Nov 05 '25 02:11 github-actions[bot]