[Bug]: Significant performance hit due to excessive image store lock
zot version
v2.1.2
Describe the bug
We've identified a significant performance bottleneck in the image store layer due to a global lock. This lock causes heavy contention and dramatically degrades overall system performance, especially during GC periods.
To reproduce
- Enable GC and image retention (a config sketch is shown below the list)
- Create a lot of ephemeral images
- Observe that requests take up to 10s of latency
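For reference, a minimal sketch of a zot config with GC and retention enabled, along the lines of the zot docs; the exact values (gcDelay, gcInterval, the retention policy) are illustrative and keys may vary by version:

```json
{
  "distSpecVersion": "1.1.0",
  "storage": {
    "rootDirectory": "/var/lib/zot",
    "gc": true,
    "gcDelay": "1h",
    "gcInterval": "1h",
    "retention": {
      "policies": [
        {
          "repositories": ["**"],
          "deleteUntagged": true,
          "keepTags": [{ "mostRecentlyPushedCount": 5 }]
        }
      ]
    }
  },
  "http": { "address": "127.0.0.1", "port": "5000" }
}
```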
Expected behavior
A lock per repository, instead of one global lock for the entire image store
Additional context
Code reference: https://github.com/project-zot/zot/blob/main/pkg/storage/imagestore/imagestore.go#L45
The image store uses a single mutex, which is acquired on many code paths.
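For illustration, a minimal sketch of what per-repository locking could look like, assuming a map of RWMutexes keyed by repository name. RepoLocker and its methods are hypothetical, not zot's actual API:

```go
package repolock

import "sync"

// RepoLocker hands out one RWMutex per repository, so a write lock
// taken for one repo does not block requests to other repos.
type RepoLocker struct {
	mu    sync.Mutex               // guards the locks map itself
	locks map[string]*sync.RWMutex // one RWMutex per repository
}

func NewRepoLocker() *RepoLocker {
	return &RepoLocker{locks: make(map[string]*sync.RWMutex)}
}

// lockFor returns the RWMutex for a repository, creating it on first use.
func (r *RepoLocker) lockFor(repo string) *sync.RWMutex {
	r.mu.Lock()
	defer r.mu.Unlock()
	l, ok := r.locks[repo]
	if !ok {
		l = &sync.RWMutex{}
		r.locks[repo] = l
	}
	return l
}

// RLock/RUnlock for readers (e.g. blob pulls), Lock/Unlock for writers
// (e.g. GC deleting blobs). GC on repo A no longer blocks pulls from repo B.
func (r *RepoLocker) RLock(repo string)   { r.lockFor(repo).RLock() }
func (r *RepoLocker) RUnlock(repo string) { r.lockFor(repo).RUnlock() }
func (r *RepoLocker) Lock(repo string)    { r.lockFor(repo).Lock() }
func (r *RepoLocker) Unlock(repo string)  { r.lockFor(repo).Unlock() }
```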
@vanhtuan0409 could you comment on what your scale is?
Yes, we are aware of some of these limitations, and there is work to address/mitigate this.
If you insist on keeping a single instance zot, https://github.com/project-zot/zot/pull/2600
else, https://zotregistry.dev/v2.1.2/articles/scaleout/
If you insist on keeping a single instance zot, #2600
That is an experiment I did not get to finish. It was getting stuck, probably a deadlock.
@vanhtuan0409 is this something you started experiencing in v2.1.2? That lock was part of the implementation for a very long time.
What storage type and cache driver are you using? There's also https://github.com/project-zot/zot/issues/2946.
@rchincha @andaaron We have a single-node setup where 80% of the repositories are ephemeral. The storage driver is s3 and the cache driver is dynamodb.
To my understanding, if we scale out the zot service, each instance will still perform a GC scan through all repositories. Repository cleanup will render the zot service unusable due to the write lock on the image store.
@andaaron I believe this issue has been present in the code base for a very long time. By its nature, our application uses mostly ephemeral repositories, and we can't disable deduplication.
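To make the failure mode concrete, here is a toy Go program (illustrative timings, not zot code) showing how a single global RWMutex held by a GC pass stalls a request to an unrelated repository:

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var store sync.RWMutex // stand-in for the image store's global lock

	// "GC" holds the write lock for 2s, as a cleanup pass might.
	go func() {
		store.Lock()
		defer store.Unlock()
		time.Sleep(2 * time.Second)
	}()

	time.Sleep(100 * time.Millisecond) // let GC grab the lock first

	// A pull from an unrelated repository still waits for GC to finish.
	start := time.Now()
	store.RLock()
	store.RUnlock()
	fmt.Printf("request waited %v behind GC\n",
		time.Since(start).Round(time.Millisecond))
}
```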
could you comment on what your scale is?
How many repositories? How many images? How big are those images? How many concurrent clients? How many requests per sec? ...etc
Approx. 400 repositories, each with 1-5 tags; 90% of those are ephemeral. Around 30 concurrent clients at 1-2 rps. Latency is currently up to 10s.
https://zotregistry.dev/v2.1.2/articles/pprofiling
If possible, can you generate a profile/flamegraph for us?
Also, please share your zot config (anonymized).
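In case it helps, a small Go snippet that captures a CPU profile from a running zot instance, assuming the pprof endpoint path from the linked article (/v2/_zot/pprof/); host, port, and duration are placeholders:

```go
package main

import (
	"io"
	"log"
	"net/http"
	"os"
)

func main() {
	// 30-second CPU profile; adjust host/port to your deployment.
	resp, err := http.Get("http://localhost:5000/v2/_zot/pprof/profile?seconds=30")
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()

	out, err := os.Create("zot-cpu.pprof")
	if err != nil {
		log.Fatal(err)
	}
	defer out.Close()

	if _, err := io.Copy(out, resp.Body); err != nil {
		log.Fatal(err)
	}
	// Inspect with: go tool pprof -http=:8081 zot-cpu.pprof (flamegraph view)
}
```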
@vanhtuan0409 zot also exports a lot of OpenTelemetry metrics (Prometheus, etc.).
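For completeness, a sketch of the config fragment that enables zot's Prometheus metrics extension, per the zot docs (keys may differ across versions):

```json
{
  "extensions": {
    "metrics": {
      "enable": true,
      "prometheus": { "path": "/metrics" }
    }
  }
}
```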
@vanhtuan0409 can you pls verify if your situation improves after this PR https://github.com/project-zot/zot/pull/2968?
I am in the middle of an ongoing project. Let me take a few days to verify.
Hey, any progress on this? It seems we're hitting the same issue on 2.1.5.
https://github.com/project-zot/zot/pull/3226
^ this should address the main bottlenecks.