Consider adding in-cluster image caching layer

Open deliahu opened this issue 4 years ago • 4 comments

The primary motivation for this would be better performance and reduced load on the external image registry. One thing to test before investing in it is how much faster it would actually be than pulling from ECR in the same region.

We would need to do this in a way that doesn't jeopardize reliability, so perhaps we could add a dedicated node for the cache; if it goes down, we can fall back on the remote registry.
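
As a rough sketch of that fallback idea (not Cortex code): prefer the in-cluster cache when it is reachable, otherwise build the image reference against the remote registry. The function names and the registry addresses `image-cache.internal:5000` and `registry.example.com` are hypothetical, and a plain TCP connect is assumed to be a good enough health check.

```python
import socket
from typing import Callable


def tcp_reachable(address: str, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to 'host[:port]' succeeds in time."""
    host, _, port = address.partition(":")
    try:
        # Default to 443 when no port is given (assumption: HTTPS registry).
        with socket.create_connection((host, int(port or 443)), timeout=timeout):
            return True
    except OSError:
        return False


def pick_registry(
    cache: str,
    remote: str,
    probe: Callable[[str], bool] = tcp_reachable,
) -> str:
    """Prefer the in-cluster cache; fall back to the remote registry."""
    return cache if probe(cache) else remote


def image_ref(
    repository: str,
    tag: str,
    cache: str = "image-cache.internal:5000",  # hypothetical address
    remote: str = "registry.example.com",      # hypothetical address
    probe: Callable[[str], bool] = tcp_reachable,
) -> str:
    """Build the full image reference against whichever registry is usable."""
    return f"{pick_registry(cache, remote, probe)}/{repository}:{tag}"
```

The `probe` argument is injectable so the selection logic can be exercised without any network access, e.g. `image_ref("team/api", "v1", probe=lambda _: False)` resolves against the remote registry.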

Also related: we looked into supporting specifying backup image registries (#1995), and ran into a bit of a roadblock, but perhaps it’s gotten more feasible since then.

deliahu avatar May 20 '21 17:05 deliahu

It would be very interesting to see if this is indeed faster. We use fairly large images (around 16GB on average), and the network overhead takes a heavy toll on startup.

creatorrr avatar Aug 26 '21 20:08 creatorrr

@creatorrr just a passing thought: depending on whether you run Python code in your container or not, you might be able to reduce the size of your image considerably with a tool like https://github.com/google/subpar. Haven't used it yet, just stumbled upon it a few days ago. 16GB is a heck of a lot!

RobertLucian avatar Aug 26 '21 20:08 RobertLucian

Yup yup. It’s basically because we bundle the ML models within the image itself, so it’s not the runtime that’s causing the bloat (although that could use some optimisation too). We have tried using better compression, but that didn’t yield significant benefits.

creatorrr avatar Aug 27 '21 01:08 creatorrr

By the way, what’s the best way to measure time spent by the service in different stages during startup? Just looking at the logs?

creatorrr avatar Aug 27 '21 02:08 creatorrr