runbooks Delete data in object store upon Model deletion

Currently we only create them. This could be configurable in the Model (imageDeletionPolicy). The question will be whether to do this with a Job or in the Controller. This will have AuthN/SA implications. I think I prefer to do it in the controller and add GAR permissions to the controller.

Jun 24 '23 12:06 nstogner

I vote for reusing job since that's how image got created as well. That way the controller can have limited permissions and the end user is in control of what permissions are given in each namespace. I guess there are arguments for going either way.

Jun 25 '23 00:06 samos123

One thought is if we need to implement a validation check on whether the image exists in a validating webhook (for Model creation): that would require the controller to have the GAR permissions.

Jun 25 '23 01:06 nstogner

That is a good argument for doing it in the controller. It does make more sense for a simple delete API call to happen in the controller instead of having to spin up a job/pod for that. I agree that doing a small call like this in the controller is generally preferred. Right now there is a hard dependency on GCP so authn should be fine.

Jun 25 '23 03:06 samos123

I've fleshed out the proposal a little further here which would likely negate the need for image deletion. Immutable repo looks incompatible with tagged image deletion anyway.

If we don't use an immutable GAR registry (flexibility is nice without) but we still want to achieve this behavior, I think it'd be preferable to delete the potentially-conflicting tags (what would these even be? semvers?) instead of deleting images. Keeping images and a tag (i.e., long git SHA) that dependably identifies Model repo contents would eliminate lengthy builds of the contents we already have.

Maybe I'm missing a benefit to deletion that y'all can flesh out. FWIW, managing images and tags to me feels like a domain we should confine to jobs and not the controller... I'm struggling to find the right use-case where that really matters but jobs feel like the right security boundary. Potentially:

✅ Multi-tenant internal substratus deployments ✅ hosted SaaS substratus

Jun 28 '23 03:06 brandonjbjelland

You bring up some really good implications of adding authorization to the controller itself around tenant isolation. We need to keep these in mind.

It still seems to me like we would need some sort of check-if-exists validation if we don't overwrite though (which should likely exist in the controller - as it should likely exist as a validating webhook - which would push the controller towards needing admin privileges on the registry). The reason being: I am not sure we can ever fully capture the inputs to a container in a hash value. We could use git shas for source and then hash the training datasets when finetuning, but what about external dependencies at build time? A lot of machine learning projects I have seen pull in the latest versions of libraries. I wonder if we should take this into consideration?

Jun 30 '23 11:06 nstogner

Title changed to reflect the current state of models residing in blob storage.

Aug 01 '23 08:08 brandonjbjelland