Advanced filtering for unpack
Describe the problem you're trying to solve It should be possible to unpack only named artifacts from an artifact group. For instance, it should be possible to unpack only README.md from docs.
Describe the solution you'd like
Add a new flag to unpack command like --filter the value of the filter should be able to indicate the artifact type and a name/patch to match. Any names/paths that partially or fully to the filter should be extracted.
And example would be kit unpack --filter=code:*config.* which would extract all the files that has "config." in the path names
- The flag should support glob patterns for flexibility.
- Ensure that the implementation is backward compatible. It should be possible to do
--filter:docs:*to extract all docs layers.
Do we want the filter to work per-layer, or per-file? Using the yet to be merged docs layers as an example, if we have a kitfile
docs
- name: model-documentation
path: docs/
can my glob (--filter=docs:README*) extract all readme files? How do we handle deeply-nested files?
My concern in this case is that we're writing a fairly complicated filter spec to handle files that the user is not necessarily familiar with (what are the filenames of files that are of interest to me? Is it named README.md or readme.md?). If I hand you a modelkit, will you be able to meaningfully use filters like this to get what you want?
My initial conception of this sort of feature would be that it works more on layers: you have to include more context in the kitfile, but if you have something like
docs:
- name: main-documentation
path: docs/
- name: readme
path: README.md
- name: changelog
path: CHANGELOG.md
you could use filters as follows:
--filter=docsto extract all docs layers--filter=docs:readmeto extract only the readme- Something like
--filter=docs:readme,changelogto extract the readme and changelog
This is simpler but has some benefits:
- Kit doesn't need to unpack all layers in order to do the unpack (with filepath globs, we'd have to look in the big main-documentation layer even if you just want the readme)
- It's arguably more useful for sharing modelkits, as you can at-a-glance see what's relevant for extracting
- It's simpler to understand and doesn't require understanding how unix-style file globbing works.
- It avoids some strange edge cases that are otherwise tricky (e.g. unpacking
some/deeply/nested/README.mdthat references other files and is broken if you only have it)
@amisevsk I like your simplified filter idea based on name.
I was thinking about this in relation to datasets where I might have training data files but also have a hyperparameter JSON file that is used during inference. The training files could be really large but the JSON file would like be really small.
I would want the most efficient way possible to unpack only the params.
datasets:
- description: training data
name: training-data
path: ./data/forum-to-2023-train.csv
- description: validation data
name: validation-data
path: ./data/test.csv
- description: hyperparameters
name: params-config
path: ./data/params.json
// alternatively
- description: hyperparameters
name: params-config
path: ./data/params.json
mediaType: application/vnd.myorg.oci.params.v1.tar
Would there need to be a way to specify if a Kitfile entry gets it's own entry in ModelKit's manifest under layers?
Maybe we could assign a custom mediaType and only unpack that type.
--filter=mediatype:application/vnd.myorg.oci.params.v1.tar
@rmtuckerphx I have considered custom media types before but communicating them to a person who had just received a modelkit felt like more trouble than it's worth.
@amisevsk I like the idea of simpler filtering without the globs. If I understand correctly, it means the filter will match to the path exactly? For instance --filter=docs:README.md,some/deeply/nested/README.md would be result extracting 2 files.
@gorkem Custom media types might not be something that beginners choose to use but I know exactly how I'd use it and the fact that I can unpack only a specific, named mediaType is so powerful.
Here is my idea for custom media types that fits with the simplified named filter idea: --filter=media:<name>
Add a new media section where each section can include a list of files:
media
- description: files needed by service calling infer on model
name: service
mediaType: application/vnd.myorg.oci.service.v1.tar # custom media type
parts:
- description: hyperparameters
path: ./service/params.json
- description: another file
path: ./path/to/files
I find the idea of custom media types interesting, we should split that into a separate issue for consideration. It might make sense as an advanced user feature, but I worry that it would introduce some complexity/strangeness and might be a non-goal -- ultimately, Kit will have to handle custom media types identically to some other media type (if it can even parse them correctly), so they're not meaningfully unique media types, in a functional sense. Using media types in this way feels like a potential misuse of what a media type is intended to convey (it's for the machine to know how to handle the data, not for the user). In other words, can this functionality be captured in a sort of "group" overlay on top of existing layers?
If I understand correctly, it means the filter will match to the path exactly? For instance --filter=docs:README.md,some/deeply/nested/README.md would be result extracting 2 files.
I'm thinking more restricted than that; you can't filter for actual files inside the tarballs, you can only filter for entire layers (and it's on the creator of the Kitfile to make those layers meaningful). For the hyperparameters example above
datasets:
- description: training data
name: training-data
path: ./data/forum-to-2023-train.csv
- description: validation data
name: validation-data
path: ./data/test.csv
- description: hyperparameters
name: params-config
path: ./data/params.json
// alternatively
- description: hyperparameters
name: params-config
path: ./data/params.json
mediaType: application/vnd.myorg.oci.params.v1.tar
you would use kit unpack --filter=datasets:params-config to get just the hyperparameters layer, with everything it includes.
This approach saves us work (we wouldn't even have to download the other dataset layers). With something like --filter=docs:README.md,some/deeply/nested/README.md, there's no great way of knowing which docs layer might contain some/deeply/nested/README.md, so we'd effectively have to pull and unpack all docs layers and discard 90% of the work.
@amisevsk My understanding was that all datasets were included in the same layer. If each named dataset is a separate layer and I can get to my 2k params-config quickly without having to download (worry about) my 550MB training-data then I'm fine.
I proved to myself that 2 data files will get 2 layers:
Kitfile
manifestVersion: v1.0.0
package:
name: rain-kit-test
version: 1.0.0
description: Kit Test
docs:
- path: ./README.md
description: Readme file
datasets:
- name: train-data
path: data/train.json
- name: config
path: data/config.json
$ kit inspect myreg/rain-kit-test:1.0.1
{
"digest": "sha256:281f07e19706860a54ae452e4e30488bf816eec8a98cbd1f9b9d05801be78884",
"cliVersion": "0.3.0-1653de5",
"kitfile": {
"manifestVersion": "v1.0.0",
"package": {
"name": "rain-kit-test",
"version": "1.0.0",
"description": "Kit Test"
},
"datasets": [
{
"name": "train-data",
"path": "data/train.json"
},
{
"name": "config",
"path": "data/config.json"
}
],
"docs": [
{
"path": "./README.md",
"description": "Readme file"
}
]
},
"manifest": {
"schemaVersion": 2,
"config": {
"mediaType": "application/vnd.kitops.modelkit.config.v1+json",
"digest": "sha256:f0745adda55d9f5abea9f689cb35e02dc9c7e1159e550169fdda78d6ddcb804b",
"size": 270
},
"layers": [
{
"mediaType": "application/vnd.kitops.modelkit.dataset.v1.tar",
"digest": "sha256:e0d71d869f0d196d8d97716424a5649fb7136ab43322e2c0f1687ae732a26df9",
"size": 2048
},
{
"mediaType": "application/vnd.kitops.modelkit.dataset.v1.tar",
"digest": "sha256:1d770b1f7a362e8cb2245ef1e18a06ef81d7378efb0a3d8f70dfe63960e154b0",
"size": 2048
},
{
"mediaType": "application/vnd.kitops.modelkit.docs.v1.tar",
"digest": "sha256:8ddea24d6194ead2e0532475b99eb343670e83496e57708842af9ab8507305c4",
"size": 2048
}
],
"annotations": {
"ml.kitops.modelkit.cli-version": "0.3.0-1653de5"
}
}
}
Is there a way to only pull from the registry the specific dataset (layer) for config? That would prevent network traffic of extra layers.
Let's say that I have a model, train data, test data and config all in the same ModelKit stored in a registry. For the data scientists on my team, they care about model, train data and test data. For deployment, there is a service that dynamically loads the model from the registry when it starts and also needs the config. The service doesn't care about the large train data file used during model training. It only cares about the model.
Should there be 2 separage ModelKits in the registry for each of those purposes with only the files needed for the context or can we use a single ModelKit?
This enhancement request is exactly for the issue you are pointing to. The current filter implementation only allows us to download all layers of the same media type (namely docs, code, model, dataset). This enhancement is to introduce an advanced filtering to select one or more layers with the same media type.
Sorry, I was away for a few weeks. You're right that each entry in a Kitfile (dataset, code, docs, or model) is saved as a separate layer in the modelkit (so each element can be downloaded independently). However, for our initial implementation for kit unpack filters, we restricted the filters based only on type (e.g. all datasets, all code, etc.). While this works great for "I want just the model and none of the code that comes with it", it would be great to also allow "I just want this dataset instead of all of them" as in your example.