Add support for ML models
ML models could theoretically be represented in the inventory. ML is often abstracted behind a service, which makes it easier to consume, but if you wanted to describe the models themselves, I think there may be a way to achieve this.
The thought is to support:
- Supervised
  - Regression
    - Linear
    - Decision tree (continuous)
    - Random forest (continuous)
    - Neural network (continuous)
    - ...
  - Classification
    - Logistic regression
    - Support vector machine
    - Naive Bayes
    - Decision tree (discrete)
    - Random forest (discrete)
    - Neural network (discrete)
    - ...
- Unsupervised
  - Clustering
    - K-means
    - Hierarchical
    - Mean shift
    - Density-based
    - ...
  - Dimensionality reduction
    - Feature elimination
    - Feature extraction
    - Principal Component Analysis (PCA)
    - ...
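One way to picture the list above is as a small lookup structure that a tool could use to label a model with its approach, task, and algorithm. The sketch below is only an illustration: the `ML_TAXONOMY` mapping, the hyphenated identifiers, and the `classify` helper are all hypothetical and not part of any CycloneDX schema. The original list is open-ended ("..."), so a real design would need to allow values beyond these.

```python
# Hypothetical encoding of the taxonomy listed above.
# None of these identifiers come from the CycloneDX specification.
ML_TAXONOMY = {
    "supervised": {
        "regression": [
            "linear",
            "decision-tree",
            "random-forest",
            "neural-network",
        ],
        "classification": [
            "logistic-regression",
            "support-vector-machine",
            "naive-bayes",
            "decision-tree",
            "random-forest",
            "neural-network",
        ],
    },
    "unsupervised": {
        "clustering": ["k-means", "hierarchical", "mean-shift", "density-based"],
        "dimensionality-reduction": [
            "feature-elimination",
            "feature-extraction",
            "principal-component-analysis",
        ],
    },
}


def classify(approach: str, task: str, algorithm: str) -> str:
    """Return a slash-separated taxonomy path, or raise ValueError if unknown."""
    tasks = ML_TAXONOMY.get(approach)
    if tasks is None:
        raise ValueError(f"unknown learning approach: {approach!r}")
    algorithms = tasks.get(task)
    if algorithms is None:
        raise ValueError(f"unknown task {task!r} for approach {approach!r}")
    if algorithm not in algorithms:
        raise ValueError(f"unknown algorithm {algorithm!r} for {approach}/{task}")
    return f"{approach}/{task}/{algorithm}"


print(classify("supervised", "classification", "naive-bayes"))
```

Keeping the path as three separate levels means the same algorithm name (for example a decision tree) can appear under both regression and classification without ambiguity.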
The BIML Taxonomy of ML attacks has the following categories:
- input manipulation
- data manipulation
- model manipulation
- input extraction
- data extraction
- model extraction
Ideally, CycloneDX support for ML should not only describe ML models but also communicate potential or confirmed risk within this taxonomy.
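To make the idea of "potential or confirmed risk within this taxonomy" concrete, here is a minimal sketch of how a model entry might carry risk annotations. The six categories are taken from the list above; everything else (`ModelRecord`, `ModelRisk`, `RiskStatus`, the field names, and the example model) is a hypothetical illustration, not a proposed schema.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from enum import Enum


class AttackCategory(Enum):
    """The six categories of the BIML taxonomy as listed in this issue."""
    INPUT_MANIPULATION = "input manipulation"
    DATA_MANIPULATION = "data manipulation"
    MODEL_MANIPULATION = "model manipulation"
    INPUT_EXTRACTION = "input extraction"
    DATA_EXTRACTION = "data extraction"
    MODEL_EXTRACTION = "model extraction"


class RiskStatus(Enum):
    """Hypothetical status values: is the risk theoretical or observed?"""
    POTENTIAL = "potential"
    CONFIRMED = "confirmed"


@dataclass(frozen=True)
class ModelRisk:
    category: AttackCategory
    status: RiskStatus
    note: str = ""


@dataclass
class ModelRecord:
    """Hypothetical inventory entry for one ML model."""
    name: str
    risks: list[ModelRisk] = field(default_factory=list)

    def confirmed_categories(self) -> set[AttackCategory]:
        """Categories with at least one confirmed risk."""
        return {r.category for r in self.risks if r.status is RiskStatus.CONFIRMED}


# Example model and notes are made up for illustration.
model = ModelRecord("example-sentiment-classifier")
model.risks.append(ModelRisk(AttackCategory.MODEL_EXTRACTION, RiskStatus.POTENTIAL))
model.risks.append(
    ModelRisk(AttackCategory.DATA_MANIPULATION, RiskStatus.CONFIRMED, "example note")
)
print(sorted(c.value for c in model.confirmed_categories()))
```

Separating the category from the status lets a consumer filter for confirmed issues without losing the record of risks that are only suspected.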
Glossary
| Abbreviation | Description |
|---|---|
| ML | Machine Learning |
| BIML | Berryville Institute of Machine Learning |
@stevespringett could you add a glossary? What do "ML", "BIML", and similar terms mean? Maybe you could edit the initial comment and add links to the terms.
I think maybe we should also consider training data sets as components of an ML model.
Training sets, data/model licenses, relevant metrics, external references to the training sets, related artifacts, model cards, etc., and a way to specify relationships to the software components involved in the training environment would definitely be important. Other attributes may be domain-related. For example, in NLP, the language that the model is expected to work on is very important. For deep learning models, the architecture is very important.
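As a rough sketch of what such a record could look like, the attributes mentioned above can be gathered into one structure and serialized. This is an assumption-laden illustration: `ModelDescription`, all of its field names, and every example value are hypothetical and are not taken from CycloneDX or from any of the model stores discussed in this thread.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass, field


@dataclass
class ModelDescription:
    """Hypothetical metadata record for one ML model."""
    name: str
    licenses: list[str] = field(default_factory=list)
    # References to dataset entries described elsewhere in the inventory.
    training_datasets: list[str] = field(default_factory=list)
    metrics: dict[str, float] = field(default_factory=dict)
    external_references: list[str] = field(default_factory=list)
    model_card_url: str | None = None
    # References to software components used in the training environment.
    training_environment: list[str] = field(default_factory=list)
    # Free-form, domain-related attributes (e.g. language for NLP,
    # architecture for deep learning models).
    domain_attributes: dict[str, str] = field(default_factory=dict)


def to_json(model: ModelDescription) -> str:
    """Serialize the record to JSON with stable key order."""
    return json.dumps(asdict(model), indent=2, sort_keys=True)


# All values below are placeholders for illustration only.
example = ModelDescription(
    name="example-nlp-model",
    licenses=["Apache-2.0"],
    training_datasets=["dataset:example-corpus"],
    metrics={"accuracy": 0.9},
    training_environment=["component:example-training-framework"],
    domain_attributes={"language": "en", "architecture": "transformer"},
)
print(to_json(example))
```

Keeping domain-specific attributes in an open key/value map is one possible design choice; it avoids baking NLP or deep learning concepts into the core structure, at the cost of weaker validation.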
A good exercise would be to take a look at some existing model stores/model card projects and look at the metadata they capture and see what fits/what is missing.
Some relevant links:
- https://huggingface.co/docs/hub/model-repos
- https://github.com/mlflow/mlflow/blob/master/mlflow/store/model_registry/dbmodels/models.py
- https://modelcards.withgoogle.com/model-reports
- https://github.com/google/ml-metadata
I suspect we would also need a component type of dataset to fully describe a model.
Of all these, as an MLE (machine learning engineer) I've leaned towards MLflow in the past because it supports tagging both model features/parameters and hyper-parameters. Hyper-parameter configurations are external to the model and cannot be estimated from data. For example, if I'm tracking both my model performance AND my real-time cloud compute costs, that's the config I need to do that.
I think ML support in CDX will be critical in the near future.
Although this bill was just introduced and may or may not pass, there seems to be a clear need for increased transparency into these algorithms. https://www.wyden.senate.gov/news/press-releases/wyden-booker-and-clarke-introduce-algorithmic-accountability-act-of-2022-to-require-new-transparency-and-accountability-for-automated-decision-systems
https://www.congress.gov/bill/117th-congress/senate-bill/3572?q=%7B%22search%22%3A%5B%22cory+booker%22%2C%22cory%22%2C%22booker%22%5D%7D&s=7&r=5
Datasets and their provenance are a confirmed use case that needs to be addressed. Datasets also have licenses. Some are "free", others are commercial, etc. So datasets themselves should reuse existing license support.
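The point about reusing license support can be sketched as a single license type shared by software and dataset entries. This is a minimal illustration under stated assumptions: the `License` and `Component` classes, the `"dataset"` type string, the `provenance` field, and the example entries are all hypothetical. The only thing assumed from existing practice is that a license may be given either as an SPDX identifier or as a free-text name (useful for commercial licenses).

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(frozen=True)
class License:
    """A license given as an SPDX identifier or, failing that, a plain name."""
    spdx_id: str | None = None
    name: str | None = None

    def __post_init__(self) -> None:
        if self.spdx_id is None and self.name is None:
            raise ValueError("a license needs an SPDX id or a name")


@dataclass
class Component:
    """Hypothetical inventory entry; datasets and software share one shape."""
    name: str
    type: str  # e.g. "library" or "dataset" (illustrative values)
    licenses: list[License] = field(default_factory=list)
    provenance: str | None = None  # where the data or code came from


def unlicensed_datasets(components: list[Component]) -> list[str]:
    """Names of dataset entries that declare no license at all."""
    return [c.name for c in components if c.type == "dataset" and not c.licenses]


# Placeholder entries for illustration only.
inventory = [
    Component("example-framework", "library", [License(spdx_id="Apache-2.0")]),
    Component("example-open-corpus", "dataset", [License(spdx_id="CC-BY-4.0")],
              provenance="example public source"),
    Component("example-vendor-data", "dataset",
              [License(name="Example Commercial License")]),
    Component("example-scraped-data", "dataset"),
]
print(unlicensed_datasets(inventory))
```

Because datasets use the same `License` type as software, any existing license tooling (policy checks, reports) could in principle cover them without a separate code path.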
This might provide a good starting point. https://www.gov.uk/government/collections/algorithmic-transparency-standard
Related threat modeling framework for ML: https://plot4.ai/
https://github.com/mitre/advmlthreatmatrix
This is more future-facing, but the IBM AI Factsheets are one of the more practical implementations of model fact sheets presently.
Update: An updated modelCard view is available in the 1.5 workstreams repo. The data card view will likely tie into data, a new top-level property in a BOM supporting low-code/no-code apps, among other things.