Add support for ML models

ML models could theoretically be represented in the inventory. ML is often abstracted behind a service, which makes it easier to consume, but if you wanted to describe the models themselves, I think there may be a way to achieve this.

The thought is to support

Supervised
- Regression
  - Linear
  - Decision tree (continuous)
  - Random forest (continuous)
  - Neural network (continuous)
  - ...
- Classification
  - Logistic regression
  - Support vector machine
  - Naive Bayes
  - Decision tree (discrete)
  - Random forest (discrete)
  - Neural network (discrete)
  - ...
Unsupervised
- Clustering
  - K-means
  - Hierarchical
  - Mean shift
  - Density-based
  - ...
- Dimensionality reduction
  - Feature elimination
  - Feature extraction
  - Principal Component Analysis (PCA)
  - ...

The BIML Taxonomy of ML attacks has the following categories:

- input manipulation
- data manipulation
- model manipulation
- input extraction
- data extraction
- model extraction

Ideally, CycloneDX support for ML should not only describe ML models, but also be able to communicate potential or confirmed risk within this taxonomy.
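
A minimal sketch of how a model's category and its BIML risk status could sit together on one record. Everything here is an assumption for illustration: the class and field names (`ModelComponent`, `learning_type`, `risks`, `RiskStatus`) are hypothetical and are not taken from any CycloneDX schema; only the six attack category names come from the taxonomy above.

```python
from dataclasses import dataclass, field
from enum import Enum


class BimlRisk(Enum):
    """The six BIML attack categories listed above."""
    INPUT_MANIPULATION = "input manipulation"
    DATA_MANIPULATION = "data manipulation"
    MODEL_MANIPULATION = "model manipulation"
    INPUT_EXTRACTION = "input extraction"
    DATA_EXTRACTION = "data extraction"
    MODEL_EXTRACTION = "model extraction"


class RiskStatus(Enum):
    """Hypothetical status values for 'potential or confirmed' risk."""
    POTENTIAL = "potential"
    CONFIRMED = "confirmed"


@dataclass
class ModelComponent:
    # Hypothetical fields; not part of any CycloneDX schema.
    name: str
    learning_type: str  # "supervised" or "unsupervised"
    task: str           # e.g. "classification"
    algorithm: str      # e.g. "random forest (discrete)"
    risks: dict = field(default_factory=dict)  # BimlRisk -> RiskStatus

    def flag(self, risk: BimlRisk, status: RiskStatus) -> None:
        """Record a potential or confirmed risk for this model."""
        self.risks[risk] = status

    def confirmed_risks(self) -> list:
        """Return the names of confirmed risk categories, sorted."""
        return sorted(
            r.value for r, s in self.risks.items() if s is RiskStatus.CONFIRMED
        )


model = ModelComponent(
    "fraud-detector", "supervised", "classification", "random forest (discrete)"
)
model.flag(BimlRisk.MODEL_EXTRACTION, RiskStatus.POTENTIAL)
model.flag(BimlRisk.DATA_MANIPULATION, RiskStatus.CONFIRMED)
print(model.confirmed_risks())
```

Keeping the risk categories as a closed enumeration, rather than free text, is one way a consumer could filter models by taxonomy category; whether that fits the specification is an open question.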

Glossary

Abbreviation	Description
ML	Machine Learning
BIML	Berryville Institute of Machine Learning

Dec 27 '21 01:12 stevespringett

@stevespringett could you add a glossary? What are "ML", "BIML", and such? Maybe you could edit the initial comment and add links to the terms.

Jan 01 '22 16:01 jkowalleck

I think maybe we should also consider training data sets as components of an ML model.

Jan 08 '22 23:01 coderpatros

Training sets, data/model licenses, relevant metrics, external references to the training sets, related artifacts, model cards, etc., and a way to specify relationships to the software components involved in the training environment would definitely be important. Other attributes may be domain-related. For example, in NLP, the language that the model is expected to work on is very important. For deep learning models, the architecture is very important.

A good exercise would be to take a look at some existing model stores/model card projects and look at the metadata they capture and see what fits/what is missing.

Some relevant links:
- https://huggingface.co/docs/hub/model-repos
- https://github.com/mlflow/mlflow/blob/master/mlflow/store/model_registry/dbmodels/models.py
- https://modelcards.withgoogle.com/model-reports
- https://github.com/google/ml-metadata

I suspect we would also need a component type of dataset to fully describe a model.
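
A rough sketch of what that could look like, assuming a BOM shaped like the existing `components` / `dependencies` structure. The `dataset` and `machine-learning-model` type values, the component names, and the `datasets_for` helper are all hypothetical and only illustrate the relationship idea.

```python
import json

# Hypothetical BOM fragment. The "type" values for the model and the dataset
# are assumptions for illustration, not existing CycloneDX component types.
bom = {
    "components": [
        {"bom-ref": "model-1", "type": "machine-learning-model", "name": "sentiment-classifier"},
        {"bom-ref": "dataset-1", "type": "dataset", "name": "reviews-corpus"},
        {"bom-ref": "lib-1", "type": "library", "name": "training-framework"},
    ],
    "dependencies": [
        # The model depends on its training data and on the training software.
        {"ref": "model-1", "dependsOn": ["dataset-1", "lib-1"]},
    ],
}


def datasets_for(bom: dict, model_ref: str) -> list:
    """Return the names of dataset components the given model depends on."""
    by_ref = {c["bom-ref"]: c for c in bom["components"]}
    names = []
    for dep in bom["dependencies"]:
        if dep["ref"] != model_ref:
            continue
        for ref in dep["dependsOn"]:
            if by_ref[ref]["type"] == "dataset":
                names.append(by_ref[ref]["name"])
    return names


print(json.dumps(datasets_for(bom, "model-1")))
```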

Jan 09 '22 09:01 sambhav

Of all these, as an MLE I've leaned towards MLflow in the past because it provides for both model feature/parameter and hyper-parameter tagging. Hyper-parameter configurations are external to the model and cannot be estimated from data. For instance, if I'm tracking both my model performance AND my real-time cloud compute costs, that's the config I need to do that.
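
To make the parameter vs. hyper-parameter distinction concrete, here is a toy sketch in plain Python. It does not use MLflow; the `train` function, the data, and the `run_record` layout are made up for illustration.

```python
# Hyper-parameters: chosen before training, not estimated from the data.
hyperparameters = {"learning_rate": 0.05, "epochs": 500}

# Toy data generated by y = 2x, so the learned weight should approach 2.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]


def train(xs, ys, learning_rate, epochs):
    """Fit y = w * x by gradient descent on mean squared error."""
    w = 0.0  # model parameter: estimated from the data
    n = len(xs)
    for _ in range(epochs):
        grad = sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n
        w -= learning_rate * grad
    return w


parameters = {"w": train(xs, ys, **hyperparameters)}

# A tracking record would keep the two groups separate.
run_record = {"hyperparameters": hyperparameters, "parameters": parameters}
print(run_record)
```

The weight `w` comes out of training; the learning rate and epoch count have to be recorded separately because nothing in the fitted model reveals them.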

Jan 10 '22 15:01 Salkimmich

I think ML support in CDX will be critical in the near future.

Although this bill was just introduced and may or may not pass, there seems to be a clear need for increased transparency into these algorithms. https://www.wyden.senate.gov/news/press-releases/wyden-booker-and-clarke-introduce-algorithmic-accountability-act-of-2022-to-require-new-transparency-and-accountability-for-automated-decision-systems

https://www.congress.gov/bill/117th-congress/senate-bill/3572?q=%7B%22search%22%3A%5B%22cory+booker%22%2C%22cory%22%2C%22booker%22%5D%7D&s=7&r=5

Feb 14 '22 16:02 stevespringett

Datasets and their provenance are a confirmed use case that needs to be addressed. Datasets also have licenses. Some are "free", others are commercial, etc. So datasets themselves should reuse existing license support.
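
As a sketch of that reuse, a dataset entry could carry the same `licenses` array shape that components already use. The dataset names, the commercial license name, and the `license_labels` helper below are hypothetical.

```python
# Hypothetical dataset entries reusing the component-style "licenses" array:
# an SPDX id where one exists, otherwise a free-form license name.
datasets = [
    {"name": "open-images-sample", "licenses": [{"license": {"id": "CC-BY-4.0"}}]},
    {
        "name": "vendor-transactions",
        "licenses": [{"license": {"name": "Commercial license (hypothetical)"}}],
    },
]


def license_labels(component: dict) -> list:
    """Return the SPDX id, or failing that the name, of each license entry."""
    labels = []
    for entry in component.get("licenses", []):
        lic = entry["license"]
        labels.append(lic.get("id") or lic.get("name"))
    return labels


for ds in datasets:
    print(ds["name"], license_labels(ds))
```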

Feb 28 '22 17:02 stevespringett

This might provide a good starting point. https://www.gov.uk/government/collections/algorithmic-transparency-standard

Apr 07 '22 15:04 stevespringett

Related threat modeling framework for ML: https://plot4.ai/

May 07 '22 18:05 stevespringett

https://github.com/mitre/advmlthreatmatrix

Jul 25 '22 15:07 stevespringett

This is more future-facing, but the IBM AI Factsheets are one of the more practical implementations of model fact sheets presently.

Aug 17 '22 15:08 chrish42

Update: An updated modelCard view is available in the 1.5 workstreams repo. The data card view will likely tie into data, a new top-level property in a BOM supporting low-code/no-code apps, among other things.

Dec 14 '22 20:12 stevespringett
