
Add support for ML models

Open stevespringett opened this issue 3 years ago • 10 comments

ML models could theoretically be represented in the inventory. ML is often abstracted behind a service making it easier to consume, but if you wanted to describe the models themselves, I think there may be a way to achieve this.

The thought is to support

  • Supervised
    • Regression
      • Linear
      • Decision tree (continuous)
      • Random forest (continuous)
      • Neural network (continuous)
      • ...
    • Classification
      • Logistic regression
      • Support vector machine
      • Naive Bayes
      • Decision tree (discrete)
      • Random forest (discrete)
      • Neural network (discrete)
      • ...
  • Unsupervised
    • Clustering
      • K-means
      • Hierarchical
      • Mean shift
      • Density-based
      • ...
    • Dimensionality reduction
      • Feature elimination
      • Feature extraction
      • Principal Component Analysis (PCA)
      • ...
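As a rough illustration, the categories above could be held in a nested mapping so that a model's approach can be resolved to its place in the hierarchy. This is a minimal sketch only: `ML_TAXONOMY` and `find_paths` are hypothetical names, not part of CycloneDX, and the "..." placeholders in the list are deliberately left unfilled.

```python
# Hypothetical nested mapping of the taxonomy listed above.
# The "..." entries from the list are intentionally omitted, not guessed.
ML_TAXONOMY = {
    "Supervised": {
        "Regression": [
            "Linear",
            "Decision tree (continuous)",
            "Random forest (continuous)",
            "Neural network (continuous)",
        ],
        "Classification": [
            "Logistic regression",
            "Support vector machine",
            "Naive Bayes",
            "Decision tree (discrete)",
            "Random forest (discrete)",
            "Neural network (discrete)",
        ],
    },
    "Unsupervised": {
        "Clustering": ["K-means", "Hierarchical", "Mean shift", "Density-based"],
        "Dimensionality reduction": [
            "Feature elimination",
            "Feature extraction",
            "Principal Component Analysis (PCA)",
        ],
    },
}


def find_paths(approach):
    """Return every (learning type, task, approach) path whose approach
    name matches the given string, ignoring case."""
    wanted = approach.casefold()
    return [
        (learning_type, task, name)
        for learning_type, tasks in ML_TAXONOMY.items()
        for task, names in tasks.items()
        for name in names
        if name.casefold() == wanted
    ]


print(find_paths("k-means"))
```

A flat list of paths keeps the lookup simple and leaves room for an approach that may later appear under more than one branch.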

The BIML Taxonomy of ML attacks has the following categories:

  • input manipulation
  • data manipulation
  • model manipulation
  • input extraction
  • data extraction
  • model extraction

Ideally, CycloneDX support for ML should not only describe ML models, but also be able to communicate potential or confirmed risk within this taxonomy.
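One way to picture this requirement is a small record that ties a model to one of the six categories and marks the risk as potential or confirmed. The sketch below is an assumption for discussion only: `AttackCategory`, `RiskStatus`, `ModelRisk`, and `confirmed_categories` are hypothetical names, and nothing here reflects an actual CycloneDX schema.

```python
from dataclasses import dataclass
from enum import Enum


class AttackCategory(Enum):
    """The six categories listed above."""
    INPUT_MANIPULATION = "input manipulation"
    DATA_MANIPULATION = "data manipulation"
    MODEL_MANIPULATION = "model manipulation"
    INPUT_EXTRACTION = "input extraction"
    DATA_EXTRACTION = "data extraction"
    MODEL_EXTRACTION = "model extraction"


class RiskStatus(Enum):
    """Whether a risk is only suspected or has been verified."""
    POTENTIAL = "potential"
    CONFIRMED = "confirmed"


@dataclass(frozen=True)
class ModelRisk:
    """Hypothetical record linking a model to a risk category."""
    model_name: str
    category: AttackCategory
    status: RiskStatus


def confirmed_categories(risks):
    """Return the sorted, de-duplicated category names of confirmed risks."""
    return sorted(
        {r.category.value for r in risks if r.status is RiskStatus.CONFIRMED}
    )


# "example-classifier" is a made-up model name used only for illustration.
risks = [
    ModelRisk("example-classifier", AttackCategory.MODEL_EXTRACTION, RiskStatus.POTENTIAL),
    ModelRisk("example-classifier", AttackCategory.DATA_MANIPULATION, RiskStatus.CONFIRMED),
]
print(confirmed_categories(risks))
```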


Glossary

Abbreviation  Description
ML            Machine Learning
BIML          Berryville Institute of Machine Learning

stevespringett avatar Dec 27 '21 01:12 stevespringett

@stevespringett could you add a glossary? What do "ML", "BIML", and similar terms mean? Maybe you could edit the initial comment and add links to the terms.

jkowalleck avatar Jan 01 '22 16:01 jkowalleck

I think maybe we should also consider training data sets as components of an ML model.

coderpatros avatar Jan 08 '22 23:01 coderpatros

Training sets, data/model licenses, relevant metrics, external references to the training sets, related artifacts, model cards, etc., and a way to specify relationships to the software components involved in the training environment would definitely be important. Other attributes may be domain-related. For example, in NLP, the language that the model is expected to work on is very important. For deep learning models, the architecture is very important.

A good exercise would be to take a look at some existing model stores/model card projects and look at the metadata they capture and see what fits/what is missing.

Some relevant links:

  • https://huggingface.co/docs/hub/model-repos
  • https://github.com/mlflow/mlflow/blob/master/mlflow/store/model_registry/dbmodels/models.py
  • https://modelcards.withgoogle.com/model-reports
  • https://github.com/google/ml-metadata

I suspect we would also need a component type of dataset to fully describe a model.
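To make the metadata discussed in this comment more concrete, here is a minimal sketch of a model record that references its training datasets and their licenses. All class and field names (`Dataset`, `Model`, `license_id`, `licenses_in_scope`, and so on) are hypothetical and chosen for illustration; they are not taken from CycloneDX or from any of the linked projects.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Dataset:
    """Hypothetical dataset record; the license is a plain identifier string."""
    name: str
    license_id: str
    source_url: str = ""  # external reference to the training set, if any


@dataclass
class Model:
    """Hypothetical model record carrying the attributes mentioned above."""
    name: str
    architecture: str = ""                                 # e.g. for deep learning models
    languages: List[str] = field(default_factory=list)     # e.g. for NLP models
    license_id: str = ""
    metrics: Dict[str, float] = field(default_factory=dict)
    training_sets: List[Dataset] = field(default_factory=list)


def licenses_in_scope(model):
    """Collect the model license plus every training-set license, sorted."""
    ids = {d.license_id for d in model.training_sets if d.license_id}
    if model.license_id:
        ids.add(model.license_id)
    return sorted(ids)


# All names and values below are made up for the example.
model = Model(
    name="example-nlp-model",
    architecture="transformer",
    languages=["en"],
    license_id="Apache-2.0",
    metrics={"accuracy": 0.9},
    training_sets=[Dataset(name="example-corpus", license_id="CC-BY-4.0")],
)
print(licenses_in_scope(model))
```

Treating the dataset as its own record, rather than a string on the model, is what makes it possible to attach a license and an external reference to it separately.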

sambhav avatar Jan 09 '22 09:01 sambhav

Of all these, as an MLE I've leaned towards MLflow in the past because it supports tagging both model features/parameters and hyperparameters. Hyperparameter configurations are external to the model and cannot be estimated from data. For example, if I'm tracking both my model performance AND my real-time cloud compute costs, that's the configuration I need in order to do that.

Salkimmich avatar Jan 10 '22 15:01 Salkimmich

I think ML support in CDX will be critical in the near future.

Although this bill was just introduced and may or may not pass, there seems to be a clear need for increased transparency into these algorithms. https://www.wyden.senate.gov/news/press-releases/wyden-booker-and-clarke-introduce-algorithmic-accountability-act-of-2022-to-require-new-transparency-and-accountability-for-automated-decision-systems

https://www.congress.gov/bill/117th-congress/senate-bill/3572?q=%7B%22search%22%3A%5B%22cory+booker%22%2C%22cory%22%2C%22booker%22%5D%7D&s=7&r=5

stevespringett avatar Feb 14 '22 16:02 stevespringett

Datasets and their provenance are a confirmed use case that needs to be addressed. Datasets also have licenses. Some are "free", others are commercial, etc. So datasets themselves should reuse existing license support.

stevespringett avatar Feb 28 '22 17:02 stevespringett

This might provide a good starting point. https://www.gov.uk/government/collections/algorithmic-transparency-standard

stevespringett avatar Apr 07 '22 15:04 stevespringett

Related threat modeling framework for ML: https://plot4.ai/

stevespringett avatar May 07 '22 18:05 stevespringett

https://github.com/mitre/advmlthreatmatrix

stevespringett avatar Jul 25 '22 15:07 stevespringett

This is more future-facing, but the IBM AI Factsheets are one of the more practical implementations of model fact sheets presently.

chrish42 avatar Aug 17 '22 15:08 chrish42

Update: An updated modelCard view is available in the 1.5 workstreams repo. The data card view will likely tie into data, a new top-level property in a BOM that supports low-code/no-code apps, among other things.

stevespringett avatar Dec 14 '22 20:12 stevespringett