Synthetic Data Architecture - Standard Pipeline Process
Parent: Synthetic-Data-Architecture
Outcome: Standard Pipeline Process
Abstract
Synthetic data processes should have a defined set of steps:
- Analyse the data: classifiers, identifiers, and discrete values (see the first sketch after this list).
  - Financial organizations have common classifiers (curves, tenors, countries, currencies, etc.)
  - Identifiers to public external entities (LEI, ISIN, CUSIP)
  - Identifiers to private internal entities (account codes, trading books)
- Decide on the best 'analysis' module (simple 'bucketing', GAN)
- Parameterise the model (apply noise and fuzziness, generalize/normalize distributions so as not to leak sensitive data; see the second sketch after this list)
- Run the model on the production set
- Use the model data, and any additional properties, to synthetically produce an artificial set
- Do something with the synthetic dataset
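
As a rough illustration of the 'analyse data' step, the first sketch below classifies the columns of a tabular dataset into public identifiers, private identifiers, classifiers, and continuous values. It is a minimal sketch only: the `classify_column` helper, the ISIN/LEI format regexes, the cardinality threshold, and the toy `trades` frame are all assumptions made for this example, not an agreed profiling module.

```python
import re

import pandas as pd

# Hypothetical format checks for public identifiers (assumption: pattern only, no checksum).
ISIN_RE = re.compile(r"^[A-Z]{2}[A-Z0-9]{9}[0-9]$")
LEI_RE = re.compile(r"^[A-Z0-9]{18}[0-9]{2}$")


def classify_column(series: pd.Series, cardinality_threshold: float = 0.5) -> str:
    """Rough heuristic for the 'analyse data' step: tag a column as a public identifier,
    private identifier, classifier, or continuous value."""
    sample = series.dropna().astype(str)
    if sample.empty:
        return "unknown"
    if sample.map(lambda v: bool(ISIN_RE.match(v))).all():
        return "identifier (public, ISIN-like)"
    if sample.map(lambda v: bool(LEI_RE.match(v))).all():
        return "identifier (public, LEI-like)"
    if pd.api.types.is_numeric_dtype(series):
        return "continuous value"
    # Low relative cardinality suggests a classifier (currency, country, tenor, ...);
    # the 0.5 threshold is an assumption sized for this toy example.
    if series.nunique() / len(series) <= cardinality_threshold:
        return "classifier"
    return "identifier (private, assumed from high-cardinality strings)"


if __name__ == "__main__":
    trades = pd.DataFrame({
        "isin": ["US0378331005", "GB0002634946", "DE0005140008",
                 "US0378331005", "FR0000131104", "GB0002634946"],
        "currency": ["USD", "EUR", "USD", "GBP", "EUR", "USD"],
        "trading_book": ["BK-TR-01", "BK-TR-02", "BK-TR-03",
                         "BK-TR-04", "BK-TR-05", "BK-TR-06"],
        "notional": [1_000_000.0, 250_000.0, 730_000.0,
                     90_000.0, 410_000.0, 2_200_000.0],
    })
    for name, column in trades.items():
        print(f"{name}: {classify_column(column)}")
```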
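
The second sketch covers the 'parameterise the model' and generation steps, using the simple 'bucketing' option: fit a histogram per column, perturb the bucket counts with noise so the production distribution is not reproduced exactly, then sample an artificial set from the noisy buckets. The `fit_noisy_buckets` and `sample_from_buckets` helpers, the Laplace noise, and the per-column independence are illustrative assumptions, not a proposed pipeline module.

```python
import numpy as np
import pandas as pd


def fit_noisy_buckets(series: pd.Series, n_buckets: int = 10,
                      noise_scale: float = 1.0, rng=None):
    """Fit a bucketed (histogram) distribution and perturb the bucket counts with
    Laplace noise so the synthetic output does not leak exact production counts."""
    rng = rng or np.random.default_rng()
    counts, edges = np.histogram(series.dropna(), bins=n_buckets)
    noisy = np.clip(counts + rng.laplace(0.0, noise_scale, size=counts.shape), 0.0, None)
    probs = noisy / noisy.sum()
    return edges, probs


def sample_from_buckets(edges, probs, n, rng=None):
    """Draw synthetic values: pick a bucket by its (noisy) probability,
    then a uniform point within that bucket."""
    rng = rng or np.random.default_rng()
    idx = rng.choice(len(probs), size=n, p=probs)
    return rng.uniform(edges[idx], edges[idx + 1])


if __name__ == "__main__":
    rng = np.random.default_rng(42)
    # Stand-in for the production set: a single skewed 'notional' column.
    production = pd.DataFrame({"notional": rng.lognormal(mean=12, sigma=1.0, size=5_000)})
    edges, probs = fit_noisy_buckets(production["notional"], n_buckets=20,
                                     noise_scale=2.0, rng=rng)
    synthetic = pd.DataFrame({"notional": sample_from_buckets(edges, probs, n=5_000, rng=rng)})
    print(synthetic["notional"].describe())
```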
Hey @grovesy,
Should this issue be linked to the Standard Pipeline Process outcome document and is this where the success criteria will be listed?
James.
@grovesy - I have updated the links in the main issue body to reflect the suggestion below.
> Should this issue be linked to the Standard Pipeline Process outcome document and is this where the success criteria will be listed?