datahub

Synthetic Data Architecture - Standard Pipeline Process

Open grovesy opened this issue 4 years ago • 2 comments

Title

Parent: Synthetic-Data-Architecture
Outcome: Standard Pipeline Process

Abstract

Synthetic data processes should have a defined set of steps (a rough sketch of such a pipeline follows the list below):

  • Analyse data - classifiers, identifiers, and discrete values.
    - Financial organizations have common classifiers (curves, tenors, countries, currencies, etc.)
    - Identifiers to public external entities (LEI, ISIN, CUSIP)
    - Identifiers to private internal entities (account codes, trading books)

  • Decide on the best 'analysis' module (e.g. simple 'bucketing' or a GAN)

  • Parameterise the model (apply noise and fuzziness, and generalise/normalise distributions so as not to leak sensitive data)

  • Run the model on the production set

  • Use the model data, and any additional properties, to synthetically produce an artificial set

  • Do something with the synthetic dataset (e.g. publish it, or use it for testing)
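As a rough illustration of the steps above, here is a minimal sketch of a 'bucketing'-style pipeline in Python. The column-classification thresholds, function names, and noise parameter are assumptions for illustration only, not part of any agreed design:

```python
import numpy as np
import pandas as pd

def analyse(df: pd.DataFrame) -> dict:
    """Step 1: classify each column as a classifier (low-cardinality category),
    an identifier (high-cardinality key), or a discrete/continuous value."""
    profile = {}
    for col in df.columns:
        if df[col].dtype == object and df[col].nunique() <= 50:
            profile[col] = "classifier"
        elif df[col].dtype == object:
            profile[col] = "identifier"   # e.g. LEI, ISIN, account code
        else:
            profile[col] = "value"
    return profile

def fit_bucketing_model(df: pd.DataFrame, profile: dict, noise: float = 0.05) -> dict:
    """Steps 2-4: a simple 'bucketing' analysis module. Classifier frequencies
    and value histograms are captured from the production set, then perturbed
    with noise so the model itself does not leak exact distributions."""
    rng = np.random.default_rng()
    model = {}
    for col, kind in profile.items():
        if kind == "classifier":
            freq = df[col].value_counts(normalize=True)
            freq = (freq + rng.normal(0, noise, len(freq))).clip(lower=0)
            model[col] = ("classifier", freq / freq.sum())
        elif kind == "value":
            counts, edges = np.histogram(df[col].dropna(), bins=20)
            counts = counts + rng.normal(0, noise * counts.sum() / 20, len(counts))
            counts = counts.clip(min=0)
            model[col] = ("value", (counts / counts.sum(), edges))
        else:
            model[col] = ("identifier", None)  # identifiers are regenerated, never copied
    return model

def generate(model: dict, n: int) -> pd.DataFrame:
    """Step 5: produce an artificial set from the (noised) model only."""
    rng = np.random.default_rng()
    out = {}
    for col, (kind, params) in model.items():
        if kind == "classifier":
            out[col] = rng.choice(params.index, size=n, p=params.values)
        elif kind == "value":
            probs, edges = params
            bins = rng.choice(len(probs), size=n, p=probs)
            out[col] = rng.uniform(edges[bins], edges[bins + 1])
        else:
            out[col] = [f"SYN-{i:08d}" for i in range(n)]  # fresh surrogate IDs
    return pd.DataFrame(out)
```

A GAN-based analysis module would replace fit_bucketing_model/generate with a trained generative model, but the analyse → parameterise → run → generate shape of the pipeline stays the same.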

grovesy · Oct 13 '20 11:10

Hey @grovesy,

Should this issue be linked to the Standard Pipeline Process outcome document and is this where the success criteria will be listed?

James.

mcleo-d · Oct 14 '20 11:10

@grovesy - I have updated the links in the main issue body to reflect the suggestion below.

Should this issue be linked to the Standard Pipeline Process outcome document and is this where the success criteria will be listed?

mcleo-d · Oct 14 '20 11:10