[Phase 3] Implement Adversarial Robustness and AI Safety
Problem
MISSING: AI safety, adversarial robustness, and alignment techniques.
Existing
- Issue #287: Safety filtering
- Basic security policy (SECURITY.md)
Missing Implementations
Adversarial Training (CRITICAL):
- FGSM (Fast Gradient Sign Method)
- PGD (Projected Gradient Descent)
- C&W (Carlini & Wagner)
- AutoAttack
- Adversarial training pipelines
Certified Robustness (HIGH):
- Randomized smoothing
- Interval bound propagation
- CROWN (verification)
- Certified defenses
AI Alignment (HIGH):
- RLHF (Reinforcement Learning from Human Feedback)
- Constitutional AI
- Red teaming frameworks
- Critique models
Safety Infrastructure (HIGH):
- Input validation
- Output filtering
- Jailbreak detection
- Harmful content detection
- Safety classifiers
Model Cards & Documentation (MEDIUM):
- Automated model cards
- Bias detection reports
- Performance stratification
- Ethical considerations documentation
Use Cases
- Defend against adversarial attacks
- Safety-critical deployments
- Aligned AI systems
- Responsible AI
- Regulatory compliance
Architecture
Success Criteria
- Robustness benchmarks (RobustBench)
- Certified accuracy metrics
- RLHF implementation
- Safety testing frameworks
Issue #421: Junior Developer Implementation Guide
Understanding Adversarial Robustness and AI Safety
Goal: Build defenses against adversarial attacks, implement safety monitoring and model alignment (RLHF), and produce comprehensive AI safety documentation for production systems.
Key Concepts for Beginners
What are Adversarial Attacks?
The Problem: Carefully crafted inputs that fool neural networks.
Example - Image Classification:
- Original image: classified as "Cat" (99% confident)
- Same image plus tiny noise (invisible to humans): classified as "Dog" (95% confident)
The noise is so small humans can't see it, but the model completely changes its prediction.
Why This Matters:
- Security: Attacker could fool facial recognition, spam filters
- Safety: Self-driving car misclassifies stop sign as speed limit
- Trust: Model fails on inputs it should handle
Types of Adversarial Attacks
1. FGSM (Fast Gradient Sign Method):
   - Adds noise in the direction that increases the loss
   - Fast but weak attack
2. PGD (Projected Gradient Descent):
   - Iteratively applies FGSM multiple times
   - Stronger, slower attack
3. C&W (Carlini & Wagner):
   - Optimization-based attack
   - Very strong; finds a minimal perturbation
What is RLHF (Reinforcement Learning from Human Feedback)?
The Concept: Train models to align with human preferences.
Process:
1. Model generates multiple responses
2. Humans rank the responses (best to worst)
3. Train reward model to predict human preferences
4. Use RL to optimize model to maximize predicted reward
Why This Matters: Prevents models from generating harmful, biased, or unhelpful content.
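Putting the four steps together, the outer loop looks roughly like the sketch below. It uses the RewardModel and PPOTrainer classes built in Phase 3; CollectHumanPreferences, rewardNetwork, policyModel, and promptLoader are hypothetical placeholders for your own data and models.
// Hypothetical end-to-end RLHF wiring (see Phase 3 for the class definitions)
// Steps 1-2: collect human-ranked preference pairs (outputA, outputB, preferredIndex)
var preferences = CollectHumanPreferences();
// Step 3: train the reward model to predict which output humans preferred
var rewardModel = new RewardModel<double>(rewardNetwork, new SGD<double>(0.001));
rewardModel.Train(preferences, epochs: 5);
// Step 4: use RL (PPO) to push the policy toward outputs the reward model scores highly
var ppo = new PPOTrainer<double>(policyModel, rewardModel, new SGD<double>(0.0001));
ppo.Train(promptLoader, epochs: 3);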
Phase 1: Adversarial Attack Implementation
AC 1.1: Implement FGSM Attack
What is FGSM? Fast Gradient Sign Method - adds noise in direction of gradient to maximize loss.
Formula: x_adv = x + ε * sign(∇_x L(x, y))
- x: original input
- ε: perturbation size (epsilon)
- ∇_x L: gradient of the loss with respect to the input
- sign(): sign of the gradient (+1 or -1)
File: src/Adversarial/FGSMAttack.cs
Step 1: Implement FGSM
// File: src/Adversarial/FGSMAttack.cs
namespace AiDotNet.Adversarial;
/// <summary>
/// Fast Gradient Sign Method adversarial attack.
/// Generates adversarial examples by adding small perturbations in gradient direction.
/// </summary>
public class FGSMAttack<T>
{
private readonly IModel<T> _model;
private readonly ILoss<T> _loss;
/// <summary>
/// Creates FGSM attack.
/// </summary>
/// <param name="model">Target model to attack</param>
/// <param name="loss">Loss function to maximize</param>
public FGSMAttack(IModel<T> model, ILoss<T> loss)
{
_model = model ?? throw new ArgumentNullException(nameof(model));
_loss = loss ?? throw new ArgumentNullException(nameof(loss));
}
/// <summary>
/// Generate adversarial example using FGSM.
/// </summary>
/// <param name="input">Original input</param>
/// <param name="target">True label</param>
/// <param name="epsilon">Perturbation magnitude (typically 0.01-0.3)</param>
/// <returns>Adversarial example</returns>
public Matrix<T> GenerateAdversarialExample(
Matrix<T> input,
Matrix<T> target,
double epsilon)
{
var numOps = NumericOperations<T>.Instance;
// Forward pass
var prediction = _model.Forward(input, training: true);
// Compute loss (the scalar value is only useful for logging; FGSM itself needs the gradient below)
var lossValue = _loss.Compute(prediction, target);
// Compute gradient of loss w.r.t. input
var lossGradient = _loss.ComputeGradient(prediction, target);
// Backpropagate through model to get gradient w.r.t. input
var inputGradient = _model.BackwardToInput(lossGradient);
// Create adversarial example: x_adv = x + epsilon * sign(gradient)
var adversarial = new Matrix<T>(input.Rows, input.Columns);
T epsilonValue = numOps.FromDouble(epsilon);
for (int r = 0; r < input.Rows; r++)
{
for (int c = 0; c < input.Columns; c++)
{
// Get sign of gradient
double gradValue = Convert.ToDouble(inputGradient[r, c]);
double sign = gradValue > 0 ? 1.0 : (gradValue < 0 ? -1.0 : 0.0);
// Add perturbation
T perturbation = numOps.Multiply(
epsilonValue,
numOps.FromDouble(sign)
);
adversarial[r, c] = numOps.Add(input[r, c], perturbation);
// Clip to valid range [0, 1] for images
double clipped = Math.Max(0.0, Math.Min(1.0, Convert.ToDouble(adversarial[r, c])));
adversarial[r, c] = numOps.FromDouble(clipped);
}
}
return adversarial;
}
/// <summary>
/// Evaluate attack success rate on a dataset.
/// </summary>
public AttackResult EvaluateAttack(
List<(Matrix<T> Input, Matrix<T> Target)> testData,
double epsilon)
{
int totalSamples = testData.Count;
int successfulAttacks = 0;
double avgPerturbation = 0.0;
foreach (var sample in testData)
{
// Generate adversarial example
var adversarial = GenerateAdversarialExample(sample.Input, sample.Target, epsilon);
// Check if attack succeeded (model misclassified adversarial example)
var originalPrediction = _model.Forward(sample.Input, training: false);
var adversarialPrediction = _model.Forward(adversarial, training: false);
int originalClass = GetPredictedClass(originalPrediction);
int adversarialClass = GetPredictedClass(adversarialPrediction);
if (originalClass != adversarialClass)
{
successfulAttacks++;
}
// Compute perturbation magnitude
double perturbationNorm = ComputeL2Norm(sample.Input, adversarial);
avgPerturbation += perturbationNorm;
}
avgPerturbation /= totalSamples;
return new AttackResult
{
TotalSamples = totalSamples,
SuccessfulAttacks = successfulAttacks,
SuccessRate = (double)successfulAttacks / totalSamples,
AveragePerturbationNorm = avgPerturbation
};
}
private int GetPredictedClass(Matrix<T> prediction)
{
int maxIndex = 0;
double maxValue = Convert.ToDouble(prediction[0, 0]);
for (int c = 1; c < prediction.Columns; c++)
{
double value = Convert.ToDouble(prediction[0, c]);
if (value > maxValue)
{
maxValue = value;
maxIndex = c;
}
}
return maxIndex;
}
private double ComputeL2Norm(Matrix<T> original, Matrix<T> perturbed)
{
double sumSquared = 0.0;
for (int r = 0; r < original.Rows; r++)
{
for (int c = 0; c < original.Columns; c++)
{
double diff = Convert.ToDouble(perturbed[r, c]) - Convert.ToDouble(original[r, c]);
sumSquared += diff * diff;
}
}
return Math.Sqrt(sumSquared);
}
}
/// <summary>
/// Results from adversarial attack evaluation.
/// </summary>
public class AttackResult
{
public int TotalSamples { get; set; }
public int SuccessfulAttacks { get; set; }
public double SuccessRate { get; set; }
public double AveragePerturbationNorm { get; set; }
}
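A quick usage sketch (trainedModel, testImage, trueLabel, and testData are hypothetical; the loss type follows this guide's conventions):
// Hypothetical usage of the FGSM attack above
var attack = new FGSMAttack<double>(trainedModel, new CrossEntropyLoss<double>());
// Craft one adversarial example with a small perturbation budget
var adversarial = attack.GenerateAdversarialExample(testImage, trueLabel, epsilon: 0.1);
// Measure how often the attack flips the model's prediction across a test set
var result = attack.EvaluateAttack(testData, epsilon: 0.1);
Console.WriteLine($"Attack success rate: {result.SuccessRate:P1}, " +
    $"avg L2 perturbation: {result.AveragePerturbationNorm:F4}");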
Step 2: Create unit test
// File: tests/UnitTests/Adversarial/FGSMAttackTests.cs
namespace AiDotNet.Tests.Adversarial;
public class FGSMAttackTests
{
// Assumes CreateSimpleModel(), CreateTestInput(), CreateTestTarget(), and
// CreateTestDataset() helpers are defined elsewhere in this test class.
[Fact]
public void GenerateAdversarialExample_ChangesInput()
{
// Arrange
var model = CreateSimpleModel();
var loss = new CrossEntropyLoss<double>();
var attack = new FGSMAttack<double>(model, loss);
var input = CreateTestInput();
var target = CreateTestTarget();
// Act
var adversarial = attack.GenerateAdversarialExample(input, target, epsilon: 0.1);
// Assert
// Adversarial should be different from original
bool hasDifference = false;
for (int r = 0; r < input.Rows; r++)
{
for (int c = 0; c < input.Columns; c++)
{
if (input[r, c] != adversarial[r, c])
{
hasDifference = true;
break;
}
}
}
Assert.True(hasDifference);
// All values should be in [0, 1] range
for (int r = 0; r < adversarial.Rows; r++)
{
for (int c = 0; c < adversarial.Columns; c++)
{
Assert.InRange(adversarial[r, c], 0.0, 1.0);
}
}
}
[Fact]
public void EvaluateAttack_ReturnsSuccessRate()
{
// Arrange
var model = CreateSimpleModel();
var loss = new CrossEntropyLoss<double>();
var attack = new FGSMAttack<double>(model, loss);
var testData = CreateTestDataset(100);
// Act
var result = attack.EvaluateAttack(testData, epsilon: 0.15);
// Assert
Assert.InRange(result.SuccessRate, 0.0, 1.0);
Assert.Equal(100, result.TotalSamples);
Assert.True(result.AveragePerturbationNorm >= 0.0);
}
}
AC 1.2: Implement PGD Attack
What is PGD? Projected Gradient Descent - iteratively applies FGSM and projects back to valid region.
Algorithm:
1. Start with original input + small random noise
2. For k iterations:
- Apply FGSM step
- Project back to epsilon ball around original input
3. Return final perturbed input
File: src/Adversarial/PGDAttack.cs
// File: src/Adversarial/PGDAttack.cs
namespace AiDotNet.Adversarial;
/// <summary>
/// Projected Gradient Descent adversarial attack.
/// Stronger iterative version of FGSM.
/// </summary>
public class PGDAttack<T>
{
private readonly IModel<T> _model;
private readonly ILoss<T> _loss;
private readonly Random _random;
public PGDAttack(IModel<T> model, ILoss<T> loss)
{
_model = model ?? throw new ArgumentNullException(nameof(model));
_loss = loss ?? throw new ArgumentNullException(nameof(loss));
_random = new Random();
}
/// <summary>
/// Generate adversarial example using PGD.
/// </summary>
/// <param name="input">Original input</param>
/// <param name="target">True label</param>
/// <param name="epsilon">Maximum perturbation (L-infinity bound)</param>
/// <param name="alpha">Step size per iteration</param>
/// <param name="numIterations">Number of PGD iterations</param>
/// <param name="randomStart">Whether to start from random point in epsilon ball</param>
public Matrix<T> GenerateAdversarialExample(
Matrix<T> input,
Matrix<T> target,
double epsilon,
double alpha = 0.01,
int numIterations = 40,
bool randomStart = true)
{
var numOps = NumericOperations<T>.Instance;
// Initialize adversarial example
var adversarial = new Matrix<T>(input.Rows, input.Columns);
if (randomStart)
{
// Start from random point in epsilon ball
for (int r = 0; r < input.Rows; r++)
{
for (int c = 0; c < input.Columns; c++)
{
double randomNoise = (_random.NextDouble() - 0.5) * 2 * epsilon;
double value = Convert.ToDouble(input[r, c]) + randomNoise;
adversarial[r, c] = numOps.FromDouble(Math.Max(0.0, Math.Min(1.0, value)));
}
}
}
else
{
// Start from original input
for (int r = 0; r < input.Rows; r++)
{
for (int c = 0; c < input.Columns; c++)
{
adversarial[r, c] = input[r, c];
}
}
}
// Iterative FGSM
for (int iter = 0; iter < numIterations; iter++)
{
// Forward pass
var prediction = _model.Forward(adversarial, training: true);
// Compute loss and gradient
var lossGradient = _loss.ComputeGradient(prediction, target);
var inputGradient = _model.BackwardToInput(lossGradient);
// Apply FGSM step
T alphaValue = numOps.FromDouble(alpha);
for (int r = 0; r < adversarial.Rows; r++)
{
for (int c = 0; c < adversarial.Columns; c++)
{
// Gradient step
double gradValue = Convert.ToDouble(inputGradient[r, c]);
double sign = gradValue > 0 ? 1.0 : (gradValue < 0 ? -1.0 : 0.0);
double currentValue = Convert.ToDouble(adversarial[r, c]);
double newValue = currentValue + alpha * sign;
// Project back to epsilon ball around original input
double originalValue = Convert.ToDouble(input[r, c]);
newValue = Math.Max(originalValue - epsilon, Math.Min(originalValue + epsilon, newValue));
// Clip to [0, 1]
newValue = Math.Max(0.0, Math.Min(1.0, newValue));
adversarial[r, c] = numOps.FromDouble(newValue);
}
}
}
return adversarial;
}
}
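A usage sketch for the stronger attack (hypothetical variables; the epsilon/alpha/iteration values are common choices, not requirements):
// Hypothetical usage of the PGD attack above
var pgd = new PGDAttack<double>(trainedModel, new CrossEntropyLoss<double>());
// 40 small steps inside an epsilon ball of 0.1; the random start helps
// avoid gradient-masking artifacts near the clean input
var pgdExample = pgd.GenerateAdversarialExample(
    testImage, trueLabel,
    epsilon: 0.1, alpha: 0.01, numIterations: 40, randomStart: true);
// Rule of thumb: pick alpha around 2.5 * epsilon / numIterations so each
// coordinate can reach the boundary of the epsilon ball within the iteration budget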
Phase 2: Adversarial Training Defense
AC 2.1: Implement AdversarialTrainer
What is Adversarial Training? Train model on both clean and adversarial examples to make it robust.
Process:
1. For each training batch:
- Generate adversarial examples
- Mix with clean examples
- Train on combined dataset
2. Model learns to be robust to perturbations
File: src/Adversarial/AdversarialTrainer.cs
// File: src/Adversarial/AdversarialTrainer.cs
namespace AiDotNet.Adversarial;
/// <summary>
/// Trains models with adversarial examples to improve robustness.
/// </summary>
public class AdversarialTrainer<T>
{
private readonly IModel<T> _model;
private readonly ILoss<T> _loss;
private readonly IOptimizer<T> _optimizer;
private readonly FGSMAttack<T> _fgsmAttack;
private readonly PGDAttack<T> _pgdAttack;
public enum AttackType
{
FGSM,
PGD
}
/// <summary>
/// Ratio of adversarial examples to clean examples (0.0 to 1.0).
/// 0.5 means 50% adversarial, 50% clean.
/// </summary>
public double AdversarialRatio { get; set; } = 0.5;
/// <summary>
/// Perturbation magnitude for adversarial examples.
/// </summary>
public double Epsilon { get; set; } = 0.1;
public AdversarialTrainer(
IModel<T> model,
ILoss<T> loss,
IOptimizer<T> optimizer)
{
_model = model ?? throw new ArgumentNullException(nameof(model));
_loss = loss ?? throw new ArgumentNullException(nameof(loss));
_optimizer = optimizer ?? throw new ArgumentNullException(nameof(optimizer));
_fgsmAttack = new FGSMAttack<T>(model, loss);
_pgdAttack = new PGDAttack<T>(model, loss);
}
/// <summary>
/// Train model with adversarial examples.
/// </summary>
public void Train(
IDataLoader<T> dataLoader,
int epochs,
AttackType attackType = AttackType.PGD)
{
for (int epoch = 0; epoch < epochs; epoch++)
{
double totalLoss = 0.0;
int batchCount = 0;
foreach (var batch in dataLoader.GetBatches())
{
// Generate adversarial examples for this batch
var adversarialBatch = GenerateAdversarialBatch(
batch.Input,
batch.Target,
attackType
);
// Mix adversarial and clean examples based on ratio,
// keeping each target row aligned with its input row
var (mixedInput, mixedTarget) = MixBatch(
    batch.Input, adversarialBatch, batch.Target, AdversarialRatio);
// Forward pass
var prediction = _model.Forward(mixedInput, training: true);
// Compute loss
var lossValue = _loss.Compute(prediction, mixedTarget);
totalLoss += Convert.ToDouble(lossValue);
// Backward pass
var gradient = _loss.ComputeGradient(prediction, mixedTarget);
var paramGradients = _model.Backward(gradient);
// Update parameters
var parameters = _model.GetParameters();
_optimizer.Update(parameters, paramGradients);
batchCount++;
}
double avgLoss = totalLoss / batchCount;
Console.WriteLine($"Epoch {epoch + 1}/{epochs}, Loss: {avgLoss:F4}");
}
}
private Matrix<T> GenerateAdversarialBatch(
Matrix<T> input,
Matrix<T> target,
AttackType attackType)
{
int batchSize = input.Rows;
var adversarialExamples = new List<Matrix<T>>();
for (int i = 0; i < batchSize; i++)
{
// Extract single example
var singleInput = ExtractRow(input, i);
var singleTarget = ExtractRow(target, i);
// Generate adversarial example
Matrix<T> adversarial;
if (attackType == AttackType.FGSM)
{
adversarial = _fgsmAttack.GenerateAdversarialExample(
singleInput, singleTarget, Epsilon);
}
else // PGD
{
adversarial = _pgdAttack.GenerateAdversarialExample(
singleInput, singleTarget, Epsilon);
}
adversarialExamples.Add(adversarial);
}
// Combine into batch
return CombineRows(adversarialExamples);
}
private (Matrix<T> Input, Matrix<T> Target) MixBatch(
Matrix<T> clean,
Matrix<T> adversarial,
Matrix<T> target,
double adversarialRatio)
{
int cleanCount = (int)((1.0 - adversarialRatio) * clean.Rows);
var inputs = new List<Matrix<T>>();
var targets = new List<Matrix<T>>();
// The first cleanCount rows stay clean; the rest are replaced by their
// adversarial counterparts. Each row keeps its own target, so inputs
// and labels remain aligned.
for (int i = 0; i < clean.Rows; i++)
{
inputs.Add(ExtractRow(i < cleanCount ? clean : adversarial, i));
targets.Add(ExtractRow(target, i));
}
return (CombineRows(inputs), CombineRows(targets));
}
private Matrix<T> ExtractRow(Matrix<T> matrix, int rowIndex)
{
var row = new Matrix<T>(1, matrix.Columns);
for (int c = 0; c < matrix.Columns; c++)
{
row[0, c] = matrix[rowIndex, c];
}
return row;
}
private Matrix<T> CombineRows(List<Matrix<T>> rows)
{
if (rows.Count == 0)
throw new ArgumentException("Cannot combine empty list of rows");
int cols = rows[0].Columns;
var combined = new Matrix<T>(rows.Count, cols);
for (int r = 0; r < rows.Count; r++)
{
for (int c = 0; c < cols; c++)
{
combined[r, c] = rows[r][0, c];
}
}
return combined;
}
}
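Wiring it together (a sketch; model and trainLoader are hypothetical, and SGD is the optimizer type used elsewhere in this guide):
// Hypothetical adversarial training setup
var trainer = new AdversarialTrainer<double>(
    model, new CrossEntropyLoss<double>(), new SGD<double>(0.01))
{
    AdversarialRatio = 0.5, // half clean, half adversarial per batch
    Epsilon = 0.1           // perturbation budget used when crafting examples
};
// PGD-based adversarial training is the stronger default
trainer.Train(trainLoader, epochs: 20, AdversarialTrainer<double>.AttackType.PGD);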
Phase 3: RLHF Implementation
AC 3.1: Implement RewardModel
What is a Reward Model? A neural network trained to predict human preferences between two outputs.
Training Data:
Input: "Write a poem about cats"
Output A: "Cats are fluffy and cute..."
Output B: "Felines possess soft fur..."
Human preference: A is better (label = 1)
File: src/RLHF/RewardModel.cs
// File: src/RLHF/RewardModel.cs
namespace AiDotNet.RLHF;
/// <summary>
/// Reward model that predicts human preferences.
/// Trained on pairs of outputs with human preference labels.
/// </summary>
public class RewardModel<T>
{
private readonly IModel<T> _baseModel;
private readonly ILoss<T> _loss;
private readonly IOptimizer<T> _optimizer;
public RewardModel(IModel<T> baseModel, IOptimizer<T> optimizer)
{
_baseModel = baseModel ?? throw new ArgumentNullException(nameof(baseModel));
_optimizer = optimizer ?? throw new ArgumentNullException(nameof(optimizer));
_loss = new CrossEntropyLoss<T>(); // Binary classification
}
/// <summary>
/// Train reward model on preference pairs.
/// </summary>
/// <param name="preferences">List of (outputA, outputB, preferenceLabel)</param>
/// <param name="epochs">Number of training epochs</param>
public void Train(
List<(Matrix<T> OutputA, Matrix<T> OutputB, int PreferredIndex)> preferences,
int epochs)
{
for (int epoch = 0; epoch < epochs; epoch++)
{
double totalLoss = 0.0;
foreach (var pref in preferences)
{
// Get rewards for both outputs
var rewardA = _baseModel.Forward(pref.OutputA, training: true);
var rewardB = _baseModel.Forward(pref.OutputB, training: true);
// Compute preference probability using Bradley-Terry model
// P(A > B) = exp(r_A) / (exp(r_A) + exp(r_B))
var prefProbability = ComputePreferenceProbability(rewardA, rewardB);
// Create target (1 if A preferred, 0 if B preferred)
var target = CreatePreferenceTarget(pref.PreferredIndex);
// Compute loss
var lossValue = _loss.Compute(prefProbability, target);
totalLoss += Convert.ToDouble(lossValue);
// Backpropagate and update.
// NOTE: simplified - the Bradley-Terry probability above is computed outside
// the model graph, so a full implementation must route this 1x2 preference
// gradient back through both reward forward passes via the chain rule.
var gradient = _loss.ComputeGradient(prefProbability, target);
var paramGradients = _baseModel.Backward(gradient);
var parameters = _baseModel.GetParameters();
_optimizer.Update(parameters, paramGradients);
}
double avgLoss = totalLoss / preferences.Count;
Console.WriteLine($"Reward Model Epoch {epoch + 1}/{epochs}, Loss: {avgLoss:F4}");
}
}
/// <summary>
/// Predict reward (score) for an output.
/// Higher reward = better quality according to human preferences.
/// </summary>
public double PredictReward(Matrix<T> output)
{
var reward = _baseModel.Forward(output, training: false);
return Convert.ToDouble(reward[0, 0]); // Scalar reward
}
private Matrix<T> ComputePreferenceProbability(Matrix<T> rewardA, Matrix<T> rewardB)
{
var numOps = NumericOperations<T>.Instance;
// Bradley-Terry: P(A > B) = exp(r_A) / (exp(r_A) + exp(r_B))
double rA = Convert.ToDouble(rewardA[0, 0]);
double rB = Convert.ToDouble(rewardB[0, 0]);
double expA = Math.Exp(rA);
double expB = Math.Exp(rB);
double probA = expA / (expA + expB);
var result = new Matrix<T>(1, 2);
result[0, 0] = numOps.FromDouble(probA); // P(A > B)
result[0, 1] = numOps.FromDouble(1.0 - probA); // P(B > A)
return result;
}
private Matrix<T> CreatePreferenceTarget(int preferredIndex)
{
var numOps = NumericOperations<T>.Instance;
var target = new Matrix<T>(1, 2);
if (preferredIndex == 0) // A is preferred
{
target[0, 0] = numOps.One;
target[0, 1] = numOps.Zero;
}
else // B is preferred
{
target[0, 0] = numOps.Zero;
target[0, 1] = numOps.One;
}
return target;
}
}
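A usage sketch with a toy preference pair (EncodeOutput and scoringNetwork are hypothetical; in practice the matrices would be encodings of the generated text):
// Hypothetical preference pairs: (outputA, outputB, index of preferred output)
var preferences = new List<(Matrix<double> OutputA, Matrix<double> OutputB, int PreferredIndex)>
{
    (EncodeOutput("Cats are fluffy and cute..."),
     EncodeOutput("Felines possess soft fur..."),
     0) // humans preferred output A
};
var rewardModel = new RewardModel<double>(scoringNetwork, new SGD<double>(0.001));
rewardModel.Train(preferences, epochs: 10);
// Higher score = closer to human preference
double score = rewardModel.PredictReward(EncodeOutput("Cats purr softly at dawn."));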
AC 3.2: Implement PPO for RLHF
What is PPO? Proximal Policy Optimization - stable RL algorithm for fine-tuning models with reward.
Key Idea: Update model to increase reward, but not too much at once (to maintain stability).
File: src/RLHF/PPOTrainer.cs
// File: src/RLHF/PPOTrainer.cs
namespace AiDotNet.RLHF;
/// <summary>
/// Proximal Policy Optimization trainer for RLHF.
/// Fine-tunes model using rewards from reward model.
/// </summary>
public class PPOTrainer<T>
{
private readonly IModel<T> _policyModel;
private readonly RewardModel<T> _rewardModel;
private readonly IOptimizer<T> _optimizer;
/// <summary>Clip ratio for PPO (prevents large policy updates)</summary>
public double ClipEpsilon { get; set; } = 0.2;
/// <summary>KL divergence penalty coefficient</summary>
public double KLCoefficient { get; set; } = 0.01;
public PPOTrainer(
IModel<T> policyModel,
RewardModel<T> rewardModel,
IOptimizer<T> optimizer)
{
_policyModel = policyModel ?? throw new ArgumentNullException(nameof(policyModel));
_rewardModel = rewardModel ?? throw new ArgumentNullException(nameof(rewardModel));
_optimizer = optimizer ?? throw new ArgumentNullException(nameof(optimizer));
}
/// <summary>
/// Train policy model using PPO and reward model.
/// </summary>
public void Train(
IDataLoader<T> dataLoader,
int epochs,
int ppoEpochs = 4)
{
// Store initial policy for KL penalty
var initialPolicy = CloneModel(_policyModel);
for (int epoch = 0; epoch < epochs; epoch++)
{
var experienceBuffer = new List<Experience<T>>();
// Collect experience
foreach (var batch in dataLoader.GetBatches())
{
// Generate outputs with current policy
var output = _policyModel.Forward(batch.Input, training: false);
// Get reward from reward model
double reward = _rewardModel.PredictReward(output);
// Store experience
experienceBuffer.Add(new Experience<T>
{
State = batch.Input,
Action = output,
Reward = reward
});
}
// PPO update epochs
for (int ppoEpoch = 0; ppoEpoch < ppoEpochs; ppoEpoch++)
{
double totalLoss = 0.0;
foreach (var experience in experienceBuffer)
{
// Current policy output
var newOutput = _policyModel.Forward(experience.State, training: true);
// Compute PPO loss
var loss = ComputePPOLoss(
experience.Action,
newOutput,
experience.Reward,
initialPolicy.Forward(experience.State, training: false)
);
totalLoss += Convert.ToDouble(loss);
// Update policy
// (Simplified - actual implementation needs proper gradient computation)
var parameters = _policyModel.GetParameters();
// Update using optimizer...
}
Console.WriteLine($"PPO Epoch {ppoEpoch + 1}/{ppoEpochs}, Loss: {totalLoss / experienceBuffer.Count:F4}");
}
}
}
private T ComputePPOLoss(
Matrix<T> oldAction,
Matrix<T> newAction,
double reward,
Matrix<T> initialAction)
{
var numOps = NumericOperations<T>.Instance;
// Compute ratio: π_new(a|s) / π_old(a|s)
// Simplified: using L2 distance as proxy
double ratio = ComputeActionRatio(oldAction, newAction);
// Clipped surrogate objective
double clippedRatio = Math.Max(
Math.Min(ratio, 1.0 + ClipEpsilon),
1.0 - ClipEpsilon
);
double advantage = reward; // Simplified advantage (should be reward - baseline)
double surrogateObjective = Math.Min(
ratio * advantage,
clippedRatio * advantage
);
// KL penalty to prevent drifting too far from initial policy
double klPenalty = ComputeKLDivergence(initialAction, newAction);
// Total loss (negative because we want to maximize)
double loss = -surrogateObjective + KLCoefficient * klPenalty;
return numOps.FromDouble(loss);
}
private double ComputeActionRatio(Matrix<T> oldAction, Matrix<T> newAction)
{
// Simplified ratio computation
// In practice, this should be the probability ratio
double distance = 0.0;
for (int r = 0; r < oldAction.Rows; r++)
{
for (int c = 0; c < oldAction.Columns; c++)
{
double diff = Convert.ToDouble(newAction[r, c]) - Convert.ToDouble(oldAction[r, c]);
distance += diff * diff;
}
}
return Math.Exp(-distance); // Convert distance to ratio-like value
}
private double ComputeKLDivergence(Matrix<T> p, Matrix<T> q)
{
double kl = 0.0;
for (int r = 0; r < p.Rows; r++)
{
for (int c = 0; c < p.Columns; c++)
{
double pVal = Math.Max(Convert.ToDouble(p[r, c]), 1e-10);
double qVal = Math.Max(Convert.ToDouble(q[r, c]), 1e-10);
kl += pVal * Math.Log(pVal / qVal);
}
}
return kl;
}
private IModel<T> CloneModel(IModel<T> model)
{
// Deep copy of model (simplified - a real implementation needs proper cloning).
// NOTE: returning the same reference means the KL penalty compares the policy
// against itself (always ~0), so the penalty is inert until cloning is implemented.
return model;
}
}
public class Experience<T>
{
public Matrix<T> State { get; set; } = new Matrix<T>(0, 0);
public Matrix<T> Action { get; set; } = new Matrix<T>(0, 0);
public double Reward { get; set; }
}
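A usage sketch (policyModel, rewardModel, and promptLoader are hypothetical; the hyperparameter values are common PPO defaults, not requirements):
// Hypothetical PPO fine-tuning setup
var ppoTrainer = new PPOTrainer<double>(policyModel, rewardModel, new SGD<double>(0.0001))
{
    ClipEpsilon = 0.2,   // standard PPO clip range
    KLCoefficient = 0.01 // keeps the policy close to the pre-RLHF model
};
// Each outer epoch: collect outputs and rewards, then run 4 PPO update passes
ppoTrainer.Train(promptLoader, epochs: 5, ppoEpochs: 4);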
Phase 4: Safety Monitoring and Documentation
AC 4.1: Implement ModelCard
What is a Model Card? Documentation of model's capabilities, limitations, biases, and intended use.
File: src/Safety/ModelCard.cs
// File: src/Safety/ModelCard.cs
namespace AiDotNet.Safety;
/// <summary>
/// Model Card for transparent documentation of ML models.
/// Based on "Model Cards for Model Reporting" (Mitchell et al., 2019).
/// </summary>
public class ModelCard
{
/// <summary>Model name and version</summary>
public ModelDetails Details { get; set; } = new ModelDetails();
/// <summary>Intended use cases and users</summary>
public IntendedUse IntendedUse { get; set; } = new IntendedUse();
/// <summary>Training and evaluation data</summary>
public DataInformation Data { get; set; } = new DataInformation();
/// <summary>Performance metrics on different subgroups</summary>
public List<PerformanceMetric> PerformanceMetrics { get; set; } = new List<PerformanceMetric>();
/// <summary>Known limitations and biases</summary>
public List<string> Limitations { get; set; } = new List<string>();
/// <summary>Ethical considerations</summary>
public List<string> EthicalConsiderations { get; set; } = new List<string>();
/// <summary>
/// Export model card to JSON format.
/// </summary>
public string ToJson()
{
return System.Text.Json.JsonSerializer.Serialize(this, new System.Text.Json.JsonSerializerOptions
{
WriteIndented = true
});
}
/// <summary>
/// Export model card to markdown format.
/// </summary>
public string ToMarkdown()
{
var sb = new System.Text.StringBuilder();
sb.AppendLine($"# Model Card: {Details.Name}");
sb.AppendLine();
sb.AppendLine("## Model Details");
sb.AppendLine($"- **Version**: {Details.Version}");
sb.AppendLine($"- **Type**: {Details.ModelType}");
sb.AppendLine($"- **Architecture**: {Details.Architecture}");
sb.AppendLine($"- **Date**: {Details.Date:yyyy-MM-dd}");
sb.AppendLine($"- **Authors**: {string.Join(", ", Details.Authors)}");
sb.AppendLine();
sb.AppendLine("## Intended Use");
sb.AppendLine($"- **Primary Use**: {IntendedUse.PrimaryUse}");
sb.AppendLine($"- **Primary Users**: {string.Join(", ", IntendedUse.PrimaryUsers)}");
sb.AppendLine($"- **Out-of-Scope Uses**: {string.Join(", ", IntendedUse.OutOfScopeUses)}");
sb.AppendLine();
sb.AppendLine("## Training Data");
sb.AppendLine($"- **Dataset**: {Data.TrainingDataset}");
sb.AppendLine($"- **Size**: {Data.TrainingDataSize:N0} examples");
sb.AppendLine($"- **Preprocessing**: {Data.Preprocessing}");
sb.AppendLine();
sb.AppendLine("## Performance Metrics");
foreach (var metric in PerformanceMetrics)
{
sb.AppendLine($"- **{metric.Name}** ({metric.Subgroup}): {metric.Value:F4}");
}
sb.AppendLine();
sb.AppendLine("## Limitations");
foreach (var limitation in Limitations)
{
sb.AppendLine($"- {limitation}");
}
sb.AppendLine();
sb.AppendLine("## Ethical Considerations");
foreach (var consideration in EthicalConsiderations)
{
sb.AppendLine($"- {consideration}");
}
return sb.ToString();
}
}
public class ModelDetails
{
public string Name { get; set; } = string.Empty;
public string Version { get; set; } = "1.0.0";
public string ModelType { get; set; } = string.Empty;
public string Architecture { get; set; } = string.Empty;
public DateTime Date { get; set; } = DateTime.Now;
public List<string> Authors { get; set; } = new List<string>();
public string License { get; set; } = string.Empty;
}
public class IntendedUse
{
public string PrimaryUse { get; set; } = string.Empty;
public List<string> PrimaryUsers { get; set; } = new List<string>();
public List<string> OutOfScopeUses { get; set; } = new List<string>();
}
public class DataInformation
{
public string TrainingDataset { get; set; } = string.Empty;
public int TrainingDataSize { get; set; }
public string EvaluationDataset { get; set; } = string.Empty;
public int EvaluationDataSize { get; set; }
public string Preprocessing { get; set; } = string.Empty;
}
public class PerformanceMetric
{
public string Name { get; set; } = string.Empty;
public string Subgroup { get; set; } = "Overall";
public double Value { get; set; }
}
AC 4.2: Implement SafetyMonitor
What does this do? Monitors model outputs in production for safety issues (toxicity, bias, hallucinations).
File: src/Safety/SafetyMonitor.cs
// File: src/Safety/SafetyMonitor.cs
namespace AiDotNet.Safety;
/// <summary>
/// Monitors model predictions for safety issues in production.
/// </summary>
public class SafetyMonitor<T>
{
private readonly List<ISafetyCheck<T>> _checks;
private readonly List<SafetyIncident<T>> _incidents;
public SafetyMonitor()
{
_checks = new List<ISafetyCheck<T>>();
_incidents = new List<SafetyIncident<T>>();
}
/// <summary>
/// Add a safety check to the monitor.
/// </summary>
public void AddCheck(ISafetyCheck<T> check)
{
_checks.Add(check);
}
/// <summary>
/// Check a prediction for safety issues.
/// </summary>
/// <returns>Safety result with any violations found</returns>
public SafetyResult CheckPrediction(
Matrix<T> input,
Matrix<T> prediction,
Dictionary<string, object>? metadata = null)
{
var violations = new List<SafetyViolation>();
foreach (var check in _checks)
{
var result = check.Check(input, prediction, metadata);
if (!result.IsSafe)
{
violations.AddRange(result.Violations);
}
}
if (violations.Any())
{
// Log incident
_incidents.Add(new SafetyIncident<T>
{
Timestamp = DateTime.UtcNow,
Input = input,
Prediction = prediction,
Violations = violations,
Metadata = metadata
});
}
return new SafetyResult
{
IsSafe = violations.Count == 0,
Violations = violations
};
}
/// <summary>
/// Get all safety incidents.
/// </summary>
public IReadOnlyList<SafetyIncident<T>> GetIncidents()
{
return _incidents.AsReadOnly();
}
/// <summary>
/// Get incident statistics.
/// </summary>
public SafetyStatistics GetStatistics()
{
return new SafetyStatistics
{
TotalIncidents = _incidents.Count,
ViolationsByType = _incidents
.SelectMany(i => i.Violations)
.GroupBy(v => v.Type)
.ToDictionary(g => g.Key, g => g.Count())
};
}
}
/// <summary>
/// Interface for safety checks.
/// </summary>
public interface ISafetyCheck<T>
{
SafetyResult Check(Matrix<T> input, Matrix<T> prediction, Dictionary<string, object>? metadata);
}
/// <summary>
/// Result of safety check.
/// </summary>
public class SafetyResult
{
public bool IsSafe { get; set; }
public List<SafetyViolation> Violations { get; set; } = new List<SafetyViolation>();
}
/// <summary>
/// A specific safety violation.
/// </summary>
public class SafetyViolation
{
public string Type { get; set; } = string.Empty;
public string Description { get; set; } = string.Empty;
public double Severity { get; set; } // 0.0 to 1.0
public Dictionary<string, object> Details { get; set; } = new Dictionary<string, object>();
}
/// <summary>
/// Record of a safety incident.
/// </summary>
public class SafetyIncident
{
public DateTime Timestamp { get; set; }
public Matrix<T> Input { get; set; } = new Matrix<T>(0, 0);
public Matrix<T> Prediction { get; set; } = new Matrix<T>(0, 0);
public List<SafetyViolation> Violations { get; set; } = new List<SafetyViolation>();
public Dictionary<string, object>? Metadata { get; set; }
}
public class SafetyStatistics
{
public int TotalIncidents { get; set; }
public Dictionary<string, int> ViolationsByType { get; set; } = new Dictionary<string, int>();
}
Step 2: Implement common safety checks
// File: src/Safety/SafetyChecks/ConfidenceThresholdCheck.cs
namespace AiDotNet.Safety.SafetyChecks;
/// <summary>
/// Flags predictions with very low confidence as potentially unsafe.
/// </summary>
public class ConfidenceThresholdCheck<T> : ISafetyCheck<T>
{
private readonly double _threshold;
public ConfidenceThresholdCheck(double threshold = 0.5)
{
_threshold = threshold;
}
public SafetyResult Check(
Matrix<T> input,
Matrix<T> prediction,
Dictionary<string, object>? metadata)
{
// Get max probability (confidence)
double maxProb = 0.0;
for (int c = 0; c < prediction.Columns; c++)
{
double prob = Convert.ToDouble(prediction[0, c]);
if (prob > maxProb)
maxProb = prob;
}
if (maxProb < _threshold)
{
return new SafetyResult
{
IsSafe = false,
Violations = new List<SafetyViolation>
{
new SafetyViolation
{
Type = "LowConfidence",
Description = $"Prediction confidence {maxProb:F2} below threshold {_threshold:F2}",
Severity = 1.0 - maxProb,
Details = new Dictionary<string, object>
{
{ "confidence", maxProb },
{ "threshold", _threshold }
}
}
}
};
}
return new SafetyResult { IsSafe = true };
}
}
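As a second example check (a sketch, not part of the spec above), a numeric-validity check that flags NaN or infinite outputs, which usually indicate a broken preprocessing step or a pathological input:
// File: src/Safety/SafetyChecks/NumericValidityCheck.cs (hypothetical example)
namespace AiDotNet.Safety.SafetyChecks;
/// <summary>
/// Flags predictions containing NaN or infinite values.
/// </summary>
public class NumericValidityCheck<T> : ISafetyCheck<T>
{
    public SafetyResult Check(
        Matrix<T> input,
        Matrix<T> prediction,
        Dictionary<string, object>? metadata)
    {
        for (int r = 0; r < prediction.Rows; r++)
        {
            for (int c = 0; c < prediction.Columns; c++)
            {
                double value = Convert.ToDouble(prediction[r, c]);
                if (double.IsNaN(value) || double.IsInfinity(value))
                {
                    return new SafetyResult
                    {
                        IsSafe = false,
                        Violations = new List<SafetyViolation>
                        {
                            new SafetyViolation
                            {
                                Type = "InvalidNumericOutput",
                                Description = $"Non-finite prediction value at ({r}, {c})",
                                Severity = 1.0
                            }
                        }
                    };
                }
            }
        }
        return new SafetyResult { IsSafe = true };
    }
}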
Testing Strategy
Integration Test: End-to-End Safety Pipeline
// File: tests/IntegrationTests/Safety/SafetyPipelineTests.cs
namespace AiDotNet.Tests.Safety;
public class SafetyPipelineTests
{
[Fact]
public void FullPipeline_AdversarialTraining_ImprovesRobustness()
{
// Assumes CreateModel(), TrainModel(), and the cleanData/testData
// fixtures are defined elsewhere in this test class.
// Train baseline model
var baselineModel = CreateModel();
TrainModel(baselineModel, cleanData);
// Evaluate against attacks
var fgsmAttack = new FGSMAttack<double>(baselineModel, new CrossEntropyLoss<double>());
var baselineResult = fgsmAttack.EvaluateAttack(testData, epsilon: 0.1);
// Train adversarially robust model
var robustModel = CreateModel();
var advTrainer = new AdversarialTrainer<double>(
robustModel,
new CrossEntropyLoss<double>(),
new SGD<double>(0.01)
);
advTrainer.Train(cleanData, epochs: 10, AdversarialTrainer<double>.AttackType.PGD);
// Evaluate robust model
var fgsmAttackRobust = new FGSMAttack<double>(robustModel, new CrossEntropyLoss<double>());
var robustResult = fgsmAttackRobust.EvaluateAttack(testData, epsilon: 0.1);
// Robust model should have lower attack success rate
Assert.True(robustResult.SuccessRate < baselineResult.SuccessRate);
}
}
Success Criteria Checklist
- [ ] FGSM generates adversarial examples that fool model
- [ ] PGD creates stronger attacks than FGSM
- [ ] Adversarial training reduces attack success rate by >50%
- [ ] Reward model correctly predicts human preferences
- [ ] PPO improves model reward without catastrophic performance loss
- [ ] Model card documents all required fields
- [ ] Safety monitor flags low-confidence predictions
- [ ] All tests pass with >80% coverage
Example Usage After Implementation
// Create model card
var modelCard = new ModelCard
{
Details = new ModelDetails
{
Name = "Image Classifier v2",
Version = "2.1.0",
ModelType = "Convolutional Neural Network",
Architecture = "ResNet-50",
Authors = new List<string> { "AI Team" }
},
IntendedUse = new IntendedUse
{
PrimaryUse = "Medical image classification",
PrimaryUsers = new List<string> { "Radiologists", "Medical researchers" },
OutOfScopeUses = new List<string> { "Autonomous diagnosis", "Legal decisions" }
},
Limitations = new List<string>
{
"Performance degrades on low-quality images",
"Not tested on pediatric patients",
"May exhibit bias toward certain demographics"
}
};
// Save model card
File.WriteAllText("model_card.md", modelCard.ToMarkdown());
// Setup safety monitoring
var safetyMonitor = new SafetyMonitor<double>();
safetyMonitor.AddCheck(new ConfidenceThresholdCheck<double>(0.7));
// Check prediction
var result = safetyMonitor.CheckPrediction(input, prediction);
if (!result.IsSafe)
{
Console.WriteLine("SAFETY WARNING:");
foreach (var violation in result.Violations)
{
Console.WriteLine($"- {violation.Type}: {violation.Description}");
}
}