[Phase 2 Parity] Implement Standard Image Data Augmentation
Problem: Issue 330 covers preprocessing but not training augmentation. Missing: RandomCrop (CRITICAL), RandomHorizontalFlip (CRITICAL), ColorJitter (HIGH), RandomRotation (HIGH), RandomResizedCrop (HIGH), Normalize (HIGH), GaussianBlur (MEDIUM), RandomAffine (MEDIUM). Architecture: src/Images/Augmentation/. Note: Complements issue 330. Goal: GPU-accelerated, deterministic seeding, parity with torchvision.transforms.
Issue #389: Junior Developer Implementation Guide - Feature Engineering Pipelines
Understanding Feature Engineering Pipelines
What is a Pipeline?
A pipeline chains multiple preprocessing/transformation steps together, ensuring they execute in the correct order with consistent data flow. Pipelines solve three major problems:
- Reproducibility: Same transformations applied consistently to train and test data
- Modularity: Each step is independent and testable
- Convenience: Fit/transform all steps with single commands
Key Concepts
Pipeline Pattern:
// Pipeline: Scaler → Feature Selection → Model
var pipeline = new Pipeline<double, Matrix<double>, Vector<double>>()
.AddStep(new StandardScaler())
.AddStep(new FeatureSelector(method: "variance"))
.AddStep(new LogisticRegression());
// Fit entire pipeline
await pipeline.FitAsync(trainX, trainY);
// Transform new data through entire pipeline
var predictions = await pipeline.TransformAsync(testX);
IPipelineStep Interface (Already exists!):
- FitAsync: Learn from data
- TransformAsync: Apply transformations
- FitTransformAsync: Convenience method
- GetParameters/SetParameters: Persistence
- ValidateInput: Safety checks
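The fit-then-transform chaining this interface enables can be sketched in a few lines of Python. This is an illustrative sketch only, not the AiDotNet API; the class names (CenterStep, SimplePipeline) are hypothetical:

```python
# Minimal sketch of the pipeline pattern: each step learns from the
# output of the previous step, then forwards its transformed data.

class CenterStep:
    """Learns the column means during fit; transform subtracts them."""
    def fit(self, rows):
        n = len(rows)
        self.means = [sum(col) / n for col in zip(*rows)]
    def transform(self, rows):
        return [[x - m for x, m in zip(row, self.means)] for row in rows]

class SimplePipeline:
    def __init__(self, steps):
        self.steps = steps
    def fit(self, rows):
        current = rows
        for step in self.steps:
            step.fit(current)            # learn from the previous step's output
            current = step.transform(current)
    def transform(self, rows):
        current = rows
        for step in self.steps:
            current = step.transform(current)  # reuse learned parameters
        return current

pipe = SimplePipeline([CenterStep()])
pipe.fit([[1.0, 10.0], [3.0, 30.0]])
print(pipe.transform([[2.0, 20.0]]))  # means are [2.0, 20.0], so this prints [[0.0, 0.0]]
```

The key property: new data passed to transform reuses the statistics learned during fit, never its own.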
Phase 1: Core Pipeline Infrastructure
AC 1.1: Implement Pipeline Class
File: src/Preprocessing/Pipeline.cs
namespace AiDotNet.Preprocessing;
/// <summary>
/// Chains multiple preprocessing and modeling steps together into a unified workflow.
/// </summary>
/// <remarks>
/// <para>
/// A Pipeline allows you to:
/// - Chain multiple transformation steps (scalers, selectors, etc.)
/// - Fit all steps sequentially on training data
/// - Transform new data through all steps automatically
/// - Save/load entire pipeline configurations
/// - Ensure consistent preprocessing between training and prediction
/// </para>
/// <para><b>For Beginners:</b> Think of a pipeline like an assembly line.
///
/// Just like a car factory has stages (frame assembly → painting → engine installation),
/// a machine learning pipeline has stages (scaling → feature selection → model training).
///
/// Each stage processes the output from the previous stage. The pipeline ensures:
/// - Stages run in the correct order
/// - Each stage uses the right settings (learned during training)
/// - New data goes through the exact same process as training data
///
/// Example:
/// 1. StandardScaler: Normalizes features
/// 2. FeatureSelector: Picks best features
/// 3. LogisticRegression: Makes predictions
///
/// Without a pipeline, you'd have to manually track and apply each step. The pipeline
/// automates this and prevents mistakes like forgetting a step or using wrong parameters.
/// </para>
/// </remarks>
/// <typeparam name="T">Numeric type for calculations.</typeparam>
/// <typeparam name="TInput">Input data type.</typeparam>
/// <typeparam name="TOutput">Output data type.</typeparam>
public class Pipeline<T, TInput, TOutput>
{
private readonly List<IPipelineStep<T, TInput, TOutput>> _steps;
private readonly INumericOperations<T> _numOps;
private bool _isFitted;
/// <summary>
/// Gets whether the pipeline has been fitted.
/// </summary>
public bool IsFitted => _isFitted;
/// <summary>
/// Gets the number of steps in the pipeline.
/// </summary>
public int StepCount => _steps.Count;
/// <summary>
/// Initializes a new empty pipeline.
/// </summary>
public Pipeline()
{
_steps = new List<IPipelineStep<T, TInput, TOutput>>();
_numOps = NumericOperations<T>.Instance;
_isFitted = false;
}
/// <summary>
/// Adds a step to the end of the pipeline.
/// </summary>
/// <param name="step">The pipeline step to add.</param>
/// <returns>The pipeline (for method chaining).</returns>
/// <remarks>
/// <para><b>For Beginners:</b> This adds another stage to your assembly line.
///
/// You can chain multiple AddStep calls:
/// ```csharp
/// var pipeline = new Pipeline()
/// .AddStep(scaler)
/// .AddStep(selector)
/// .AddStep(model);
/// ```
///
/// Steps execute in the order added. The output of step 1 becomes input to step 2, etc.
/// </para>
/// </remarks>
public Pipeline<T, TInput, TOutput> AddStep(IPipelineStep<T, TInput, TOutput> step)
{
if (step == null)
throw new ArgumentNullException(nameof(step));
_steps.Add(step);
_isFitted = false; // Adding a step invalidates fitting
return this;
}
/// <summary>
/// Removes a step at the specified index.
/// </summary>
/// <param name="index">Index of step to remove.</param>
/// <returns>The pipeline (for method chaining).</returns>
public Pipeline<T, TInput, TOutput> RemoveStep(int index)
{
if (index < 0 || index >= _steps.Count)
throw new ArgumentOutOfRangeException(nameof(index));
_steps.RemoveAt(index);
_isFitted = false;
return this;
}
/// <summary>
/// Gets the step at the specified index.
/// </summary>
public IPipelineStep<T, TInput, TOutput> GetStep(int index)
{
if (index < 0 || index >= _steps.Count)
throw new ArgumentOutOfRangeException(nameof(index));
return _steps[index];
}
/// <summary>
/// Fits the entire pipeline on training data.
/// </summary>
/// <param name="inputs">Training input data.</param>
/// <param name="targets">Training target data (optional).</param>
/// <returns>Task representing the async operation.</returns>
/// <remarks>
/// <para>
/// Fitting the pipeline:
/// 1. Step 1 fits on original inputs, transforms inputs
/// 2. Step 2 fits on transformed inputs from step 1, transforms further
/// 3. Step 3 fits on transformed inputs from step 2, etc.
///
/// Each step learns from the output of the previous step.
/// </para>
/// <para><b>For Beginners:</b> This trains each stage of your pipeline.
///
/// Think of it like training factory workers at each station:
/// - Worker 1 (scaler) learns the min/max of raw materials
/// - Worker 2 (selector) learns which features are important in scaled materials
/// - Worker 3 (model) learns patterns in the selected features
///
/// Each worker learns from what they see (output of previous worker).
/// </para>
/// </remarks>
public async Task FitAsync(TInput inputs, TOutput? targets = default)
{
if (_steps.Count == 0)
throw new InvalidOperationException("Pipeline has no steps. Add steps before fitting.");
TInput currentInput = inputs;
TOutput? currentTargets = targets;
for (int i = 0; i < _steps.Count; i++)
{
var step = _steps[i];
// Validate input before fitting
if (!step.ValidateInput(currentInput))
{
throw new InvalidOperationException(
$"Step {i} ({step.GetType().Name}) validation failed. " +
$"Input data is not compatible with this step.");
}
// Fit and transform this step
await step.FitAsync(currentInput, currentTargets);
var output = await step.TransformAsync(currentInput);
// TransformAsync returns TOutput, so the output must be convertible
// back to TInput before it can feed the next step (for most pipelines
// TInput and TOutput are the same type)
if (i < _steps.Count - 1)
{
if (output is TInput nextInput)
{
currentInput = nextInput;
}
else
{
throw new InvalidOperationException(
$"Step {i} output type {typeof(TOutput)} is not compatible " +
$"with next step input type {typeof(TInput)}.");
}
}
// For intermediate steps, targets remain unchanged;
// the final step may use targets for supervised learning
_isFitted = true;
}
/// <summary>
/// Transforms input data through all fitted pipeline steps.
/// </summary>
/// <param name="inputs">Input data to transform.</param>
/// <returns>Transformed output data.</returns>
/// <remarks>
/// <para>
/// Must call FitAsync before TransformAsync.
/// Applies each step's transform sequentially using the statistics learned during fitting.
/// </para>
/// <para><b>For Beginners:</b> This runs new data through your trained pipeline.
///
/// Like a product going through the assembly line:
/// - Raw materials enter station 1 (scaler uses learned min/max)
/// - Scaled materials go to station 2 (selector uses learned feature list)
/// - Selected features go to station 3 (model uses learned weights)
/// - Final prediction comes out
///
/// Each station uses what it learned during training (FitAsync).
/// </para>
/// </remarks>
public async Task<TOutput> TransformAsync(TInput inputs)
{
if (!_isFitted)
throw new InvalidOperationException("Pipeline must be fitted before transforming. Call FitAsync first.");
if (_steps.Count == 0)
throw new InvalidOperationException("Pipeline has no steps.");
TInput currentInput = inputs;
for (int i = 0; i < _steps.Count; i++)
{
var step = _steps[i];
// Validate input
if (!step.ValidateInput(currentInput))
{
throw new InvalidOperationException(
$"Step {i} ({step.GetType().Name}) validation failed during transform.");
}
// Transform through this step
var output = await step.TransformAsync(currentInput);
// If this is the last step, return the output
if (i == _steps.Count - 1)
{
return output;
}
// Otherwise, use output as input for next step
// This requires TOutput to be convertible to TInput
// For most pipelines, TInput and TOutput are the same type
if (output is TInput nextInput)
{
currentInput = nextInput;
}
else
{
throw new InvalidOperationException(
$"Step {i} output type {typeof(TOutput)} is not compatible " +
$"with next step input type {typeof(TInput)}.");
}
}
throw new InvalidOperationException("Pipeline transform completed without returning output.");
}
/// <summary>
/// Fits and transforms in one operation.
/// </summary>
/// <param name="inputs">Training data.</param>
/// <param name="targets">Target data (optional).</param>
/// <returns>Transformed output data.</returns>
public async Task<TOutput> FitTransformAsync(TInput inputs, TOutput? targets = default)
{
await FitAsync(inputs, targets);
return await TransformAsync(inputs);
}
/// <summary>
/// Gets parameters for all steps in the pipeline.
/// </summary>
/// <returns>List of parameter dictionaries, one per step.</returns>
public List<Dictionary<string, object>> GetAllParameters()
{
return _steps.Select(step => step.GetParameters()).ToList();
}
/// <summary>
/// Sets parameters for all steps in the pipeline.
/// </summary>
/// <param name="allParameters">List of parameter dictionaries, one per step.</param>
public void SetAllParameters(List<Dictionary<string, object>> allParameters)
{
if (allParameters.Count != _steps.Count)
{
throw new ArgumentException(
$"Parameter count ({allParameters.Count}) doesn't match step count ({_steps.Count}).");
}
for (int i = 0; i < _steps.Count; i++)
{
_steps[i].SetParameters(allParameters[i]);
}
_isFitted = true; // Restored from saved parameters
}
/// <summary>
/// Gets metadata for all steps.
/// </summary>
public List<Dictionary<string, string>> GetAllMetadata()
{
return _steps.Select(step => step.GetMetadata()).ToList();
}
/// <summary>
/// Clears all steps from the pipeline.
/// </summary>
public void Clear()
{
_steps.Clear();
_isFitted = false;
}
/// <summary>
/// Creates a clone of this pipeline with the same steps.
/// </summary>
/// <remarks>
/// The clone has the same steps but is not fitted.
/// You must call FitAsync on the clone before using it.
/// </remarks>
public Pipeline<T, TInput, TOutput> Clone()
{
var clone = new Pipeline<T, TInput, TOutput>();
// Clone each step (requires steps to be cloneable)
foreach (var step in _steps)
{
// For now, just add the same step reference
// TODO: Implement proper deep cloning if steps support ICloneable
clone.AddStep(step);
}
return clone;
}
}
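The persistence contract behind GetAllParameters/SetAllParameters is worth seeing in miniature: a fitted step exposes its learned state as a plain dictionary, and a fresh step restored from that dictionary transforms identically, with no refit. A hedged Python sketch (ScaleStep is a hypothetical name, not part of AiDotNet):

```python
# Parameter round-trip: fit once, save the learned state, restore it
# into a fresh instance, and get identical transforms.

class ScaleStep:
    """Learns a single scale factor (the max absolute value)."""
    def fit(self, values):
        self.scale = max(abs(v) for v in values)
    def transform(self, values):
        return [v / self.scale for v in values]
    def get_parameters(self):
        return {"scale": self.scale}
    def set_parameters(self, params):
        self.scale = params["scale"]

fitted = ScaleStep()
fitted.fit([2.0, -4.0, 1.0])
saved = fitted.get_parameters()    # e.g. serialized to JSON on disk

restored = ScaleStep()
restored.set_parameters(saved)     # no refit needed
print(restored.transform([2.0]))   # [0.5], same as the fitted step
```

This is why SetAllParameters marks the pipeline as fitted: the learned state is fully reconstructed from the saved dictionaries.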
AC 1.2: Create PipelineStep Base Class
File: src/Preprocessing/PipelineStepBase.cs
namespace AiDotNet.Preprocessing;
/// <summary>
/// Base class for pipeline steps with common functionality.
/// </summary>
/// <remarks>
/// <para><b>For Beginners:</b> This provides common functionality for all pipeline steps.
///
/// When creating a new pipeline step (like a scaler or selector), inherit from this base class.
/// It handles:
/// - Parameter storage and retrieval
/// - Basic validation
/// - Metadata management
///
/// You only need to implement the specific fit/transform logic for your step.
/// </para>
/// </remarks>
/// <typeparam name="T">Numeric type.</typeparam>
/// <typeparam name="TInput">Input type.</typeparam>
/// <typeparam name="TOutput">Output type.</typeparam>
public abstract class PipelineStepBase<T, TInput, TOutput> : IPipelineStep<T, TInput, TOutput>
{
protected readonly INumericOperations<T> NumOps;
protected Dictionary<string, object> Parameters;
protected bool IsFitted;
/// <summary>
/// Initializes the base pipeline step.
/// </summary>
protected PipelineStepBase()
{
NumOps = NumericOperations<T>.Instance;
Parameters = new Dictionary<string, object>();
IsFitted = false;
}
/// <summary>
/// Fits this step on training data.
/// </summary>
public abstract Task FitAsync(TInput inputs, TOutput? targets = default);
/// <summary>
/// Transforms input data using fitted parameters.
/// </summary>
public abstract Task<TOutput> TransformAsync(TInput inputs);
/// <summary>
/// Fits and transforms in one operation.
/// </summary>
public virtual async Task<TOutput> FitTransformAsync(TInput inputs, TOutput? targets = default)
{
await FitAsync(inputs, targets);
return await TransformAsync(inputs);
}
/// <summary>
/// Gets the parameters of this step.
/// </summary>
public virtual Dictionary<string, object> GetParameters()
{
return new Dictionary<string, object>(Parameters);
}
/// <summary>
/// Sets the parameters of this step.
/// </summary>
public virtual void SetParameters(Dictionary<string, object> parameters)
{
Parameters = new Dictionary<string, object>(parameters);
IsFitted = true;
}
/// <summary>
/// Validates input data.
/// </summary>
public virtual bool ValidateInput(TInput inputs)
{
if (inputs == null)
return false;
// Type-specific validation
if (inputs is Matrix<T> matrix)
{
return matrix.Rows > 0 && matrix.Columns > 0;
}
else if (inputs is Tensor<T> tensor)
{
return tensor.Shape.All(dim => dim > 0);
}
else if (inputs is Vector<T> vector)
{
return vector.Length > 0;
}
return true;
}
/// <summary>
/// Gets metadata about this step.
/// </summary>
public virtual Dictionary<string, string> GetMetadata()
{
return new Dictionary<string, string>
{
{ "Type", GetType().Name },
{ "IsFitted", IsFitted.ToString() },
{ "ParameterCount", Parameters.Count.ToString() }
};
}
/// <summary>
/// Ensures the step has been fitted before transformation.
/// </summary>
protected void EnsureFitted()
{
if (!IsFitted)
{
throw new InvalidOperationException(
$"{GetType().Name} must be fitted before transformation. Call FitAsync first.");
}
}
}
Phase 2: Scaler Pipeline Steps
AC 2.1: StandardScalerStep Wrapper
File: src/Preprocessing/StandardScalerStep.cs
namespace AiDotNet.Preprocessing;
/// <summary>
/// Pipeline step wrapper for StandardScaler.
/// </summary>
/// <remarks>
/// <para><b>For Beginners:</b> This wraps StandardScaler so it can be used in pipelines.
///
/// StandardScaler normalizes data to mean=0, stddev=1.
/// This wrapper makes it compatible with the pipeline interface.
/// </para>
/// </remarks>
public class StandardScalerStep<T> : PipelineStepBase<T, Matrix<T>, Matrix<T>>
{
private readonly StandardScaler<T, Matrix<T>, Vector<T>> _scaler;
private List<NormalizationParameters<T>>? _fittedParameters;
public StandardScalerStep()
{
_scaler = new StandardScaler<T, Matrix<T>, Vector<T>>();
}
public override async Task FitAsync(Matrix<T> inputs, Matrix<T>? targets = default)
{
_fittedParameters = _scaler.Fit(inputs);
IsFitted = true;
// Store parameters for serialization
Parameters["Mean"] = _fittedParameters.Select(p => p.Mean).ToList();
Parameters["StdDev"] = _fittedParameters.Select(p => p.StdDev).ToList();
await Task.CompletedTask;
}
public override async Task<Matrix<T>> TransformAsync(Matrix<T> inputs)
{
EnsureFitted();
if (_fittedParameters == null)
throw new InvalidOperationException("Scaler parameters not available.");
var result = _scaler.Transform(inputs, _fittedParameters);
return await Task.FromResult(result);
}
public override Dictionary<string, string> GetMetadata()
{
var metadata = base.GetMetadata();
metadata["ScalerType"] = "StandardScaler";
metadata["FeatureCount"] = _fittedParameters?.Count.ToString() ?? "0";
return metadata;
}
}
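The math the wrapped StandardScaler applies per column is z = (x - mean) / stddev. A quick sketch (using the population standard deviation; the library's convention may differ) showing how the fitted statistics are reused on test data:

```python
# Standardization: learn mean/std on training data, apply everywhere.

def fit_standard(values):
    n = len(values)
    mean = sum(values) / n
    var = sum((v - mean) ** 2 for v in values) / n
    return mean, var ** 0.5

def transform_standard(values, mean, std):
    return [(v - mean) / std for v in values]

mean, std = fit_standard([1.0, 2.0, 3.0, 4.0])    # mean = 2.5, std = sqrt(1.25)
train_z = transform_standard([1.0, 2.0, 3.0, 4.0], mean, std)
test_z = transform_standard([2.5], mean, std)      # uses training statistics
print(round(sum(train_z), 10), test_z)             # 0.0 [0.0]
```

Standardized training data always sums to zero; test data lands wherever the training statistics place it.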
AC 2.2: MinMaxScalerStep Wrapper
Similar implementation for MinMaxScaler.
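The MinMaxScaler analogue uses x' = (x - min) / (max - min), mapping each column into [0, 1] using the min/max learned on training data. A short sketch of the formula (not the AiDotNet implementation):

```python
# Min-max scaling: the [0, 1] range is only guaranteed for training data.

def fit_minmax(values):
    return min(values), max(values)

def transform_minmax(values, lo, hi):
    return [(v - lo) / (hi - lo) for v in values]

lo, hi = fit_minmax([10.0, 20.0, 30.0])
print(transform_minmax([10.0, 20.0, 30.0], lo, hi))  # [0.0, 0.5, 1.0]
print(transform_minmax([40.0], lo, hi))              # [1.5]; test data can fall outside [0, 1]
```

Note the second print: test values outside the training range scale beyond [0, 1], which is expected behavior, not a bug.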
Phase 3: Feature Engineering Steps
AC 3.1: PolynomialFeatures Step
File: src/Preprocessing/PolynomialFeaturesStep.cs
namespace AiDotNet.Preprocessing;
/// <summary>
/// Generates polynomial and interaction features.
/// </summary>
/// <remarks>
/// <para>
/// Creates new features by computing polynomial combinations of input features.
/// For example, with degree=2 and features [x1, x2]:
/// - Original features: x1, x2
/// - Polynomial features: 1, x1, x2, x1², x1*x2, x2²
/// </para>
/// <para><b>For Beginners:</b> This creates new features by combining existing ones.
///
/// Think of it like creating new measurements from existing ones:
/// - Original features: height, weight
/// - Polynomial features: height², weight², height*weight
///
/// Why do this?
/// - Captures non-linear relationships (like BMI = weight/height²)
/// - Allows linear models to learn curved patterns
/// - Finds interactions between features
///
/// Example with degree=2:
/// - Input: [2, 3]
/// - Output: [1, 2, 3, 4, 6, 9]
/// - Meaning: [1, x1, x2, x1², x1*x2, x2²]
/// </para>
/// </remarks>
public class PolynomialFeaturesStep<T> : PipelineStepBase<T, Matrix<T>, Matrix<T>>
{
private readonly int _degree;
private readonly bool _includeBias;
private readonly bool _interactionOnly;
private int _inputFeatures;
/// <summary>
/// Initializes polynomial feature generator.
/// </summary>
/// <param name="degree">Maximum degree of polynomial features (default: 2).</param>
/// <param name="includeBias">Include bias column (all 1s) (default: true).</param>
/// <param name="interactionOnly">Only create interaction terms, not powers (default: false).</param>
public PolynomialFeaturesStep(int degree = 2, bool includeBias = true, bool interactionOnly = false)
{
if (degree < 1)
throw new ArgumentException("Degree must be at least 1.", nameof(degree));
_degree = degree;
_includeBias = includeBias;
_interactionOnly = interactionOnly;
}
public override async Task FitAsync(Matrix<T> inputs, Matrix<T>? targets = default)
{
_inputFeatures = inputs.Columns;
IsFitted = true;
Parameters["Degree"] = _degree;
Parameters["IncludeBias"] = _includeBias;
Parameters["InteractionOnly"] = _interactionOnly;
Parameters["InputFeatures"] = _inputFeatures;
await Task.CompletedTask;
}
public override async Task<Matrix<T>> TransformAsync(Matrix<T> inputs)
{
EnsureFitted();
if (inputs.Columns != _inputFeatures)
{
throw new ArgumentException(
$"Expected {_inputFeatures} features, got {inputs.Columns}.");
}
var outputFeatures = CalculateOutputFeatureCount();
var result = new Matrix<T>(inputs.Rows, outputFeatures);
for (int row = 0; row < inputs.Rows; row++)
{
var inputRow = inputs.GetRow(row);
var outputRow = GeneratePolynomialFeatures(inputRow);
result.SetRow(row, outputRow);
}
return await Task.FromResult(result);
}
/// <summary>
/// Generates polynomial features for a single sample.
/// </summary>
private Vector<T> GeneratePolynomialFeatures(Vector<T> input)
{
var features = new List<T>();
// Add bias term if requested
if (_includeBias)
{
features.Add(NumOps.One);
}
// Add original features
for (int i = 0; i < input.Length; i++)
{
features.Add(input[i]);
}
// Add higher degree terms
if (_degree >= 2)
{
for (int d = 2; d <= _degree; d++)
{
GenerateDegreeCombinations(input, d, features);
}
}
return Vector<T>.FromEnumerable(features);
}
/// <summary>
/// Generates all combinations of features for a given degree.
/// </summary>
private void GenerateDegreeCombinations(Vector<T> input, int degree, List<T> features)
{
// Generate all combinations of 'degree' features
var indices = new int[degree];
GenerateCombinationsRecursive(input, indices, 0, 0, degree, features);
}
/// <summary>
/// Recursively generates feature combinations.
/// </summary>
private void GenerateCombinationsRecursive(
Vector<T> input,
int[] indices,
int start,
int depth,
int maxDepth,
List<T> features)
{
if (depth == maxDepth)
{
// Calculate product of selected features
T product = NumOps.One;
bool isInteraction = indices.Distinct().Count() > 1;
// Skip if interaction_only=true and this is a power term
if (_interactionOnly && !isInteraction)
{
return;
}
for (int i = 0; i < indices.Length; i++)
{
product = NumOps.Multiply(product, input[indices[i]]);
}
features.Add(product);
return;
}
// Generate combinations
for (int i = start; i < input.Length; i++)
{
indices[depth] = i;
GenerateCombinationsRecursive(input, indices, i, depth + 1, maxDepth, features);
}
}
/// <summary>
/// Calculates the number of output features.
/// </summary>
private int CalculateOutputFeatureCount()
{
int count = 0;
if (_includeBias)
count++;
// Original features
count += _inputFeatures;
// Higher degree combinations
for (int d = 2; d <= _degree; d++)
{
if (_interactionOnly)
{
// Only interaction terms: C(n, d)
count += Combinations(_inputFeatures, d);
}
else
{
// All combinations with replacement: C(n+d-1, d)
count += Combinations(_inputFeatures + d - 1, d);
}
}
return count;
}
/// <summary>
/// Calculates binomial coefficient C(n, k).
/// </summary>
private int Combinations(int n, int k)
{
if (k > n) return 0;
if (k == 0 || k == n) return 1;
// Use the formula C(n,k) = n! / (k! * (n-k)!)
// Optimized to avoid overflow
int result = 1;
for (int i = 1; i <= k; i++)
{
result = result * (n - i + 1) / i;
}
return result;
}
public override Dictionary<string, string> GetMetadata()
{
var metadata = base.GetMetadata();
metadata["Degree"] = _degree.ToString();
metadata["IncludeBias"] = _includeBias.ToString();
metadata["InteractionOnly"] = _interactionOnly.ToString();
metadata["InputFeatures"] = _inputFeatures.ToString();
metadata["OutputFeatures"] = IsFitted ? CalculateOutputFeatureCount().ToString() : "Unknown";
return metadata;
}
}
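The recursive combination generator above can be cross-checked against a short Python sketch using itertools: for degree d, the non-bias terms are products over combinations with replacement (or plain combinations when interaction-only), and the counts match the C(n+d-1, d) and C(n, d) formulas in CalculateOutputFeatureCount. The helper name poly_features is ours, for illustration:

```python
# Cross-check of the polynomial expansion and its feature count.
from itertools import combinations, combinations_with_replacement
from math import comb, prod

def poly_features(x, degree=2, include_bias=True, interaction_only=False):
    out = [1.0] if include_bias else []
    for d in range(1, degree + 1):
        combos = combinations(range(len(x)), d) if interaction_only \
            else combinations_with_replacement(range(len(x)), d)
        out.extend(prod(x[i] for i in idx) for idx in combos)
    return out

# Matches the doc comment: [1, x1, x2, x1^2, x1*x2, x2^2]
print(poly_features([2.0, 3.0]))  # [1.0, 2.0, 3.0, 4.0, 6.0, 9.0]

# Feature count for n=2, degree=2: 1 (bias) + C(2,1) + C(2+2-1, 2) = 1 + 2 + 3 = 6
print(len(poly_features([2.0, 3.0])) == 1 + comb(2, 1) + comb(3, 2))  # True
```

Generating the same expansion two independent ways like this is a cheap way to unit-test the recursive C# implementation.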
Phase 4: Complete Pipeline Example
AC 4.1: End-to-End Pipeline Usage
// File: examples/PipelineExample.cs
public class PipelineExample
{
public async Task RunExample()
{
// Create sample data
var trainX = new Matrix<double>(new[,] {
{ 1.0, 2.0 },
{ 2.0, 4.0 },
{ 3.0, 6.0 },
{ 4.0, 8.0 }
});
var trainY = new Matrix<double>(new[,] {
{ 5.0 },
{ 10.0 },
{ 15.0 },
{ 20.0 }
});
// Build pipeline
var pipeline = new Pipeline<double, Matrix<double>, Matrix<double>>()
.AddStep(new StandardScalerStep<double>()) // Step 1: Standardize
.AddStep(new PolynomialFeaturesStep<double>(degree: 2)) // Step 2: Add polynomial features
.AddStep(new MinMaxScalerStep<double>()); // Step 3: Scale to [0,1]
// Fit pipeline on training data
await pipeline.FitAsync(trainX);
// Transform training data
var trainTransformed = await pipeline.TransformAsync(trainX);
// Transform test data (uses statistics from training)
var testX = new Matrix<double>(new[,] {
{ 2.5, 5.0 },
{ 3.5, 7.0 }
});
var testTransformed = await pipeline.TransformAsync(testX);
// Save pipeline parameters
var parameters = pipeline.GetAllParameters();
// ... serialize parameters to JSON/file
// Load pipeline parameters later
var newPipeline = new Pipeline<double, Matrix<double>, Matrix<double>>()
.AddStep(new StandardScalerStep<double>())
.AddStep(new PolynomialFeaturesStep<double>(degree: 2))
.AddStep(new MinMaxScalerStep<double>());
newPipeline.SetAllParameters(parameters);
// Use loaded pipeline
var predictions = await newPipeline.TransformAsync(testX);
}
}
Common Pitfalls to Avoid
- Wrong Step Order: Order matters!
// WRONG - Polynomial after scaling makes tiny values
.AddStep(new StandardScaler())
.AddStep(new PolynomialFeatures(degree: 3)) // Creates tiny^3 values
// CORRECT - Polynomial before final scaling
.AddStep(new PolynomialFeatures(degree: 3))
.AddStep(new StandardScaler())
- Forgetting to Fit: Must fit before transform
// WRONG
var result = await pipeline.TransformAsync(data); // Error!
// CORRECT
await pipeline.FitAsync(trainData);
var result = await pipeline.TransformAsync(testData);
- Type Mismatches: Pipeline steps must have compatible types
// Step output type must match next step input type
// Most steps: Matrix<T> → Matrix<T>
- Data Leakage: Never fit on test data
// WRONG
await pipeline.FitAsync(testData);
// CORRECT
await pipeline.FitAsync(trainData);
await pipeline.TransformAsync(testData);
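The data-leakage pitfall is easy to demonstrate numerically: if the scaler's fit sees test data, its learned statistics shift, and the same test point gets a different (dishonestly optimistic) value. A small sketch using min-max scaling:

```python
# Why fitting on test data leaks information.

def fit_minmax(values):
    return min(values), max(values)

def scale(v, lo, hi):
    return (v - lo) / (hi - lo)

train = [0.0, 10.0]
test = [5.0, 50.0]

lo, hi = fit_minmax(train)                   # CORRECT: fit on training data only
print(scale(test[1], lo, hi))                # 5.0 (honestly outside the training range)

lo_leak, hi_leak = fit_minmax(train + test)  # WRONG: test data influenced the fit
print(scale(test[1], lo_leak, hi_leak))      # 1.0 (the test distribution leaked into the fit)
```

The leaked version makes the test point look in-range, which hides exactly the kind of distribution shift your evaluation is supposed to detect.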
Testing Strategy
Unit Tests
[Fact]
public async Task Pipeline_FitTransform_ExecutesStepsInOrder()
{
// Test that steps execute sequentially
// Test that each step receives correct input
}
[Fact]
public async Task Pipeline_SaveLoad_PreservesParameters()
{
// Fit pipeline, save parameters
// Create new pipeline, load parameters
// Verify identical transforms
}
[Fact]
public async Task PolynomialFeatures_Degree2_GeneratesCorrectFeatures()
{
// Input: [2, 3]
// Expected: [1, 2, 3, 4, 6, 9]
}
Integration Tests
[Fact]
public async Task EndToEndPipeline_TrainTest_ProducesConsistentResults()
{
// Build full pipeline
// Fit on train, transform both train and test
// Verify no data leakage
}
Next Steps
- Implement Pipeline class
- Create PipelineStepBase
- Implement StandardScalerStep, MinMaxScalerStep, RobustScalerStep
- Implement PolynomialFeaturesStep
- Write comprehensive tests
- Add pipeline serialization (JSON)
- Create integration examples
Estimated Effort: 4-5 days for a junior developer
Files to Create:
- src/Preprocessing/Pipeline.cs
- src/Preprocessing/PipelineStepBase.cs
- src/Preprocessing/StandardScalerStep.cs
- src/Preprocessing/MinMaxScalerStep.cs
- src/Preprocessing/RobustScalerStep.cs
- src/Preprocessing/PolynomialFeaturesStep.cs
- tests/UnitTests/Preprocessing/PipelineTests.cs
- tests/UnitTests/Preprocessing/PolynomialFeaturesTests.cs