
Create GlooSharp NuGet Package for Native Gloo C++ Integration

Open ooples opened this issue 2 months ago • 0 comments

User Story

As a distributed training developer using AiDotNet on high-performance compute clusters, I want native Gloo library integration through a GlooSharp NuGet package, so that I can leverage optimized collective operations for CPU and InfiniBand hardware without falling back to TCP-only implementations.


Problem Statement

Current State:

The GlooCommunicationBackend<T> class (src/DistributedTraining/GlooCommunicationBackend.cs) contains detection logic for a "GlooSharp" package that does not exist on NuGet.org:

// Lines 104-124: Dead code - GlooSharp package does not exist
var glooType = Type.GetType("Gloo.Context, GlooSharp");
if (glooType != null)
{
    // This code path is never reached
    throw new NotImplementedException(
        "GlooCommunicationBackend with Gloo library support is not yet fully implemented...");
}

Problems with Current Approach:

  1. Non-Existent Dependency: The code references a "GlooSharp" package that doesn't exist on NuGet.org
  2. Always Falls Back to TCP: Detection always fails, forcing TCP mode even when the user wants native Gloo
  3. Misleading Documentation: Code comments suggest Gloo integration exists when it doesn't
  4. No InfiniBand Support: The TCP fallback doesn't support high-performance RDMA networks
  5. Performance Gap: The TCP implementation is production-ready but significantly slower than native Gloo on supported hardware

Impact:

  • Users on InfiniBand clusters cannot use native RDMA for collective operations
  • High-performance computing (HPC) environments are limited to TCP performance
  • There is no way to leverage Gloo's hardware-specific optimizations, even if the user installs native Gloo

Proposed Solution

Create GlooSharp - a .NET wrapper NuGet package providing P/Invoke bindings to the native Gloo C++ library.
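P/Invoke can only call flat C-style exports, not C++ classes, so bindings of this kind would most likely go through a small `extern "C"` shim compiled alongside Gloo. The sketch below shows what the managed side of one such binding might look like. The library name `gloosharp_native`, the export `gloosharp_allreduce_sum_f32`, its signature, and its status-code convention are all hypothetical names invented for illustration; none of them exist in Gloo or in AiDotNet today. It also assumes .NET Core 3.0 or later for `NativeLibrary`.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical bindings: "gloosharp_native" and the gloosharp_* export are
// illustrative names for a C shim that would have to be written around Gloo.
internal static class NativeMethods
{
    internal const string Library = "gloosharp_native";

    // Assumed convention: returns 0 on success, non-zero on failure.
    [DllImport(Library, CallingConvention = CallingConvention.Cdecl)]
    internal static extern int gloosharp_allreduce_sum_f32(
        IntPtr context, [In, Out] float[] buffer, UIntPtr count);
}

public static class GlooNative
{
    // True when the (hypothetical) native shim can be loaded on this machine.
    public static bool IsAvailable()
    {
        if (!NativeLibrary.TryLoad(NativeMethods.Library, typeof(GlooNative).Assembly,
                                   null, out IntPtr handle))
        {
            return false;
        }
        NativeLibrary.Free(handle); // only probing; release the handle again
        return true;
    }

    // Validates arguments in managed code, then forwards to the native export.
    public static void AllReduceSum(IntPtr context, float[] buffer)
    {
        if (buffer is null) throw new ArgumentNullException(nameof(buffer));

        int status = NativeMethods.gloosharp_allreduce_sum_f32(
            context, buffer, (UIntPtr)(uint)buffer.Length);
        if (status != 0)
        {
            throw new InvalidOperationException(
                $"Native all-reduce failed with status {status}.");
        }
    }
}
```

Keeping argument validation and error translation in a thin managed wrapper means callers never see raw status codes, and `IsAvailable` gives the backend a way to probe for the native library without triggering a `DllNotFoundException`.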

Design Philosophy

  1. Optional Dependency: GlooSharp is an optional package users install when they need native Gloo performance
  2. Platform-Specific Binaries: Include native Gloo libraries for Windows, Linux, and macOS
  3. Graceful Fallback: If GlooSharp isn't installed, GlooCommunicationBackend continues using TCP mode
  4. Zero Breaking Changes: Existing code continues working without GlooSharp
  5. Production Ready: Only ship when P/Invoke bindings are stable and tested
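The graceful-fallback principle could be sketched as a probe that reports which transport is usable instead of throwing. The type name string is taken from the existing detection code; `GlooTransportMode` and `GlooBackendProbe` are hypothetical names used only for this illustration, not part of the current AiDotNet API.

```csharp
using System;

// Hypothetical enum describing which transport the backend should use.
public enum GlooTransportMode
{
    Tcp,
    NativeGloo
}

// Hypothetical helper; not part of the current AiDotNet code base.
public static class GlooBackendProbe
{
    // Assembly-qualified type name copied from the existing detection logic.
    private const string GlooContextTypeName = "Gloo.Context, GlooSharp";

    // Reports NativeGloo only when the optional assembly and type resolve;
    // otherwise reports Tcp so existing behaviour is preserved.
    public static GlooTransportMode Detect(string typeName = GlooContextTypeName)
    {
        // With throwOnError: false, GetType returns null when the type or its
        // assembly cannot be found, rather than throwing.
        var glooType = Type.GetType(typeName, throwOnError: false);
        return glooType != null ? GlooTransportMode.NativeGloo : GlooTransportMode.Tcp;
    }
}
```

A backend constructor could call `GlooBackendProbe.Detect()` once and select its implementation from the result, so installing or removing the optional package changes behaviour without any change to calling code.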

Definition of Done

  • [ ] Gloo C++ library built for Windows, Linux, macOS
  • [ ] Native binaries packaged in runtimes structure
  • [ ] GlooSharp project created with P/Invoke bindings
  • [ ] Core collective operations implemented (AllReduce, Broadcast, AllGather, Barrier)
  • [ ] GlooCommunicationBackend updated to detect and use GlooSharp
  • [ ] TCP fallback still works when GlooSharp is not installed
  • [ ] All unit tests pass
  • [ ] GlooSharp.nuspec created and package published to NuGet.org as preview
  • [ ] Documentation complete with examples
  • [ ] No breaking changes to existing AiDotNet API
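For the unit-test item, one option is a pure managed reference model of the collectives that native and TCP results can both be compared against. The sketch below simulates all ranks in a single process, with each inner array standing for one rank's buffer. `CollectiveReference` is a hypothetical helper for tests only, and Barrier is omitted because it has no data result to compare.

```csharp
using System;
using System.Linq;

// Hypothetical in-process reference model; perRank[r] is rank r's input buffer
// and each method returns what every rank should hold afterwards.
public static class CollectiveReference
{
    // AllReduce (sum): every rank ends with the element-wise sum of all inputs.
    public static float[][] AllReduceSum(float[][] perRank)
    {
        int length = perRank[0].Length;
        if (perRank.Any(buffer => buffer.Length != length))
        {
            throw new ArgumentException("All ranks must contribute equal-length buffers.");
        }

        var sum = new float[length];
        foreach (var buffer in perRank)
        {
            for (int i = 0; i < length; i++) sum[i] += buffer[i];
        }
        return perRank.Select(_ => (float[])sum.Clone()).ToArray();
    }

    // AllGather: every rank ends with all inputs concatenated in rank order.
    public static float[][] AllGather(float[][] perRank)
    {
        float[] gathered = perRank.SelectMany(buffer => buffer).ToArray();
        return perRank.Select(_ => (float[])gathered.Clone()).ToArray();
    }

    // Broadcast: every rank ends with a copy of the root rank's buffer.
    public static float[][] Broadcast(float[][] perRank, int root)
    {
        return perRank.Select(_ => (float[])perRank[root].Clone()).ToArray();
    }
}
```

A test could run the same inputs through the native path and through this model and assert that each rank's output matches, within a tolerance for floating-point sums, since a native implementation may add elements in a different order.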

Open Questions

  1. Gloo Version: Which version of Gloo should we target? (Recommend: latest stable)
  2. InfiniBand Support: Should v0.1.0 include ibverbs, or defer to v0.2.0?
  3. CUDA Support: Is GPU-direct Gloo support in scope? (Recommend: future enhancement)
  4. Licensing: Gloo is BSD licensed - confirm compatibility with the AiDotNet license
  5. Maintenance: Who maintains native binary builds when Gloo updates? (CI automation?)

Related Issues

  • Code cleanup: Remove non-existent GlooSharp references from GlooCommunicationBackend.cs
  • Future: Add NCCL-style GPU collectives via separate package

Estimated Effort: 65 story points (significant native library integration work)

Priority: Medium - the TCP implementation is production-ready; Gloo is a performance optimization

Notes:

  • This is a significant undertaking requiring C++ build expertise
  • Alternative: Partner with or sponsor existing Gloo .NET binding projects if they exist
  • Consider creating it as a separate GitHub repository (GlooSharp) to avoid bloating the main AiDotNet repo

ooples avatar Nov 09 '25 18:11 ooples