Create GlooSharp NuGet Package for Native Gloo C++ Integration
User Story
As a distributed training developer using AiDotNet on high-performance compute clusters, I want native Gloo library integration through a GlooSharp NuGet package, so that I can leverage optimized collective operations for CPU and InfiniBand hardware without falling back to TCP-only implementations.
Problem Statement
Current State:
The GlooCommunicationBackend<T> class (src/DistributedTraining/GlooCommunicationBackend.cs) contains detection logic for a "GlooSharp" package that does not exist on NuGet.org:
    // Lines 104-124: Dead code - GlooSharp package does not exist
    var glooType = Type.GetType("Gloo.Context, GlooSharp");
    if (glooType != null)
    {
        // This code path is never reached
        throw new NotImplementedException(
            "GlooCommunicationBackend with Gloo library support is not yet fully implemented...");
    }
Problems with Current Approach:
- Non-Existent Dependency: The code references a "GlooSharp" package that doesn't exist on NuGet.org
- Always Falls Back to TCP: Detection always fails, forcing TCP mode even if the user wants native Gloo
- Misleading Documentation: Code comments suggest Gloo integration exists when it doesn't
- No InfiniBand Support: TCP fallback doesn't support high-performance RDMA networks
- Performance Gap: TCP implementation is production-ready but significantly slower than native Gloo on supported hardware
Impact:
- Users on InfiniBand clusters cannot use native RDMA for collective operations
- High-performance computing (HPC) environments are limited to TCP performance
- No way to leverage Gloo's hardware-specific optimizations, even if the user installs native Gloo
Proposed Solution
Create GlooSharp - a .NET wrapper NuGet package providing P/Invoke bindings to the native Gloo C++ library.
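One practical wrinkle is that Gloo exposes a C++ API, and P/Invoke can only call functions exported with a C ABI, so the package would most likely need a thin `extern "C"` shim between the managed code and Gloo. The sketch below shows roughly what the managed side of such a binding could look like. The library name `gloosharp_native` and every `gloosharp_*` entry point are hypothetical names invented for illustration; none of them exist in Gloo, and the sketch skips rendezvous and device setup entirely.

```csharp
using System;
using System.Runtime.InteropServices;

// Hypothetical P/Invoke surface. "gloosharp_native" and all gloosharp_* functions
// are assumed names for a C shim that would have to be written; they are not Gloo APIs.
internal static class NativeMethods
{
    private const string Library = "gloosharp_native";

    [DllImport(Library, CallingConvention = CallingConvention.Cdecl)]
    internal static extern IntPtr gloosharp_context_create(int rank, int size);

    // Assumed convention: sums `count` floats in place across ranks, returns 0 on success.
    [DllImport(Library, CallingConvention = CallingConvention.Cdecl)]
    internal static extern int gloosharp_allreduce_sum_f32(
        IntPtr context, [In, Out] float[] buffer, int count);

    [DllImport(Library, CallingConvention = CallingConvention.Cdecl)]
    internal static extern void gloosharp_context_destroy(IntPtr context);
}

public static class GlooNative
{
    // Returns true if the native call succeeded, false if the shim is missing
    // or reported an error, so the caller can choose another transport.
    public static bool TryAllReduceSum(float[] values)
    {
        IntPtr context = IntPtr.Zero;
        try
        {
            context = NativeMethods.gloosharp_context_create(0, 1);
            if (context == IntPtr.Zero) return false;
            return NativeMethods.gloosharp_allreduce_sum_f32(context, values, values.Length) == 0;
        }
        catch (DllNotFoundException) { return false; }        // native library not on this machine
        catch (EntryPointNotFoundException) { return false; } // library found, export missing
        finally
        {
            if (context != IntPtr.Zero) NativeMethods.gloosharp_context_destroy(context);
        }
    }
}

public static class Program
{
    public static void Main()
    {
        var data = new float[] { 1f, 2f, 3f };
        Console.WriteLine(GlooNative.TryAllReduceSum(data)
            ? "native AllReduce succeeded"
            : "native shim unavailable; a caller would use the TCP path");
    }
}
```

Because `DllImport` resolves the library lazily on first call, the declarations compile and load on machines without the native binary; the failure only surfaces as an exception at call time, which is what makes an optional dependency workable.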
Design Philosophy
- Optional Dependency: GlooSharp is an optional package users install when they need native Gloo performance
- Platform-Specific Binaries: Include native Gloo libraries for Windows, Linux, and macOS
- Graceful Fallback: If GlooSharp isn't installed, GlooCommunicationBackend continues using TCP mode
- Zero Breaking Changes: Existing code continues working without GlooSharp
- Production Ready: Only ship when P/Invoke bindings are stable and tested
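The optional-dependency and graceful-fallback points above could be realized by keeping the existing reflection probe but turning its result into a transport choice instead of throwing. The following is a minimal sketch under that assumption; `TransportMode` and `GlooDetector` are hypothetical names, not existing AiDotNet types, and only the type name `Gloo.Context, GlooSharp` comes from the current source.

```csharp
using System;

// Hypothetical enum describing which transport the backend would use.
public enum TransportMode { Tcp, NativeGloo }

public static class GlooDetector
{
    // Probes for the optional GlooSharp assembly by assembly-qualified type name.
    // With throwOnError: false, a missing assembly or type yields null rather than
    // an exception, so absence of the package simply selects TCP.
    public static TransportMode Detect(string typeName = "Gloo.Context, GlooSharp")
    {
        var glooType = Type.GetType(typeName, throwOnError: false);
        return glooType != null ? TransportMode.NativeGloo : TransportMode.Tcp;
    }
}

public static class Program
{
    public static void Main()
    {
        // Reports Tcp on any machine where the GlooSharp assembly cannot be loaded.
        Console.WriteLine(GlooDetector.Detect());
    }
}
```

Keeping the probe behind a single method means existing callers see no API change: the backend would consult it once at construction and keep its current TCP code path as the default.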
Definition of Done
- [ ] Gloo C++ library built for Windows, Linux, macOS
- [ ] Native binaries packaged in runtimes structure
- [ ] GlooSharp project created with P/Invoke bindings
- [ ] Core collective operations implemented (AllReduce, Broadcast, AllGather, Barrier)
- [ ] GlooCommunicationBackend updated to detect and use GlooSharp
- [ ] TCP fallback still works when GlooSharp not installed
- [ ] All unit tests pass
- [ ] GlooSharp.nuspec created and package published to NuGet.org as preview
- [ ] Documentation complete with examples
- [ ] No breaking changes to existing AiDotNet API
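For the "runtimes structure" item, NuGet's convention is to place native assets under `runtimes/{rid}/native/` so that the right binary is selected per platform at restore time. A possible package layout is sketched below; the target framework folder, the chosen runtime identifiers, and the native file names are assumptions for illustration, not decisions recorded in this issue.

```text
GlooSharp.nupkg
├── lib/
│   └── net8.0/
│       └── GlooSharp.dll                     (managed P/Invoke wrapper; TFM assumed)
└── runtimes/
    ├── win-x64/native/gloosharp_native.dll          (hypothetical file name)
    ├── linux-x64/native/libgloosharp_native.so      (hypothetical file name)
    └── osx-arm64/native/libgloosharp_native.dylib   (hypothetical file name)
```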
Open Questions
- Gloo Version: Which version of Gloo should we target? (Recommend: latest stable)
- InfiniBand Support: Should v0.1.0 include ibverbs, or defer to v0.2.0?
- CUDA Support: Is GPU-direct Gloo support in scope? (Recommend: future enhancement)
- Licensing: Gloo is BSD licensed - confirm compatibility with the AiDotNet license, including attribution requirements when redistributing native binaries
- Maintenance: Who maintains native binary builds when Gloo updates? (CI automation?)
Related Issues
- Code cleanup: Remove non-existent GlooSharp references from GlooCommunicationBackend.cs
- Future: Add NCCL-style GPU collectives via separate package
Estimated Effort: 65 story points (significant native library integration work)
Priority: Medium - the TCP implementation is production-ready; native Gloo is a performance optimization
Notes:
- This is a significant undertaking requiring C++ build expertise
- Alternative: Partner with or sponsor existing Gloo .NET binding projects if they exist
- Consider creating it as a separate GitHub repository (GlooSharp) to avoid bloating the main AiDotNet repo