TabPFN
[Long-term] Investigate and improve cross-platform prediction consistency
Issue Description
When running TabPFN consistency tests across different platforms (e.g., macOS vs Linux, x86 vs ARM), we've observed significant differences in model predictions.
Current Observations:
- Despite using , regression predictions on the diabetes dataset still show differences:
  - On macOS (ARM):
  - On Linux CI:
  - Difference: ~2.34 (roughly 1.6% relative)
- Classification predictions seem more stable, but still show small variations
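The drift above can be quantified with a simple relative-difference check. A minimal sketch (the prediction values here are illustrative stand-ins for the elided reference values, not the actual test outputs):

```python
import numpy as np

def relative_difference(pred_a: np.ndarray, pred_b: np.ndarray) -> float:
    """Mean relative difference between two prediction arrays."""
    denom = np.maximum(np.abs(pred_b), 1e-12)  # guard against division by zero
    return float(np.mean(np.abs(pred_a - pred_b) / denom))

# Illustrative only: an absolute gap of ~2.34 on predictions of this
# magnitude works out to roughly a 1.6% relative difference.
macos_pred = np.array([146.30])
linux_pred = np.array([148.64])
print(f"{relative_difference(macos_pred, linux_pred):.3%}")
```

A helper like this could be shared by the consistency tests so that "difference" always means the same metric on every platform.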
Impact:
- Makes it difficult to have reproducible research/benchmarks across platforms
- Requires platform-specific consistency tests (as implemented in PR #217)
- Could affect production deployments across different infrastructures
Potential Causes:
- Different CPU architectures (x86 vs. ARM)
- Different BLAS/LAPACK implementations
- OS-specific optimizations
- Compiler-specific floating-point optimizations
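To narrow down which of these factors is in play on a given machine, it helps to log the platform and the linear-algebra backend alongside the predictions. A small diagnostic sketch using only standard tooling:

```python
import platform

import numpy as np

# Record the factors most likely to explain cross-platform drift.
print("machine:", platform.machine())          # e.g. 'arm64' vs 'x86_64'
print("system:", platform.system())            # e.g. 'Darwin' vs 'Linux'
print("python:", platform.python_version())
print("numpy:", np.__version__)

# Prints which BLAS/LAPACK implementation NumPy was built against,
# one of the suspected sources of divergence.
np.show_config()
```

Attaching this output to consistency-test failures would make it much easier to correlate a given drift with a specific architecture or BLAS build.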
Suggested Solutions to Investigate:
- More aggressive precision control beyond sklearn's 16-decimal option
- Implementation of deterministic mode that sacrifices some performance for better consistency
- Platform detection with environment-specific reference values
- Custom normalization/scaling approaches that are more robust to platform differences
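As a first pass at the precision-control idea, predictions could be rounded to a fixed number of decimals before comparison, so that sub-tolerance platform noise cancels out. A hedged sketch (`decimals=4` and the function name are illustrative choices, not a TabPFN default; values that land exactly on a rounding boundary can still diverge):

```python
import numpy as np

def canonicalize_predictions(preds: np.ndarray, decimals: int = 4) -> np.ndarray:
    """Round predictions to a fixed precision before comparing across platforms.

    decimals=4 is an illustrative choice; noise below 1e-4 cancels out,
    but boundary-straddling values can still differ between platforms.
    """
    return np.round(np.asarray(preds, dtype=np.float64), decimals=decimals)

print(canonicalize_predictions(np.array([0.123456, 0.123449])))
```

This is cheap, but it only masks noise smaller than the chosen precision; a true deterministic mode would still have to address the underlying BLAS and compiler differences.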
Related PR:
PR #217 worked around this by making consistency tests platform-specific, but we should investigate a more fundamental solution.
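An alternative to per-platform reference values would be a single reference compared with an explicit relative tolerance chosen to cover the observed ~1.6% drift. A minimal sketch (the reference value and tolerance are illustrative, not values from PR #217):

```python
import numpy as np

# Illustrative reference and tolerance; a 2% rtol comfortably covers the
# observed ~1.6% cross-platform drift while still catching real regressions.
REFERENCE_PREDICTION = 146.30  # hypothetical stored reference value
RTOL = 0.02

def check_consistency(pred: float) -> bool:
    """True if a prediction is within RTOL of the stored reference."""
    return bool(np.isclose(pred, REFERENCE_PREDICTION, rtol=RTOL))

assert check_consistency(148.64)    # cross-platform drift passes
assert not check_consistency(160.0) # a genuine regression still fails
```

The trade-off is sensitivity: a tolerance wide enough to absorb platform noise will also absorb small genuine regressions, which is why PR #217 opted for platform-specific references instead.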
Priority:
Medium - this does not break functionality, but it affects reproducibility across platforms