TabPFN
[Long-term] Investigate and improve cross-platform prediction consistency
Issue Description
When running TabPFN consistency tests across different platforms (e.g., macOS vs Linux, x86 vs ARM), we've observed significant differences in model predictions.
Current Observations:
- Despite using , regression predictions on the diabetes dataset still show differences:
  - On macOS (ARM):
  - On Linux CI:
  - Difference: ~2.34 (roughly 1.6% relative)
- Classification predictions seem more stable, but still show small variations
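The drift above can be quantified with a simple relative-difference check. A minimal sketch (the prediction values here are illustrative stand-ins for the elided reference values, not the actual test outputs):

```python
import numpy as np

def relative_difference(pred_a: np.ndarray, pred_b: np.ndarray) -> float:
    """Mean relative difference between two prediction arrays."""
    denom = np.maximum(np.abs(pred_b), 1e-12)  # guard against division by zero
    return float(np.mean(np.abs(pred_a - pred_b) / denom))

# Illustrative only: an absolute gap of ~2.34 on predictions of this
# magnitude works out to roughly a 1.6% relative difference.
macos_pred = np.array([146.30])
linux_pred = np.array([148.64])
print(f"{relative_difference(macos_pred, linux_pred):.3%}")
```

A helper like this could be shared by the consistency tests so that "difference" always means the same metric on every platform.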
Impact:
- Makes it difficult to have reproducible research/benchmarks across platforms
- Requires platform-specific consistency tests (as implemented in PR #217)
- Could affect production deployments across different infrastructures
Potential Causes:
- Different CPU architectures (x86 vs. ARM)
- Different BLAS/LAPACK implementations
- OS-specific optimizations
- Compiler-specific floating-point optimizations
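To narrow down which of these factors is in play on a given machine, it helps to log the platform and the linear-algebra backend alongside the predictions. A small diagnostic sketch using only standard tooling:

```python
import platform

import numpy as np

# Record the factors most likely to explain cross-platform drift.
print("machine:", platform.machine())          # e.g. 'arm64' vs 'x86_64'
print("system:", platform.system())            # e.g. 'Darwin' vs 'Linux'
print("python:", platform.python_version())
print("numpy:", np.__version__)

# Prints which BLAS/LAPACK implementation NumPy was built against,
# one of the suspected sources of divergence.
np.show_config()
```

Attaching this output to consistency-test failures would make it much easier to correlate a given drift with a specific architecture or BLAS build.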
Suggested Solutions to Investigate:
- More aggressive precision control beyond sklearn's 16-decimal option
- Implementation of deterministic mode that sacrifices some performance for better consistency
- Platform detection with environment-specific reference values
- Custom normalization/scaling approaches that are more robust to platform differences
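As a first pass at the precision-control idea, predictions could be rounded to a fixed number of decimals before comparison, so that sub-tolerance platform noise cancels out. A hedged sketch (`decimals=4` and the function name are illustrative choices, not a TabPFN default; values that land exactly on a rounding boundary can still diverge):

```python
import numpy as np

def canonicalize_predictions(preds: np.ndarray, decimals: int = 4) -> np.ndarray:
    """Round predictions to a fixed precision before comparing across platforms.

    decimals=4 is an illustrative choice; noise below 1e-4 cancels out,
    but boundary-straddling values can still differ between platforms.
    """
    return np.round(np.asarray(preds, dtype=np.float64), decimals=decimals)

print(canonicalize_predictions(np.array([0.123456, 0.123449])))
```

This is cheap, but it only masks noise smaller than the chosen precision; a true deterministic mode would still have to address the underlying BLAS and compiler differences.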
Related PR:
PR #217 worked around this by making consistency tests platform-specific, but we should investigate a more fundamental solution.
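An alternative to per-platform reference values would be a single reference compared with an explicit relative tolerance chosen to cover the observed ~1.6% drift. A minimal sketch (the reference value and tolerance are illustrative, not values from PR #217):

```python
import numpy as np

# Illustrative reference and tolerance; a 2% rtol comfortably covers the
# observed ~1.6% cross-platform drift while still catching real regressions.
REFERENCE_PREDICTION = 146.30  # hypothetical stored reference value
RTOL = 0.02

def check_consistency(pred: float) -> bool:
    """True if a prediction is within RTOL of the stored reference."""
    return bool(np.isclose(pred, REFERENCE_PREDICTION, rtol=RTOL))

assert check_consistency(148.64)    # cross-platform drift passes
assert not check_consistency(160.0) # a genuine regression still fails
```

The trade-off is sensitivity: a tolerance wide enough to absorb platform noise will also absorb small genuine regressions, which is why PR #217 opted for platform-specific references instead.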
Priority:
Medium - this does not break functionality, but it affects reproducibility across platforms