# Fix PCA NaN handling for slightly better practice
## Summary
- Fix Python PCA to use column means for missing votes (instead of 0), matching the slightly healthier Clojure behavior: biasing toward the average rather than toward zero. Still not ideal, and we'll switch to a better approach later, but this is a first step.
- Add regression tests comparing Python PCA output against Clojure math_blob
- Add unit tests verifying NaN handling uses column means
## Background
Investigation revealed that Python and Clojure PCA implementations differed significantly:
- Root cause: Python filled NaN with 0, while Clojure fills with the column mean
- Impact: VW dataset had 16° and 52° angle differences in PC1/PC2
## Changes
### Core fix (pca.py)
- `pca_project_dataframe`: replace NaN with column means instead of 0
- `powerit_pca`: raise `ValueError` if NaN values reach it (enforces preprocessing)
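A minimal sketch of the intended preprocessing (illustrative names, not the actual `pca.py` signatures):

```python
import numpy as np
import pandas as pd

def fill_missing_with_column_means(votes: pd.DataFrame) -> pd.DataFrame:
    """Replace missing votes with each comment's mean vote.

    Imputing the column mean (rather than 0) biases absent votes
    toward the average opinion instead of toward a literal 0 vote.
    """
    # fillna with a Series indexed by column names fills per column
    return votes.fillna(votes.mean(axis=0))

def assert_no_nan(matrix: np.ndarray) -> None:
    """Guard mirroring the ValueError added to powerit_pca."""
    if np.isnan(matrix).any():
        raise ValueError("NaN values must be imputed before PCA")
```
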
### New tests
- `test_pca_unit.py`: two tests verifying column-mean NaN handling
- `test_legacy_clojure_regression.py::test_pca_components_match_clojure`: compares Python vs Clojure PCA output
## Results after fix
| Dataset | PC1 Correlation | PC1 Angle | PC2 Correlation | PC2 Angle |
|---|---|---|---|---|
| VW | 1.0000 | 0° | -1.0000 | 0° |
| Biodiversity | 0.9920 | 7.1° | -0.9977 | 3.6° |
The small biodiversity discrepancy is within numerical tolerance for power iteration on high-sparsity (82%) data.
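For reference, the angle metric in the table can be computed along these lines (a sketch, not the actual test code; PCA components are only defined up to sign, hence the absolute value):

```python
import numpy as np

def component_angle_deg(u: np.ndarray, v: np.ndarray) -> float:
    """Angle between two PCA components, ignoring the sign ambiguity."""
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    cos = abs(float(np.dot(u, v)))  # sign-invariant
    return float(np.degrees(np.arccos(np.clip(cos, 0.0, 1.0))))
```

This also explains how a correlation of -1.0 can coexist with a 0° angle (as for VW's PC2): the components are identical up to a sign flip.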
## Test plan
- [x] `pytest tests/test_pca_unit.py -k nan_handling` - NaN handling unit tests pass
- [x] `pytest tests/test_legacy_clojure_regression.py::test_pca_components_match_clojure` - Clojure comparison passes
- [ ] Full test suite
🤖 Generated with Claude Code
Note: next steps will be to:
- [x] fix the test runner that can't load -- something probably changed under our feet :)
- [ ] try comparison on other convos (might be useful to add that possibility in the test_regression, in a clean optional way that does not slow down the CI...)
- [ ] find a way to get the vote matrix exactly as Clojure uses it, to check whether there's any difference in the input to the PCA, given the small error on biodiversity. I want to check whether it's really just numeric noise in the computation, or something else.
- [ ] update golden record
- [ ] remove the now-dead code that hacks a scaling to match clojure vector norms
Don't mind the failure of the Python/Python regression tests: the golden record was computed before the fix and needs to be updated.
However:
- 1/ Gosh, the creation of the fixture in `tests/test_legacy_clojure_regression.py` is slow... Gotta be a way to speed this up. Polars instead of Pandas? Need to add more timers everywhere.
- 2/ The Clojure PCA comparison, which passes on VW and on biodiversity, fails on BG2050:
```
tests/test_legacy_clojure_regression.py::test_pca_components_match_clojure[bg2050]
[bg2050] Testing PCA components match Clojure...
PC1 correlation: 0.994594
PC1 angle difference: 44.89°
PC2 correlation: -0.983012
PC2 angle difference: 53.64°
FAILED
========================================================== FAILURES ==========================================================
FAILURE: check 44.889375337151506 <= 10.0: PC1 angle difference should be ≤10° (got 44.89°)
tests/test_legacy_clojure_regression.py:344 in test_pca_components_match_clojure() -> check.less_equal(angle_deg, 10.0,
FAILURE: check 53.64306927244186 <= 10.0: PC2 angle difference should be ≤10° (got 53.64°)
------------------------------------------------------------
Failed Checks: 2
tests/test_legacy_clojure_regression.py::test_pca_components_match_clojure[bg2050]:0: check 44.889375337151506 <= 10.0: PC1 angle difference should be ≤10° (got 44.89°)
```
So there's something still amiss. A sign flip can't explain the 44.89° angle difference on PC1... or the 53.64° on PC2!
# Investigation: Why PCA regression tests fail for large conversations (e.g., BG2050)
## Summary
The Python PCA implementation matches Clojure for small conversations but diverges significantly for large ones. After investigation, we found this is expected behavior due to different algorithms being used—and we're now confident that dropping the Clojure stochastic approach is the right call.
## Root Cause
The Clojure code has a branching condition (conversation.clj:766-796): conversations exceeding 5,000 comments or 10,000 participants switch from exact PCA to a stochastic mini-batch approximation.
| Dataset | Comments | Algorithm Used |
|---|---|---|
| vw, biodiversity | < 5,000 | Exact PCA (same as Python) ✅ |
| BG2050 | 7,757 | Stochastic mini-batch PCA ❌ |
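The branching can be paraphrased in Python as follows (a paraphrase of the Clojure logic using the thresholds above, not a verbatim port):

```python
def choose_pca_algorithm(n_comments: int, n_participants: int) -> str:
    """Paraphrase of the Clojure branching (conversation.clj:766-796)."""
    if n_comments > 5_000 or n_participants > 10_000:
        return "stochastic-minibatch"
    return "exact"
```
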
The stochastic algorithm is clever—it uses growing-size mini-batches with a learning rate, which is exactly the pattern found in convergence proofs and optimal algorithms for stochastic Monte Carlo methods. However, I could not find any trace in the codebase of comparisons against the exact algorithm for accuracy validation.
## The Scary-Looking 50° Angle Difference (and why it's not actually scary)
Initial tests showed a ~50° angle difference between Python and Clojure PCA components for BG2050, which looked alarming. But we also observed 0.99 correlation between the same components—seemingly contradictory.
Resolution: The stochastic PCA doesn't re-normalize its components after blending iterations. The mini-batch update does:
```
new_component = 0.99 * old_component + 0.01 * batch_component
```
A weighted average of unit vectors is not a unit vector. After 100 iterations, the Clojure vectors have norms ~0.6-0.7 instead of 1.0.
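The shrinkage is easy to reproduce in isolation (a toy illustration with synthetic noisy batch estimates, not the actual Clojure update):

```python
import numpy as np

rng = np.random.default_rng(0)
component = np.array([1.0, 0.0])  # start from a unit vector
for _ in range(100):
    # Simulate a noisy mini-batch estimate of the component (unit length)
    batch = component + rng.normal(scale=0.5, size=2)
    batch /= np.linalg.norm(batch)
    # Blend without re-normalizing, as the Clojure update does
    component = 0.99 * component + 0.01 * batch

norm = np.linalg.norm(component)
assert norm < 1.0  # the blended vector is no longer unit length
```
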
My angle computation assumed unit vectors (as exact PCA produces), which gave misleading results. Once properly normalized:
- The vectors point in nearly the same direction (high correlation confirmed)
- The ~40% norm difference would cause non-uniform scaling of the opinion map—not ideal, but not catastrophic
- BG2050 results weren't wrong, just slightly inaccurate
## Decision
Drop the stochastic PCA codepath.
The exact PCA in Python (currently custom, but soon backed by efficient NumPy) will handle what used to be computationally challenging for Clojure. If we ever need to optimize for truly massive conversations, we can revisit with:
- Proper comparison tests between algorithms
- Battle-tested libraries (scikit-learn's incremental PCA, etc.)
- Actual validation that approximations match the exact solution
Now that we have verified the PCA algorithm and trust it, next steps:
- [ ] Re-record golden records with the fixed PCA we have. Remember from previous commits: in addition to the Clojure comparison discussed above, we had also fixed how the Python PCA handles NaN, and verified that this tracks Clojure for small convos. NB: for large convos (not in the committed real_data), changing the PCA result will of course change the clustering. But that clustering isn't really checked for regression wrt Clojure anyway (yet).
- [ ] Optional: modify the legacy clojure comparison to be more lenient on large convos. Not ideal, but allows it to pass on the large convos. Then again, those are not committed to this repo, and are used for debug only.
- [ ] Then move to sklearn PCA, and see how much faster that is ;-)
# Parallelized test execution across datasets
Added xdist_group markers to enable parallel test execution by dataset. This significantly speeds up local testing when running against many large conversations.
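For reference, grouping tests by dataset with pytest-xdist looks roughly like this (a sketch; the actual tests are parametrized differently), run with `pytest -n auto --dist loadgroup`:

```python
import pytest

# Tests sharing an xdist_group name run on the same worker, so each
# dataset's expensive fixture is built once per group, while different
# datasets run in parallel across workers.
@pytest.mark.xdist_group(name="bg2050")
def test_pca_components_match_clojure_bg2050():
    ...

@pytest.mark.xdist_group(name="vw")
def test_pca_components_match_clojure_vw():
    ...
```
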
Local benchmarks (Apple Silicon M1, 64GB RAM):
| Dataset | Participants | Comments | Exact PCA time |
|---|---|---|---|
| bg2050 | 6,624 | 7,730 | 15s |
| EB | 4,222 | 7,728 | 11s |
| P | 8,308 | 8,048 | 53s |
Full test suite with all local datasets: ~3 min (down from 10-12 min sequential).
Notes:
- The speedup matters less on CI, which only uses `vw` and `biodiversity` to keep runner load light
- For local debugging/algorithm development with large datasets, the parallel execution is very useful
- Memory footprint is reasonable (e.g. bg2050 ~400MB for full matrix, to which we should add PCA working space -- but since we're asking for only 2 components this should be reasonable)
- Current implementation uses manual NumPy-based PCA; switching to scikit-learn could leverage optimized BLAS/LAPACK routines for further speedup
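The ~400MB figure is just the size of the dense float64 matrix, using the bg2050 dimensions from the table above:

```python
participants, comments = 6_624, 7_730
matrix_bytes = participants * comments * 8  # 8 bytes per float64 entry
print(matrix_bytes / 1e6)  # ~409.6 MB for the dense vote matrix
```
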
The parallel execution infrastructure is now available if we want to enable it on CI runners in the future.