
Fix PCA NaN handling for slightly better practice

Open jucor opened this issue 3 months ago • 5 comments

Summary

  • Fix the Python PCA to use column means for missing votes (instead of 0), matching the slightly healthier Clojure behavior: imputed values are biased towards the column average rather than towards zero. Still not ideal, and we'll adopt a better imputation later, but for now it's a first step.
  • Add regression tests comparing Python PCA output against Clojure math_blob
  • Add unit tests verifying NaN handling uses column means

Background

Investigation revealed that the Python and Clojure PCA implementations differed significantly:

  • Root cause: Python filled NaN with 0; Clojure fills with the column mean
  • Impact: the VW dataset showed 16° and 52° angle differences in PC1/PC2

Changes

Core fix (pca.py)

  • pca_project_dataframe: Replace NaN with column means instead of 0
  • powerit_pca: Add ValueError if NaN values reach it (enforces preprocessing)
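A minimal sketch of both changes, assuming a pandas DataFrame of votes (function names here are illustrative, not the actual pca.py signatures):

```python
import numpy as np
import pandas as pd

def fill_missing_with_column_means(votes: pd.DataFrame) -> pd.DataFrame:
    """Replace NaN (missing votes) with each comment's mean vote.

    This biases imputed values towards the column average rather than
    towards zero, matching the Clojure behavior described above.
    """
    return votes.fillna(votes.mean())

def assert_no_nan(matrix: np.ndarray) -> None:
    """Guard placed in front of the power-iteration PCA."""
    if np.isnan(matrix).any():
        raise ValueError("NaN values must be imputed before PCA")
```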

New tests

  • test_pca_unit.py: Two tests verifying column-mean NaN handling
  • test_legacy_clojure_regression.py: test_pca_components_match_clojure comparing Python vs Clojure PCA
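A sketch of the shape of the NaN-guard unit test (illustrative and assuming pytest; the real tests live in test_pca_unit.py, and the guard function name here is a stand-in):

```python
import numpy as np
import pytest

def powerit_pca_guard(matrix: np.ndarray) -> None:
    # Stand-in for the ValueError check added to powerit_pca.
    if np.isnan(matrix).any():
        raise ValueError("NaN values must be imputed before PCA")

def test_powerit_pca_rejects_nan():
    # Any NaN slipping past preprocessing should fail loudly.
    with pytest.raises(ValueError):
        powerit_pca_guard(np.array([[1.0, np.nan], [0.0, 1.0]]))
```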

Results after fix

| Dataset | PC1 Correlation | PC1 Angle | PC2 Correlation | PC2 Angle |
| --- | --- | --- | --- | --- |
| VW | 1.0000 | | -1.0000 | |
| Biodiversity | 0.9920 | 7.1° | -0.9977 | 3.6° |

The small biodiversity discrepancy is within numerical tolerance for power iteration on high-sparsity (82%) data.
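For reference, the comparison metrics in the table can be computed along these lines (a sketch, not necessarily the test's exact implementation; note the sign alignment, since PCA components are only defined up to sign):

```python
import numpy as np

def compare_components(py_pc: np.ndarray, clj_pc: np.ndarray) -> tuple[float, float]:
    """Return (Pearson correlation, angle in degrees) between two PC vectors.

    Both vectors are normalized first, so the angle is meaningful even if
    one implementation does not return unit vectors.
    """
    a = py_pc / np.linalg.norm(py_pc)
    b = clj_pc / np.linalg.norm(clj_pc)
    corr = float(np.corrcoef(a, b)[0, 1])
    if np.dot(a, b) < 0:
        b = -b  # align sign-flipped components before measuring the angle
    # clip guards against floating-point drift just outside [-1, 1]
    angle = float(np.degrees(np.arccos(np.clip(np.dot(a, b), -1.0, 1.0))))
    return corr, angle
```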

Test plan

  • [x] pytest tests/test_pca_unit.py -k nan_handling - NaN handling unit tests pass
  • [x] pytest tests/test_legacy_clojure_regression.py::test_pca_components_match_clojure - Clojure comparison passes
  • [ ] Full test suite

🤖 Generated with Claude Code

jucor avatar Nov 25 '25 15:11 jucor

Note: next steps will be to:

  • [x] fix the test runner that can't load -- something probably changed under our feet :)
  • [ ] try comparison on other convos (might be useful to add that possibility in the test_regression, in a clean optional way that does not slow down the CI...)
  • [ ] find a way to get the vote matrix as Clojure uses it, to check if there's any difference to the input of the PCA, given there's a small error on biodiversity. I want to check if it's indeed just numeric in the computation, or if there's something else.
  • [ ] update golden record.
  • [ ] remove the now-dead code that hacks a scaling to match clojure vector norms

jucor avatar Nov 25 '25 16:11 jucor

Don't mind the failure of the python/python regression tests: the golden record was computed before the fix, and needs to be updated.

However:

  • 1/ Gosh, the creation of the fixture in tests/test_legacy_clojure_regression.py is slow... Gotta be a way to speed this up. Polars instead of Pandas? Need to add more timers everywhere.
  • 2/ The Clojure PCA comparison, which passes on VW and on biodiversity, fails on BG2050:
tests/test_legacy_clojure_regression.py::test_pca_components_match_clojure[bg2050] 
[bg2050] Testing PCA components match Clojure...
  PC1 correlation: 0.994594
  PC1 angle difference: 44.89°
  PC2 correlation: -0.983012
  PC2 angle difference: 53.64°
FAILED

========================================================== FAILURES ==========================================================

FAILURE: check 44.889375337151506 <= 10.0: PC1 angle difference should be ≤10° (got 44.89°)
tests/test_legacy_clojure_regression.py:344 in test_pca_components_match_clojure() -> check.less_equal(angle_deg, 10.0,

FAILURE: check 53.64306927244186 <= 10.0: PC2 angle difference should be ≤10° (got 53.64°)
------------------------------------------------------------
Failed Checks: 2
tests/test_legacy_clojure_regression.py::test_pca_components_match_clojure[bg2050]:0: check 44.889375337151506 <= 10.0: PC1 angle difference should be ≤10° (got 44.89°)

So there's something still amiss. A sign flip can't explain the ~45° angle difference on PC1... or the 53.64° on PC2!

jucor avatar Nov 25 '25 22:11 jucor

Investigation: Why PCA regression tests fail for large conversations (e.g., BG2050)

Summary

The Python PCA implementation matches Clojure for small conversations but diverges significantly for large ones. After investigation, we found this is expected behavior due to different algorithms being used—and we're now confident that dropping the Clojure stochastic approach is the right call.

Root Cause

The Clojure code has a branching condition (conversation.clj:766-796): conversations exceeding 5,000 comments or 10,000 participants switch from exact PCA to a stochastic mini-batch approximation.

| Dataset | Comments | Algorithm used |
| --- | --- | --- |
| vw, biodiversity | < 5,000 | Exact PCA (same as Python) ✅ |
| BG2050 | 7,757 | Stochastic mini-batch PCA ❌ |
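The Clojure branching boils down to a size check, roughly (a Python paraphrase of the condition, with thresholds as stated above, not the actual Clojure code):

```python
def pca_variant(n_comments: int, n_participants: int) -> str:
    """Mirror the size check in conversation.clj:766-796."""
    if n_comments > 5_000 or n_participants > 10_000:
        return "stochastic mini-batch PCA"
    return "exact PCA"
```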

The stochastic algorithm is clever—it uses growing-size mini-batches with a learning rate, which is exactly the pattern found in convergence proofs and optimal algorithms for stochastic Monte Carlo methods. However, I could not find any trace in the codebase of comparisons against the exact algorithm for accuracy validation.

The Scary-Looking 50° Angle Difference (and why it's not actually scary)

Initial tests showed a ~50° angle difference between Python and Clojure PCA components for BG2050, which looked alarming. But we also observed 0.99 correlation between the same components—seemingly contradictory.

Resolution: The stochastic PCA doesn't re-normalize its components after blending iterations. The mini-batch update does:

new_component = 0.99 * old_component + 0.01 * batch_component

A weighted average of unit vectors is not a unit vector. After 100 iterations, the Clojure vectors have norms ~0.6-0.7 instead of 1.0.
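The norm decay is easy to reproduce: repeatedly blending a unit vector with noisy unit-norm "batch" estimates, without re-normalizing, shrinks its norm well below 1 while leaving the direction largely intact (a self-contained illustration with made-up noise, not the Clojure code):

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 50

true_direction = rng.normal(size=dim)
true_direction /= np.linalg.norm(true_direction)

component = true_direction.copy()  # start as a unit vector

for _ in range(100):
    # noisy unit-norm "batch" estimate of the same direction
    batch = true_direction + 0.5 * rng.normal(size=dim)
    batch /= np.linalg.norm(batch)
    # the blending update from above, with no re-normalization
    component = 0.99 * component + 0.01 * batch

norm = np.linalg.norm(component)
alignment = np.dot(component / norm, true_direction)
print(norm, alignment)  # norm well below 1.0, alignment still close to 1.0
```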

My angle computation assumed unit vectors (as exact PCA produces), which gave misleading results. Once properly normalized:

  • The vectors point in nearly the same direction (high correlation confirmed)
  • The ~40% norm difference would cause non-uniform scaling of the opinion map—not ideal, but not catastrophic
  • BG2050 results weren't wrong, just slightly inaccurate

Decision

Drop the stochastic PCA codepath.

The exact PCA in Python (currently custom, but soon backed by efficient NumPy) will handle what used to be computationally challenging for Clojure. If we ever need to optimize for truly massive conversations, we can revisit with:

  • Proper comparison tests between algorithms
  • Battle-tested libraries (scikit-learn's incremental PCA, etc.)
  • Actual validation that approximations match the exact solution

jucor avatar Nov 29 '25 00:11 jucor

Now that we have verified the PCA algorithm and trust it, next steps:

  • [ ] Re-record golden records with the fixed PCA we have. Remember from previous commits: in addition to the Clojure comparison discussed above, we had also fixed how the Python PCA handles NaN, and verified that this tracks Clojure for small convos. NB: for large convos (not in the committed real_data), changing the PCA result will of course change the clustering. But that clustering isn't really checked for regression wrt Clojure anyway (yet).
  • [ ] Optional: modify the legacy clojure comparison to be more lenient on large convos. Not ideal, but allows it to pass on the large convos. Then again, those are not committed to this repo, and are used for debug only.
  • [ ] Then move to sklearn PCA, and see how much faster that is ;-)
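A hedged sketch of what the sklearn replacement might look like (mean imputation plus sklearn.decomposition.PCA; the function name and parameter choices are assumptions, not the planned implementation):

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA

def project_votes(votes: pd.DataFrame, n_components: int = 2):
    """Impute missing votes with column means, then run exact PCA via sklearn."""
    filled = votes.fillna(votes.mean()).to_numpy()
    pca = PCA(n_components=n_components, svd_solver="full")  # exact solver
    coords = pca.fit_transform(filled)  # participant positions in the opinion map
    return coords, pca.components_      # components_ rows are unit-norm PCs
```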

jucor avatar Nov 29 '25 00:11 jucor

Parallelized test execution across datasets

Added xdist_group markers to enable parallel test execution by dataset. This significantly speeds up local testing when running against many large conversations.
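For reference, the grouping uses pytest-xdist's xdist_group marker, roughly like this (the dataset name and test name here are illustrative); tests sharing a group run on the same worker when pytest is invoked with `-n auto --dist loadgroup`:

```python
import pytest

# All tests for one dataset share a group, so they run serially on one
# xdist worker while other datasets proceed in parallel.
@pytest.mark.xdist_group(name="bg2050")
def test_pca_components_match_clojure_bg2050():
    ...
```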

Local benchmarks (Apple Silicon M1, 64GB RAM):

| Dataset | Participants | Comments | Exact PCA time |
| --- | --- | --- | --- |
| bg2050 | 6,624 | 7,730 | 15s |
| EB | 4,222 | 7,728 | 11s |
| P | 8,308 | 8,048 | 53s |

Full test suite with all local datasets: ~3 min (down from 10-12 min sequential).

Notes:

  • The speedup matters less on CI, which only uses vw and biodiversity to keep runner load light
  • For local debugging/algorithm development with large datasets, the parallel execution is very useful
  • Memory footprint is reasonable (e.g. bg2050 ~400MB for full matrix, to which we should add PCA working space -- but since we're asking for only 2 components this should be reasonable)
  • Current implementation uses manual NumPy-based PCA; switching to scikit-learn could leverage optimized BLAS/LAPACK routines for further speedup
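The ~400MB figure checks out from the matrix dimensions, assuming a dense float64 matrix:

```python
participants, comments = 6_624, 7_730        # bg2050, from the benchmark table
bytes_needed = participants * comments * 8   # 8 bytes per float64 cell
print(f"{bytes_needed / 1e6:.0f} MB")        # ≈ 410 MB for the full vote matrix
```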

The parallel execution infrastructure is now available if we want to enable it on CI runners in the future.

jucor avatar Nov 29 '25 01:11 jucor