
[FEA] Implement `adjusted_mutual_info_score`

Open jpintar opened this issue 3 months ago • 11 comments

It would be great if adjusted_mutual_info_score were made available in cuML, especially since there is some evidence that it is preferable to the ARI in the presence of unbalanced cluster sizes, but it is slower to calculate.

jpintar · Oct 02 '25

Looking at https://github.com/scikit-learn/scikit-learn/blob/86906384af8ff337557486d262e88ef3c83c824f/sklearn/metrics/cluster/_supervised.py#L946, I think we already have a few of the components used to build this. I haven't traced every helper (or the helpers those helpers call), but most of the work is probably input validation and handling edge cases when computing the score.

betatim · Oct 03 '25

@jpintar Thank you very much for the suggestion.

csadorf · Oct 03 '25

Hi @jpintar, I’d like to work on implementing adjusted_mutual_info_score in cuML. Looking forward to contributing!

SumitdevelopAI · Oct 08 '25

@SumitdevelopAI I've assigned you to the issue. I'd recommend that you propose an implementation strategy before jumping in.

csadorf · Oct 08 '25

Thanks @csadorf — I’ll take this on. Below is my proposed implementation strategy (staged, reviewable, and designed to match sklearn semantics while enabling a later GPU-optimized EMI).

Proposed implementation strategy for adjusted_mutual_info_score in cuML

High-level goal

Implement cuml.metrics.cluster.adjusted_mutual_info_score(labels_true, labels_pred, *, average_method='arithmetic', handle=None, convert_dtype=True) that is behaviorally compatible with sklearn.metrics.adjusted_mutual_info_score, reusing cuML building blocks and providing a correctness-first CPU reference EMI, followed by a GPU-optimized EMI in a follow-up PR.
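For illustration, a call under this proposed signature might look like the following (the function does not exist in cuML yet; the API is the one proposed above):

```python
# Hypothetical usage of the proposed API; nothing here is merged yet.
import cupy as cp
from cuml.metrics.cluster import adjusted_mutual_info_score  # proposed

labels_true = cp.array([0, 0, 1, 1, 2, 2])
labels_pred = cp.array([0, 0, 1, 2, 2, 2])
score = adjusted_mutual_info_score(labels_true, labels_pred,
                                   average_method="arithmetic")
```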


Stage A — Correctness-first (single small PR)

Goal: deliver correct results (match sklearn) with minimal reviewer friction.

What I will add

  • python/cuml/metrics/cluster/adjusted_mutual_info_score.py — Python wrapper implementing the public API and input validation (match sklearn check_clusterings semantics).
  • python/cuml/metrics/cluster/_expected_mutual_info_ref.py — CPU reference implementation of EMI using the hypergeometric per-cell recurrence (lgamma init + iterative recurrence). This is correctness-first and easy to review.
  • Unit tests: python/cuml/metrics/cluster/tests/test_adjusted_mutual_info.py comparing outputs to sklearn.metrics.adjusted_mutual_info_score for deterministic edge cases and randomized tests (imbalanced clusters included).
  • Minimal docs / docstring and a short example.

Implementation notes

  • Reuse cuML contingency-building and mutual_info_score/entropy helpers to obtain C, a, b, N and MI/H(U)/H(V).
  • EMI computed on CPU initially — exact per-cell sums (see the sketch after this list) using:
    • nmin = max(0, ai + bj - N), nmax = min(ai, bj)
    • P(n+1) = P(n) * ((ai - n)*(bj - n)) / ((n+1)*(N - ai - bj + n + 1))
    • accumulate E_MI = Σ_{i,j} Σ_n f(n) * P(n) where f(n) = per-cell MI term.
  • Numeric: use float64, compute initial P via lgamma where needed, and adopt sklearn’s small-value clipping for numerator/denominator to avoid 0/0.
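To make the recurrence concrete, here is a reference-only NumPy/SciPy sketch of what the proposed _expected_mutual_info_ref.py helper could look like. It assumes a small dense contingency matrix C; the names a, b, N follow the notation in this comment, not any existing cuML helper.

```python
# Reference-only sketch of the per-cell EMI sum described above.
import numpy as np
from scipy.special import gammaln


def expected_mutual_info_ref(C):
    C = np.asarray(C, dtype=np.float64)
    a = C.sum(axis=1)   # row marginals a_i
    b = C.sum(axis=0)   # column marginals b_j
    N = C.sum()
    emi = 0.0
    for ai in a:
        for bj in b:
            n_min = max(1.0, ai + bj - N)   # n = 0 contributes nothing
            n_max = min(ai, bj)
            if n_max < n_min:               # empty marginal: skip the cell
                continue
            # initial hypergeometric probability P(n_min) via log-gamma
            log_p = (gammaln(ai + 1) + gammaln(bj + 1)
                     + gammaln(N - ai + 1) + gammaln(N - bj + 1)
                     - gammaln(N + 1) - gammaln(n_min + 1)
                     - gammaln(ai - n_min + 1) - gammaln(bj - n_min + 1)
                     - gammaln(N - ai - bj + n_min + 1))
            p = np.exp(log_p)
            n = n_min
            while n <= n_max:
                f = (n / N) * np.log(N * n / (ai * bj))   # per-cell MI term f(n)
                emi += f * p
                # recurrence: P(n+1) = P(n) * (ai-n)(bj-n) / ((n+1)(N-ai-bj+n+1))
                p *= (ai - n) * (bj - n) / ((n + 1) * (N - ai - bj + n + 1))
                n += 1.0
    return emi
```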

Why Stage A first: reviewers can validate correctness easily; users get AMI immediately. Performance work is staged later.


Stage B — GPU EMI (follow-up PR)

Goal: port EMI heavy-lift to CUDA/C++ for speed while preserving numerical semantics.

Candidate kernel designs

  • One-block-per-nonempty-cell: each block computes P(n) recurrence for that cell and reduces f(n)*P(n).
  • Row-parallel scheme: distribute columns across threads for rows with many columns.
  • Use float64 accumulators and reduce with warp/shmem reductions. Start P at modal n where practical to avoid underflow; otherwise compute initial P via log-gamma and iterate.

Fallback heuristics

  • If r * c (unique_true * unique_pred) > threshold (tunable), fall back to CPU EMI or use the large-N approximation (AMI ≈ NMI); a rough sketch follows. I will add configurable thresholds and documented behavior.
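As a rough illustration of that policy (the threshold value and the function name are placeholders, not decided values or existing cuML APIs):

```python
# Purely illustrative fallback policy for the exact GPU EMI path.
MAX_EXACT_EMI_CELLS = 10_000_000  # tunable: r * c above this avoids the exact GPU path


def _choose_emi_path(n_unique_true, n_unique_pred):
    if n_unique_true * n_unique_pred > MAX_EXACT_EMI_CELLS:
        # fall back to CPU EMI, or use the large-N approximation (AMI ~= NMI)
        return "fallback"
    return "gpu_exact"
```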

Deliverables

  • cpp/src/metrics/expected_mutual_information.cu + bindings
  • Python shim: python/cuml/metrics/cluster/_expected_mutual_info_gpu.py
  • Microbenchmarks comparing CPU reference and GPU kernel.

Stage C — polish, CI & docs

  • Add benchmarks script, performance notes to docs, and CI test selection (fast correctness tests only).
  • Add changelog and example notebooks.

Test & validation plan

  • Unit tests: deterministic edge cases + randomized seeds (imbalanced & extreme).
  • Numerical parity: np.testing.assert_allclose(cuml_ami, sklearn_ami, rtol=1e-8) (relax if GPU rounding differences appear); see the test sketch after this list.
  • Perf tests: representative medium/large workloads (kept out of baseline CI; optional benchmark job).
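A hedged sketch of what the parity test could look like, assuming the proposed cuml.metrics.cluster.adjusted_mutual_info_score is in place (it is not yet):

```python
# Hypothetical parity test; cu_ami refers to the proposed cuML function, which
# does not exist yet.
import numpy as np
import pytest
from sklearn.metrics import adjusted_mutual_info_score as sk_ami

from cuml.metrics.cluster import adjusted_mutual_info_score as cu_ami  # proposed


@pytest.mark.parametrize("seed", [0, 1, 2])
@pytest.mark.parametrize("n_true,n_pred", [(3, 3), (2, 10), (50, 5)])
def test_ami_matches_sklearn(seed, n_true, n_pred):
    rng = np.random.default_rng(seed)
    # skewed class probabilities to exercise imbalanced cluster sizes
    p = rng.dirichlet(np.full(n_true, 0.3))
    labels_true = rng.choice(n_true, size=1000, p=p)
    labels_pred = rng.integers(0, n_pred, size=1000)
    np.testing.assert_allclose(cu_ami(labels_true, labels_pred),
                               sk_ami(labels_true, labels_pred), rtol=1e-8)
```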

Files to change (suggested)

  • python/cuml/metrics/cluster/adjusted_mutual_info_score.py (wrapper)
  • python/cuml/metrics/cluster/_expected_mutual_info_ref.py (CPU EMI)
  • python/cuml/metrics/cluster/_expected_mutual_info_gpu.py (GPU shim, PR2)
  • cpp/src/metrics/expected_mutual_information.cu (PR2)
  • python/cuml/metrics/cluster/tests/test_adjusted_mutual_info.py
  • benchmarks/metrics/ami_bench.py (optional)

References / provenance

  • Implementation will follow the permutation/hypergeometric formulation used by sklearn._expected_mutual_info_fast and Romano et al. (recurrence, complexity bounds, large-N approx). I will cite sklearn + the paper in the PR.

If this approach looks good I will prepare Stage A: a small PR with the Python wrapper, CPU EMI reference implementation, tests, and docs. Please let me know if you want a different fallback policy or stricter tolerances in tests — otherwise I’ll prepare the PR and link it to this issue.

SumitdevelopAI · Oct 09 '25

@SumitdevelopAI Thanks for the detailed plan. While I appreciate the thoroughness, the proposal is overly verbose and over-engineered for what's ultimately a single metric function. Let's simplify the approach:

Please implement everything in a single PR with the following structure:

  1. APIs to add (matching sklearn module structure):

    • cuml.metrics.cluster.adjusted_mutual_info_score(labels_true, labels_pred, *, average_method='arithmetic')
    • cuml.metrics.cluster.expected_mutual_information(contingency)
  2. Implementation approach:

    • Implement the functions adjusted_mutual_info_score and expected_mutual_information in cuml.metrics.cluster._adjusted_mutual_info_score and cuml.metrics.cluster._expected_mutual_information respectively.
    • The adjusted_mutual_info_score function should be importable from cuml.metrics.cluster and cuml.metrics. The expected_mutual_information should be importable only from cuml.metrics.cluster.
    • Reuse existing cuML components where possible (e.g., cuml.metrics.cluster.mutual_info_score())
    • Use CuPy for everything else; essentially translate the sklearn implementation to CuPy arrays/operations (a rough sketch follows this list)
    • If you encounter CuPy limitations during implementation, please raise them and we can discuss workarounds
    • We will evaluate the need for dedicated CUDA kernels and calls into RAFT/cuVS etc. in a second stage
  3. Testing:

    • Compare outputs directly against sklearn.metrics.adjusted_mutual_info_score and sklearn.metrics.cluster.expected_mutual_information
    • No need for a separate CPU reference implementation within cuML; we can use scikit-learn as the reference
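To make the CuPy translation concrete, here is a rough sketch, not a final implementation. Assumptions: expected_mutual_information is the proposed companion function from item 1; entropy is assumed to also be importable from cuml.metrics.cluster (mutual_info_score is mentioned above); only the 'arithmetic' average and the single-cluster degenerate case are shown, so full input validation and the remaining edge cases would still need to be added.

```python
# Rough CuPy sketch of the AMI formula (MI - E[MI]) / (mean(H(U), H(V)) - E[MI]).
import cupy as cp
from cuml.metrics.cluster import entropy, mutual_info_score
from cuml.metrics.cluster import expected_mutual_information  # proposed companion


def adjusted_mutual_info_score(labels_true, labels_pred, *, average_method="arithmetic"):
    labels_true = cp.asarray(labels_true).ravel()
    labels_pred = cp.asarray(labels_pred).ravel()

    # Encode labels and build a dense contingency matrix on the GPU.
    _, ti = cp.unique(labels_true, return_inverse=True)
    _, pi = cp.unique(labels_pred, return_inverse=True)
    n_r, n_c = int(ti.max()) + 1, int(pi.max()) + 1

    # Trivial clusterings (a single cluster on both sides) score 1.0 by convention.
    if n_r == n_c == 1:
        return 1.0

    contingency = cp.bincount(ti * n_c + pi, minlength=n_r * n_c).reshape(n_r, n_c)

    mi = mutual_info_score(labels_true, labels_pred)
    emi = expected_mutual_information(contingency)         # proposed companion function
    h_true, h_pred = entropy(labels_true), entropy(labels_pred)
    normalizer = 0.5 * (h_true + h_pred)                   # 'arithmetic' only, for brevity
    denominator = normalizer - emi
    # Clip the denominator away from zero (same idea as sklearn) to avoid 0/0.
    eps = float(cp.finfo(cp.float64).eps)
    denominator = max(denominator, eps) if denominator >= 0 else min(denominator, -eps)
    return float((mi - emi) / denominator)
```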

Rationale:

  • A three-stage approach delays GPU acceleration unnecessarily. Let's deliver a complete, GPU-accelerated solution in one PR.
  • CuPy provides GPU acceleration and is straightforward to implement and review. We can evaluate custom CUDA kernels as well as RAFT and cuVS calls later.

csadorf · Oct 09 '25

Thanks @csadorf, I’ll follow the simplified approach. I will implement adjusted_mutual_info_score and expected_mutual_information in a single PR (CuPy-based, importable as you described), add tests comparing outputs to sklearn, and open the PR here once ready.

Thanks, I’ll get started.

SumitdevelopAI · Oct 10 '25

Hi @csadorf

I’ve been busy with my semester exams, but I’m now available again and can continue working on whatever parts of the AMI metric still need to be finalized.

Please let me know the current status and what remaining work is required. I’m ready to help move this issue forward and get it resolved.

SumitdevelopAI · Nov 18 '25

@SumitdevelopAI Your previous PR (https://github.com/rapidsai/cuml/pull/7418) had lots of problems. I've pointed them out and closed it. You're welcome to create a new PR to address them if you like.

csadorf · Nov 18 '25

Hi @csadorf, if you would still like me to implement this, please let me know. I'm a first time contributor and would love to help.

mani-builds · Nov 30 '25

@mani-builds Please feel free to open a PR.

csadorf · Dec 01 '25