scikit-tree icon indicating copy to clipboard operation
scikit-tree copied to clipboard

Specifying dim_contiguous causes seg faults in PatchObliqueForestClassifier

Open j1c opened this issue 1 year ago • 2 comments

Checklist

  • [x] I have verified that the issue exists against the main branch.
  • [ ] I have read the relevant section in the contribution guide on reporting bugs.
  • [ ] I have checked the issues list for similar or identical bug reports.
  • [ ] I have checked the pull requests list for existing proposed fixes.
  • [ ] I have checked the CHANGELOG and the commit log to find out if the bug was already fixed in the main branch.
  • [ ] I have included in the "Description" section below a traceback from any exceptions related to this bug.
  • [ ] I have included in the "Related issues or possible duplicates" section beloew all related issues and possible duplicate issues (If there are none, check this box anyway).
  • [ ] I have included in the "Environment" section below the name of the operating system and Python version that I was using when I discovered this bug.
  • [ ] I have included in the "Environment" section below the output of pip freeze.
  • [x] I have included in the "Steps to reproduce" section below a minimally reproducible example.

Description

When there are non-contiguous dimensions, there are segmentation faults when fitting. See below for example.

Setting dim_contiguous=(True, True) runs fine, but dim_contiguous=(False, True) leads to seg faults.

Python traceback:

Related issues or possible duplicates

  • None

Environment

OS: Ubuntu

Python version: 3.9.16

scikit-tree version: 0.6.1

Output of pip freeze:

Steps to reproduce

Example source:

import numpy as np
from sktree import PatchObliqueRandomForestClassifier

n, a, b = 100, 10, 10

x = np.random.normal(size=(n, a, b))
y = np.random.binomial(1, 0.5, size=(n))

rf = PatchObliqueRandomForestClassifier(
    n_estimators=100,
    min_patch_dims=[1, 1],
    max_patch_dims=[4, 4],
    dim_contiguous=(False, True),
    data_dims=(a, b),
    n_jobs=-1,
)

rf.fit(x.reshape(100, -1), y)

j1c avatar Feb 07 '24 07:02 j1c

It seems this test works when n_estimators is low, but not when I ramp to 50 or maybe 100. This makes me think there is a sampling edge case error where we go out of the bounds of the image:

def test_contiguous_and_discontiguous_patch_forest():
    """Test regression reported in https://github.com/neurodata/scikit-tree/issues/215."""
    n, a, b = 100, 10, 10
    x = np.random.normal(size=(n, a, b))
    y = np.random.binomial(1, 0.5, size=(n))

    est = PatchObliqueRandomForestClassifier(
        n_estimators=20,
        min_patch_dims=[1, 1],
        max_patch_dims=[4, 4],
        dim_contiguous=(False, True),
        data_dims=(a, b),
        n_jobs=-1
    )

    est.fit(x.reshape(100, -1), y)

adam2392 avatar Feb 07 '24 16:02 adam2392

I think the issue is isolated to when dim_contiguous=False you can test this with a single tree iterated over 100-1000 different random seeds. When dim_contiguous=True , seems there is consistently no error.

It must occur in accessing an index that is not valid in one of two areas:

  • https://github.com/neurodata/scikit-tree/blob/18c2f457b7fe35d3ab95fdd48e4212d325321b09/sktree/tree/manifold/_morf_splitter.pyx#L339-L350
  • https://github.com/neurodata/scikit-tree/blob/18c2f457b7fe35d3ab95fdd48e4212d325321b09/sktree/tree/manifold/_morf_splitter.pyx#L375-L401

adam2392 avatar Feb 07 '24 17:02 adam2392