libact Identify whether the relabeling in sklearn will cause problems

sklearn internally relabels the given labels to integers from 0 to n_labels - 1. If I understand correctly, it does this in the order in which the data is passed to the fit method. So if updating an unlabeled sample changes the order of the data passed to fit, the values returned by our model's predict_real method might come back in the wrong order. One way to solve this would be to manage the relabeling ourselves in the model classes.

Dec 23 '15 06:12 yangarbiter

@yangarbiter could you elaborate on the details of this issue?

Jan 01 '16 14:01 lazywei

@lazywei I've updated the details. Thanks.

Jan 01 '16 14:01 yangarbiter

Are you talking about sklearn.preprocessing.LabelEncoder? If so, it does indeed do that, by applying np.unique to the label vector.
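To make the np.unique behaviour concrete, here is a minimal sketch using only NumPy (it does not call sklearn itself, and the variable names are just for illustration). It suggests that the codes depend on which labels are present rather than on sample order:

```python
import numpy as np

# Same three labels, presented in two different sample orders.
y_first = np.array([2, 0, 1, 2])
y_second = np.array([1, 2, 2, 0])

classes_first, codes_first = np.unique(y_first, return_inverse=True)
classes_second, codes_second = np.unique(y_second, return_inverse=True)
print(classes_first, classes_second)  # [0 1 2] [0 1 2]

# If label 1 is absent from the labeled data, label 2 shifts to code 1.
y_partial = np.array([2, 0, 2])
classes_partial, codes_partial = np.unique(y_partial, return_inverse=True)
print(classes_partial, codes_partial)  # [0 2] [1 0 1]
```

So the case to watch seems to be a label that only appears after a later update, since that changes which column belongs to which label.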

How about moving the missing labels to the end of the label vector and maintaining an "internal index" in our model?

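import numpy as np


def get_internal_idx(y):
    """Return row indices with labeled entries first and NaN (unlabeled) entries last."""
    n_samples = y.shape[0]

    # Positions of the unlabeled (NaN) entries.
    nan_idx = np.argwhere(np.isnan(y))[:, 0]
    ttl_idx = list(range(n_samples))

    # A set keeps each membership test O(1) on average.
    s = set(nan_idx)
    return [_idx for _idx in ttl_idx if _idx not in s] + nan_idx.tolist()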


y = np.array([np.nan, 0, 1, np.nan, 1])

intl_idx = get_internal_idx(y)
print(intl_idx)
# => [1, 2, 4, 0, 3]

intl_y = y[intl_idx]
print(intl_y)
# => [  0.   1.   1.  nan  nan]

The list comprehension in the return statement preserves the original order. If performance is a concern, the set lookup keeps this check at roughly O(m + n) rather than O(m * n), where m and n are the numbers of non-NaN and NaN elements.

We can compute this intl_idx the first time we train, and always stick to it for future updates. Then we are guaranteed to have the same "label order" no matter which labels those NaNs are given when updating.

What do you think?

Jan 02 '16 12:01 lazywei

I think maintaining an "internal index" and decoding it before our model outputs anything should be enough (we can swap the rows into the right order before returning from predict_real). We can then simply add a new label to the internal labels when a new one comes in ( #9 ).

That said, if we want to move the missing labels to the end of the label vector, we may also have to consider issue #9, since doing so might make it harder to extend to the case where the labeled pool contains only a subset of all possible labels.
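A rough sketch of what I mean by an internal index, just to make the idea concrete. InternalLabelMap, encode and align_scores are hypothetical names, not existing libact or sklearn API, and filling unseen labels with NaN is only one possible choice:

```python
import numpy as np


class InternalLabelMap:
    """Hypothetical helper that fixes a label -> internal index mapping."""

    def __init__(self):
        self.labels = []  # internal index -> original label
        self.index = {}   # original label -> internal index

    def encode(self, y):
        """Map labels to internal codes, appending unseen labels at the end."""
        codes = []
        for label in y:
            if label not in self.index:
                self.index[label] = len(self.labels)
                self.labels.append(label)
            codes.append(self.index[label])
        return np.array(codes)

    def align_scores(self, scores, model_classes):
        """Spread score columns (ordered as model_classes) over all known labels.

        model_classes holds the internal codes the underlying model was
        trained on; labels it has not seen get NaN (an assumption).
        """
        aligned = np.full((scores.shape[0], len(self.labels)), np.nan)
        aligned[:, model_classes] = scores
        return aligned


label_map = InternalLabelMap()
print(label_map.encode(["b", "a", "b"]))  # [0 1 0]
print(label_map.encode(["c", "a"]))       # [2 1]

# Scores from a model that was trained on internal codes 0 and 2 only.
scores = np.array([[0.3, 0.7]])
aligned = label_map.align_scores(scores, np.array([0, 2]))
```

Because new labels are only ever appended, the column for an already-known label never moves, which should also cover the subset-of-labels case in #9.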

Jan 04 '16 02:01 yangarbiter
