libact
Identify whether the relabeling in sklearn will cause problems
sklearn internally relabels the given labels to 0 through n_labels - 1. If I understand correctly, it does this in the order in which the data are passed to the fit method. So if updating an unlabeled sample changes the order of the data passed to fit, the values returned by our model's predict_real method might be in the wrong order. One proposal for solving this is to manage the relabeling ourselves in the model classes.
@yangarbiter could you elaborate on the details of this issue?
@lazywei I've updated the details. Thanks.
Are you talking about sklearn.preprocessing.LabelEncoder? If so, they indeed do that by applying np.unique to the label vector.
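To make the np.unique behaviour concrete, here is a minimal sketch using NumPy only (the arrays are made-up examples, not data from libact). np.unique returns the distinct values in sorted order, so the encoding does not depend on the order of the rows; the position of a class only shifts when the set of distinct labels itself changes.

```python
import numpy as np

# The same labels in two different row orders (illustrative data).
y_a = np.array([2, 0, 1, 2, 0])
y_b = np.array([0, 2, 2, 1, 0])

# np.unique returns the sorted distinct values, whatever the input order.
classes_a = np.unique(y_a)
classes_b = np.unique(y_b)
print(classes_a)  # [0 1 2]
print(classes_b)  # [0 1 2]

# The position of a class changes only when a new distinct label appears.
before = np.array([0, 2, 2, 0])
after = np.array([0, 2, 1, 0])  # a newly labeled sample introduces class 1
col_before = int(np.searchsorted(np.unique(before), 2))
col_after = int(np.searchsorted(np.unique(after), 2))
print(col_before, col_after)  # 1 2
```

So, under this reading, reordering the training rows alone should not move the columns, but a label that shows up for the first time can.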
How about moving the missing labels to the end of the label vector and maintaining an "internal index" in our model?
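    import numpy as np

    def get_internal_idx(y):
        """Return row indices with labeled samples first and nan (unlabeled) samples last."""
        n_samples = y.shape[0]
        # Positions of the missing (nan) labels.
        nan_idx = np.argwhere(np.isnan(y))[:, 0]
        ttl_idx = list(range(n_samples))
        s = set(nan_idx.tolist())
        # Keep the labeled indices in their original order, then append the nan indices.
        return [_idx for _idx in ttl_idx if _idx not in s] + nan_idx.tolist()

    y = np.array([np.nan, 0, 1, np.nan, 1])
    intl_idx = get_internal_idx(y)
    print(intl_idx)
    # => [1, 2, 4, 0, 3]

    intl_y = y[intl_idx]
    print(intl_y)
    # => [ 0.  1.  1. nan nan]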
The check in return [_idx for ...] preserves order. And if performance is a concern, it is better than O(m * n), where m and n are the numbers of non-nan and nan elements, because membership is tested against a set.
We can compute this intl_idx the first time we train. For future updates, we always stick to this index. Then we are guaranteed to have the same "label order" no matter which labels those nans are given when updating.
What do you think?
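A small sketch of what "sticking to this index" could look like. As an assumption for brevity, it builds the index with a stable np.argsort on the nan mask instead of get_internal_idx; for this input it yields the same ordering. The update step (assigning a label to row 0) is a made-up example.

```python
import numpy as np

y = np.array([np.nan, 0, 1, np.nan, 1])

# Computed once: labeled rows first, nan rows last. A stable sort keeps
# the original relative order inside each group.
intl_idx = np.argsort(np.isnan(y), kind="stable")
print(intl_idx.tolist())  # [1, 2, 4, 0, 3]

# Later, the sample at position 0 receives a label. The index is NOT
# recomputed, so every row keeps its internal position.
y_updated = y.copy()
y_updated[0] = 1

print(y[intl_idx])          # [ 0.  1.  1. nan nan]
print(y_updated[intl_idx])  # [ 0.  1.  1.  1. nan]
```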
I think maintaining an "internal index" and decoding it before the output leaves our model should be enough (we can swap the rows into the right order before returning from predict_real). And we can simply add a new label to the internal labels when a new label arrives ( #9 ).
Though if we want to move the missing labels to the end of the label vector, we might also have to consider issue #9 . If we do so, it might be harder to extend to the situation where the labeled pool contains only a subset of all possible labels.
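The internal label index idea could be sketched roughly as follows. This is only an illustration in plain Python: the class LabelIndex and its methods are hypothetical names, not part of libact or sklearn. New labels are appended at the end, so codes that were already handed out never move.

```python
class LabelIndex:
    """Hypothetical helper: maps external labels to stable internal codes."""

    def __init__(self):
        self._code_of = {}  # label -> internal code
        self._labels = []   # internal code -> label

    def encode(self, label):
        # A label seen for the first time is appended at the end,
        # so existing codes are never reassigned.
        if label not in self._code_of:
            self._code_of[label] = len(self._labels)
            self._labels.append(label)
        return self._code_of[label]

    def decode(self, code):
        return self._labels[code]

    def reorder_scores(self, scores, label_order):
        # Rearrange per-class scores from internal order into label_order.
        return [scores[self._code_of[label]] for label in label_order]


index = LabelIndex()
codes = [index.encode(lbl) for lbl in ["b", "a", "b"]]
print(codes)  # [0, 1, 0]

index.encode("c")         # a new label arrives later
print(index.encode("b"))  # 0, unchanged

# Scores in internal order (b, a, c), returned in the caller's order.
print(index.reorder_scores([0.7, 0.2, 0.1], ["a", "b", "c"]))  # [0.2, 0.7, 0.1]
```

With something like this, the output of predict_real could be reordered just before it is returned, and a labeled pool that covers only a subset of the labels would not need special handling.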