
Ranked batch mode sampling not compatible with sklearn's transformation+classification pipeline

Open BoyanH opened this issue 4 years ago • 5 comments

It is common to build an sklearn pipeline that includes the necessary data preprocessing (and feature encoding) steps and ends with an estimator (for an example, see Column Transformer with Mixed Types in the sklearn docs). The pipeline built this way can then be used like a normal classifier, where the fit method also fits the corresponding data transformers and transforms the data.
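
For illustration, a minimal sketch of such a pipeline, loosely following sklearn's "Column Transformer with Mixed Types" example (the column names and the classifier choice here are illustrative, not taken from the issue):

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.linear_model import LogisticRegression

preprocessor = ColumnTransformer([
    ("num", StandardScaler(), ["age", "fare"]),                            # numeric columns
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["embarked", "sex"]),  # categorical columns
])

clf = Pipeline([
    ("preprocess", preprocessor),
    ("classifier", LogisticRegression()),
])

# clf.fit(X, y) fits the transformers and the final estimator in one call;
# clf.predict(X) transforms and classifies.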

However, in batch.py::select_instance(), the (dis)similarity between the training data and the instance pool is computed directly, without any data transformation:

_, distance_scores = pairwise_distances_argmin_min(X_pool_masked.reshape(n_unlabeled, -1),
                                                   X_training.reshape(n_labeled_records, -1),
                                                   metric=metric)

This is not optimal, as any feature engineering and transformations are ignored. Furthermore, it fails outright if one is using a pandas DataFrame to hold the data set, since DataFrames do not support the reshape calls above.
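
The DataFrame failure is easy to reproduce (a minimal sketch; the column names are made up):

import pandas as pd

X_pool = pd.DataFrame({"age": [25, 40], "fare": [7.25, 71.28]})

# Mimicking the reshape call from batch.py::select_instance():
X_pool.reshape(len(X_pool), -1)
# AttributeError: 'DataFrame' object has no attribute 'reshape'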

BoyanH · Sep 21 '20

I see the problem, although I am not sure what the optimal solution would be. In a sense, this problem holds for all models in modAL. I am also unsure whether the query strategy should even use transformed data when the transformation itself is learned, so the current approach might be theoretically sound: by using transformed data from a learned transformation, you essentially allow bias from the estimators to leak into the instance selection.

Do you have any suggestions? I lack the proper knowledge in this area, so perhaps you can come up with a better solution :)

Regarding pandas DataFrames: these are currently not supported (see #20).

cosmic-cortex · Sep 22 '20

I agree that using learned transformations results in bias leaking into the instance selection, but I think this might be beneficial in certain scenarios. When the transformations are learned in an unsupervised manner (e.g. PCA, transforming text data via topic detection (LDA), or encoding images in terms of patterns detected within), they can be viewed as feature engineering that converts a difficult classification problem into an easier one. The transformed data can then be viewed as an easier data set, on which a classifier is trained in an active learning manner.

In cases where the data is non-numeric (e.g. text), transformations are essential for computing similarities between instances. As I am working on such a problem, I will try to come up with a solution and share it. I think all query strategies that work with instance representations should have a configuration option for whether transformed or original data is used (one possible shape for this is sketched below).

Please do take my words with a grain of salt; I also lack proper knowledge in this area and am just sharing an opinion.
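
A minimal sketch of one way to run everything in the transformed space: the transformation is fitted once up front, and both uncertainty and similarity are then computed on transformed data. The names and the toy data are made up; this is not modAL API, just an illustration:

import numpy as np
from sklearn.base import clone
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from modAL.models import ActiveLearner
from modAL.batch import uncertainty_batch_sampling

rng = np.random.default_rng(0)
X_initial = rng.normal(size=(10, 4))
y_initial = np.array([0, 1] * 5)              # toy labels, both classes present
X_pool = rng.normal(size=(50, 4))

# The full pipeline, fitted once on the initial labeled data.
pipe = Pipeline([("scale", StandardScaler()), ("clf", LogisticRegression())])
pipe.fit(X_initial, y_initial)

# Active learning then operates entirely in the transformed space:
preprocessor = pipe[:-1]                      # the fitted transformation steps
learner = ActiveLearner(
    estimator=clone(pipe[-1]),                # a fresh copy of the bare classifier
    query_strategy=uncertainty_batch_sampling,
    X_training=preprocessor.transform(X_initial),
    y_training=y_initial,
)
query_idx, _ = learner.query(preprocessor.transform(X_pool), n_instances=2)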

BoyanH · Sep 23 '20

scikit-learn's transformations are learned, but every transformation I can think of is learned in an unsupervised fashion (pca, lda, svd, tfidf, ...). In that case they don't introduce bias, because when you add data to your training set thanks to active learning, the transformation is not affected (although refitting the transformation at each step of the active learning process would be inefficient anyway).

edit: actually LDA (linear discriminant analysis) is supervised, so that one could introduce bias; I don't know if there are many other "supervised" transformers.
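
For illustration, a minimal sketch of the fit-once behaviour described above (the data here is made up): an unsupervised transformation fitted once is unaffected by whatever instances active learning adds later, unless it is explicitly refitted.

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))

pca = PCA(n_components=2).fit(X)   # fitted once, before active learning starts

# Newly labeled instances are projected with the components learned above;
# the transformation itself stays fixed unless fit is explicitly called again.
x_new = rng.normal(size=(1, 5))
z = pca.transform(x_new)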

damienlancry · Oct 21 '20

If I understood @cosmic-cortex correctly, the bias from the estimators leaking into the instance selection need not be in the form of class label information.

Imagine the following dataset and transformation:

import numpy as np

X = np.array([
    [0, 0],
    [1, 0],
    [0, 1],
    [1, 1]])

# A transformation that simply drops the first feature:
X_transformed = X[:, 1:]
# => np.array([
#      [0],
#      [0],
#      [1],
#      [1]])

Now imagine we are using ranked batch-mode active learning to select 2 of the 4 instances to be labeled in one batch, and the model uncertainty is equal for all instances. Let's also assume that on each ranked batch-mode AL iteration, if there are multiple instances with equal scores, the first in the list is selected (as in the current implementation).

If we are using the transformed data, this would be the result, since e.g. [0, 0] and [1, 0] are identical in the transformed space (their distance is 0):

np.array([
    [0, 0],
    [0, 1]])

Whereas on the non-transformed data:

np.array([
    [0, 0],
    [1, 1]])

The classifier using this transformation will trivially fail to distinguish between instances that differ only in the first dimension. But if we are also using ranked batch-mode AL on the transformed data, any classifier trained on the collected data will fail to do so as well, since the first dimension of all retrieved instances is always 0 and thereby non-informative, so classifiers will probably ignore it.

NOTE: If this example seems too minimalistic, imagine the same concept with a transformation that drops 1 out of n dimensions.
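
The selection difference can be verified with the same distance call used in batch.py (a minimal sketch, assuming equal uncertainty everywhere so that only distances matter, and that [0, 0] was selected first):

import numpy as np
from sklearn.metrics import pairwise_distances_argmin_min

X = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
labeled, pool = X[:1], X[1:]           # [0, 0] was selected first

# Original space: [1, 1] is farthest from the labeled set and is picked next.
_, d_orig = pairwise_distances_argmin_min(pool, labeled)
print(d_orig)                          # [1.    1.    1.414]

# Transformed space (first feature dropped): [1, 0] collapses onto [0, 0],
# and the first of the remaining ties, [0, 1], is picked next.
_, d_trans = pairwise_distances_argmin_min(pool[:, 1:], labeled[:, 1:])
print(d_trans)                         # [0. 1. 1.]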


Still, as described in my previous comment, I believe there are scenarios where using transformed data can be beneficial, and it should be up to the end user to decide. Returning to the example above, if we have the domain knowledge to be certain that the first dimension of our dataset is irrelevant for the classification problem, we would be better off using the transformed data.

Furthermore, if our classifier is using transformations, some bias from these transformations will eventually leak into the instance selection anyway, in the form of classifier uncertainty.

BoyanH · Oct 21 '20

Ah, I was only thinking about uncertainty selection; I did not think about information density or ranked batch mode. In those cases, yes, I think the transformation will have an influence, and I agree that it would be beneficial to compute density on the transformed space rather than the original space, or to give the user the option to choose.

damienlancry · Oct 22 '20