Input data with different lengths / filled with NAs

I'm trying to use modAL in combination with tslearn to classify time series of different lengths. tslearn supports variable-length time series by padding the shorter ones with NaNs, but modAL calls

check_X_y(X, y, accept_sparse=True, ensure_2d=False, allow_nd=True, multi_output=True)

without setting force_all_finite='allow-nan'. Is there a reason for not allowing NaNs, or has this use case just not come up before?
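For illustration, a minimal example of the difference, calling scikit-learn's check_X_y directly (the padded array here is made up):

import numpy as np
from sklearn.utils import check_X_y

# Two "time series" padded to equal length with NaN:
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, np.nan]])
y = np.array([0, 1])

# Default validation rejects the NaN:
# check_X_y(X, y)  # ValueError: Input contains NaN ...

# With 'allow-nan', NaNs pass validation (inf is still rejected):
X_checked, y_checked = check_X_y(X, y, force_all_finite='allow-nan')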

Thanks a lot!

nightscape avatar Oct 31 '19 19:10 nightscape

Seems like a relatively straightforward update might be to add force_all_finite: bool or str = True as a parameter to BaseLearner's __init__ and then pass self.force_all_finite to check_X_y in each function where it is called (_add_training_data, _fit_on_new and fit). A rough sketch of the idea is below.
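A simplified, hypothetical sketch of what that could look like (modAL's actual BaseLearner also takes an estimator, a query strategy, etc.):

from typing import Union
from sklearn.utils import check_X_y

class BaseLearner:
    def __init__(self, force_all_finite: Union[bool, str] = True):
        # True, False, or 'allow-nan', forwarded verbatim to check_X_y
        self.force_all_finite = force_all_finite

    def _add_training_data(self, X, y):
        # _fit_on_new and fit would pass the flag the same way
        check_X_y(X, y, accept_sparse=True, ensure_2d=False, allow_nd=True,
                  multi_output=True, force_all_finite=self.force_all_finite)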

zaksamalik avatar Nov 06 '19 21:11 zaksamalik

Thanks for the observation and sorry for the late answer! I have just reached this task in my backlog :)

This case hasn't come up before. I don't see any reason not to allow NaNs, so we can just set force_all_finite='allow-nan' in every call of check_X_y. I like @zaksamalik's solution, so I'll add it sometime this week.

cosmic-cortex avatar Nov 07 '19 08:11 cosmic-cortex

Cool, thanks a lot!

nightscape avatar Nov 08 '19 08:11 nightscape

I have fixed the problem and released a new version that includes the fix. Let me know if you run into any issues!

cosmic-cortex avatar Nov 11 '19 10:11 cosmic-cortex

Hi, it seems the issue is still present in Ranked batch-mode sampling.

Reprex (mostly adapted from the Ranked batch-mode sampling documentation):

import numpy as np
import xgboost as xgb
from functools import partial
from sklearn.datasets import load_iris
from modAL.batch import uncertainty_batch_sampling
from modAL.models import ActiveLearner

iris = load_iris()
X_raw = iris['data']
y_raw = iris['target']

# Isolate our examples for our labeled dataset.
n_labeled_examples = X_raw.shape[0]
training_indices = np.random.randint(low=0, high=n_labeled_examples, size=3)

X_train = X_raw[training_indices]
y_train = y_raw[training_indices]

# Isolate the non-training examples we'll be querying.
X_pool = np.delete(X_raw, training_indices, axis=0)
y_pool = np.delete(y_raw, training_indices, axis=0)

# Set one entry in the pool to np.nan
X_pool[0][0] = np.nan

# Pre-set our batch sampling to retrieve 3 samples at a time.
BATCH_SIZE = 3
preset_batch = partial(uncertainty_batch_sampling, n_instances=BATCH_SIZE)

# Specify our active learning model.
learner = ActiveLearner(
    estimator=xgb.XGBClassifier(),
    X_training=X_train,
    y_training=y_train,
    query_strategy=preset_batch,
    force_all_finite=False
)

query_index, query_instance = learner.query(X_pool)

Error message

ValueError                                Traceback (most recent call last)
<ipython-input> in <module>
     40 )
     41 
---> 42 query_index, query_instance = learner.query(X_pool)

~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/modAL/models/base.py in query(self, *query_args, **query_kwargs)
    201             labelled upon query synthesis.
--> 203         query_result = self.query_strategy(self, *query_args, **query_kwargs)
    204         return query_result
    205 

~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/modAL/batch.py in uncertainty_batch_sampling(classifier, X, n_instances, metric, n_jobs, **uncertainty_measure_kwargs)
    208     uncertainty = classifier_uncertainty(classifier, X, **uncertainty_measure_kwargs)
    209     query_indices = ranked_batch(classifier, unlabeled=X, uncertainty_scores=uncertainty,
--> 210                                  n_instances=n_instances, metric=metric, n_jobs=n_jobs)
    211     return query_indices, X[query_indices]

~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/modAL/batch.py in ranked_batch(classifier, unlabeled, uncertainty_scores, n_instances, metric, n_jobs)
    161         instance_index, instance, mask = select_instance(X_training=labeled, X_pool=unlabeled,
    162                                                          X_uncertainty=uncertainty_scores, mask=mask,
--> 163                                                          metric=metric, n_jobs=n_jobs)
    164 
    165         # Add our instance we've considered for labeling to our labeled set. Although we don't

~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/modAL/batch.py in select_instance(X_training, X_pool, X_uncertainty, mask, metric, n_jobs)
     97         _, distance_scores = pairwise_distances_argmin_min(X_pool_masked.reshape(n_unlabeled, -1),
     98                                                            X_training.reshape(n_labeled_records, -1),
---> 99                                                            metric=metric)
    100     else:
    101         distance_scores = pairwise_distances(X_pool_masked.reshape(n_unlabeled, -1),

~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in pairwise_distances_argmin_min(X, Y, axis, metric, metric_kwargs)
    573     sklearn.metrics.pairwise_distances_argmin
--> 575     X, Y = check_pairwise_arrays(X, Y)
    576 
    577     if metric_kwargs is None:

~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype, accept_sparse, force_all_finite, copy)
    139         X = check_array(X, accept_sparse=accept_sparse, dtype=dtype,
    140                         copy=copy, force_all_finite=force_all_finite,
--> 141                         estimator=estimator)
    142         Y = check_array(Y, accept_sparse=accept_sparse, dtype=dtype,
    143                         copy=copy, force_all_finite=force_all_finite,

~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
    576         if force_all_finite:
    577             _assert_all_finite(array,
--> 578                                allow_nan=force_all_finite == 'allow-nan')
    579 
    580     if ensure_min_samples > 0:

~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
     58                     msg_err.format
     59                     (type_err,
---> 60                      msg_dtype if msg_dtype is not None else X.dtype)
     61             )
     62     # for object dtype data, we only check for NaNs (GH-13254)

ValueError: Input contains NaN, infinity or a value too large for dtype('float64').

ciberger avatar May 15 '20 15:05 ciberger

Hi!

This seems like a scikit-learn limitation :( The ranked-batch code calls pairwise_distances_argmin_min, which throws the error when it encounters the NaN. Unfortunately, that function does not expose a force_all_finite parameter, so there is no way to relax the check from modAL's side.
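The behaviour can be reproduced with scikit-learn alone; the arrays below are just made-up examples:

import numpy as np
from sklearn.metrics import pairwise_distances_argmin_min

X_pool = np.array([[0.0, np.nan]])
X_train = np.array([[1.0, 2.0]])

# Raises ValueError: Input contains NaN ... - the function has no
# force_all_finite argument the caller could set.
pairwise_distances_argmin_min(X_pool, X_train, metric='euclidean')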

Similarly, if you set force_all_finite=False but use an estimator that doesn't support missing values (like most of the ones in scikit-learn), it won't work, even though modAL itself allows data with NaNs.

Do you have any suggestions on how to solve this? At the moment I don't see a proper solution, but that doesn't mean there isn't one. (I don't want to silently drop NaNs before passing the data to the external functions, because this would be hidden from the user and could have unintended consequences.)

cosmic-cortex avatar May 16 '20 07:05 cosmic-cortex

Hi! As you correctly mentioned, this would only work for models that can handle missing values, such as the newer boosting libraries (e.g. xgboost).

Alternatively, the nan_euclidean_distances function could solve the issue, at the expense of limiting the distance metric to Euclidean. Something like the sketch below. Thoughts?
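A minimal sketch of the idea (requires scikit-learn >= 0.22, where nan_euclidean_distances was added; the arrays are made up):

import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

X_pool = np.array([[0.0, np.nan],
                   [1.0, 1.0]])
X_train = np.array([[1.0, 2.0]])

# NaN coordinates are skipped and the remaining ones are re-weighted,
# so the result stays finite despite the missing value:
distances = nan_euclidean_distances(X_pool, X_train)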

ciberger avatar May 17 '20 16:05 ciberger

That is a good idea! I am going to take a shot at this. I can't promise to do it ASAP since I am extremely busy with other work, but I'll try to get to it this month.

cosmic-cortex avatar May 17 '20 19:05 cosmic-cortex