modAL
Input data with different lengths / filled with NAs
I'm trying to use modAL in combination with tslearn to classify time series of different lengths. tslearn supports variable-length time series by padding the shorter series with NAs, but modAL calls check_X_y(X, y, accept_sparse=True, ensure_2d=False, allow_nd=True, multi_output=True) without setting force_all_finite='allow-nan'.
Is there a reason for not allowing NAs, or did this use case just not come up before?
Thanks a lot!
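To make the problem concrete, here is a minimal sketch of what such padded input looks like and how scikit-learn's default validation reacts to it. The helper `pad_with_nan` is hypothetical and only imitates the padding idea described above with plain NumPy; it is not tslearn's implementation, and the call to `check_X_y` uses default arguments rather than modAL's exact call.

```python
import numpy as np
from sklearn.utils import check_X_y


def pad_with_nan(series_list):
    """Stack 1-D series of unequal length into one 2-D array, padding with NaN.

    Hypothetical helper for illustration only (not part of tslearn or modAL).
    """
    max_len = max(len(s) for s in series_list)
    padded = np.full((len(series_list), max_len), np.nan, dtype=float)
    for row, s in enumerate(series_list):
        padded[row, :len(s)] = s
    return padded


# Three series of lengths 3, 2 and 1 become a 3 x 3 array with NaN padding.
X = pad_with_nan([[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]])
y = np.array([0, 1, 0])

# With default settings, check_X_y refuses any input that contains NaN.
try:
    check_X_y(X, y)
    rejected = False
except ValueError as err:
    rejected = True
    print("check_X_y rejected the padded input:", err)
```

Any learner that validates its input this way fails before the estimator ever sees the data, regardless of whether the estimator itself could handle missing values.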
Seems like a relatively straightforward update might be to add force_all_finite: bool or str = True as a parameter to BaseLearner.__init__ and then pass self.force_all_finite to check_X_y in each function where it is called (_add_training_data, _fit_on_new and fit).
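The suggestion above could look roughly like the following sketch. The class `FiniteCheckSketch` and its `validate` method are hypothetical stand-ins, not modAL's actual `BaseLearner`; only the `check_X_y` arguments are taken from the thread. One further assumption: newer scikit-learn releases appear to have renamed `force_all_finite` to `ensure_all_finite`, so the sketch picks whichever keyword the installed version accepts.

```python
import inspect

import numpy as np
from sklearn.utils import check_X_y

# Assumption: pick the keyword name supported by the installed scikit-learn
# (`force_all_finite` in older releases, `ensure_all_finite` in newer ones).
_FINITE_KWARG = (
    "ensure_all_finite"
    if "ensure_all_finite" in inspect.signature(check_X_y).parameters
    else "force_all_finite"
)


class FiniteCheckSketch:
    """Hypothetical stand-in for a learner that stores the finiteness option."""

    def __init__(self, force_all_finite=True):
        # True rejects NaN and inf, 'allow-nan' accepts NaN only,
        # False accepts both.
        self.force_all_finite = force_all_finite

    def validate(self, X, y):
        # Same arguments as the call quoted in the thread, plus the stored option.
        return check_X_y(
            X, y,
            accept_sparse=True, ensure_2d=False, allow_nd=True, multi_output=True,
            **{_FINITE_KWARG: self.force_all_finite},
        )


X = np.array([[np.nan, 1.0], [2.0, 3.0]])
y = np.array([0, 1])

# The tolerant learner passes NaN through unchanged.
X_checked, y_checked = FiniteCheckSketch(force_all_finite="allow-nan").validate(X, y)

# The default (strict) learner still raises on the same data.
try:
    FiniteCheckSketch().validate(X, y)
    strict_raised = False
except ValueError:
    strict_raised = True
```

Keeping the default at True preserves the existing behaviour, so only users who opt in are exposed to NaNs reaching their estimator.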
Thanks for the observation and sorry for the late answer! I have just reached this task in my backlog :)
This case hasn't come up before. I don't see any reason to not allow NaNs, so we can just set force_all_finite='allow-nan' in every call of check_X_y. I like the solution of @zaksamalik, so I'll add this sometime during this week.
Cool, thanks a lot!
I have fixed the problem and released a new version that includes this fix. Let me know if there is a problem!
Hi, it seems the issue is still present in Ranked batch-mode sampling.
Reprex (mostly from Ranked batch-mode sampling documentation)
import numpy as np
import xgboost as xgb
from functools import partial
from sklearn.datasets import load_iris
from modAL.batch import uncertainty_batch_sampling
from modAL.models import ActiveLearner
iris = load_iris()
X_raw = iris['data']
y_raw = iris['target']
# Isolate our examples for our labeled dataset.
n_labeled_examples = X_raw.shape[0]
training_indices = np.random.randint(low=0, high=n_labeled_examples, size=3)  # high is exclusive
X_train = X_raw[training_indices]
y_train = y_raw[training_indices]
# Isolate the non-training examples we'll be querying.
X_pool = np.delete(X_raw, training_indices, axis=0)
y_pool = np.delete(y_raw, training_indices, axis=0)
# Set one entry of the pool to np.nan.
X_pool[0][0] = np.nan
# Pre-set our batch sampling to retrieve 3 samples at a time.
BATCH_SIZE = 3
preset_batch = partial(uncertainty_batch_sampling, n_instances=BATCH_SIZE)
# Specify our active learning model.
learner = ActiveLearner(
    estimator=xgb.XGBClassifier(),
    X_training=X_train,
    y_training=y_train,
    query_strategy=preset_batch,
    force_all_finite=False
)
query_index, query_instance = learner.query(X_pool)
Error message
ValueError Traceback (most recent call last)
in
40 )
41
---> 42 query_index, query_instance = learner.query(X_pool)
~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/modAL/models/base.py in query(self, *query_args, **query_kwargs)
201 labelled upon query synthesis.
--> 203 query_result = self.query_strategy(self, *query_args, **query_kwargs)
204 return query_result
205
~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/modAL/batch.py in uncertainty_batch_sampling(classifier, X, n_instances, metric, n_jobs, **uncertainty_measure_kwargs)
208 uncertainty = classifier_uncertainty(classifier, X, **uncertainty_measure_kwargs)
209 query_indices = ranked_batch(classifier, unlabeled=X, uncertainty_scores=uncertainty,
--> 210 n_instances=n_instances, metric=metric, n_jobs=n_jobs)
211 return query_indices, X[query_indices]
~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/modAL/batch.py in ranked_batch(classifier, unlabeled, uncertainty_scores, n_instances, metric, n_jobs)
161 instance_index, instance, mask = select_instance(X_training=labeled, X_pool=unlabeled,
162 X_uncertainty=uncertainty_scores, mask=mask,
--> 163 metric=metric, n_jobs=n_jobs)
164
165 # Add our instance we've considered for labeling to our labeled set. Although we don't
~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/modAL/batch.py in select_instance(X_training, X_pool, X_uncertainty, mask, metric, n_jobs)
97 _, distance_scores = pairwise_distances_argmin_min(X_pool_masked.reshape(n_unlabeled, -1),
98 X_training.reshape(n_labeled_records, -1),
---> 99 metric=metric)
100 else:
101 distance_scores = pairwise_distances(X_pool_masked.reshape(n_unlabeled, -1),
~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in pairwise_distances_argmin_min(X, Y, axis, metric, metric_kwargs)
573 sklearn.metrics.pairwise_distances_argmin
--> 575 X, Y = check_pairwise_arrays(X, Y)
576
577 if metric_kwargs is None:
~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/sklearn/metrics/pairwise.py in check_pairwise_arrays(X, Y, precomputed, dtype, accept_sparse, force_all_finite, copy)
139 X = check_array(X, accept_sparse=accept_sparse, dtype=dtype,
140 copy=copy, force_all_finite=force_all_finite,
--> 141 estimator=estimator)
142 Y = check_array(Y, accept_sparse=accept_sparse, dtype=dtype,
143 copy=copy, force_all_finite=force_all_finite,
~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/sklearn/utils/validation.py in check_array(array, accept_sparse, accept_large_sparse, dtype, order, copy, force_all_finite, ensure_2d, allow_nd, ensure_min_samples, ensure_min_features, warn_on_dtype, estimator)
576 if force_all_finite:
577 _assert_all_finite(array,
--> 578 allow_nan=force_all_finite == 'allow-nan')
579
580 if ensure_min_samples > 0:
~/.pyenv/versions/python-3.7.4/lib/python3.7/site-packages/sklearn/utils/validation.py in _assert_all_finite(X, allow_nan, msg_dtype)
58 msg_err.format
59 (type_err,
---> 60 msg_dtype if msg_dtype is not None else X.dtype)
61 )
62 # for object dtype data, we only check for NaNs (GH-13254)
ValueError: Input contains NaN, infinity or a value too large for dtype('float64').
Hi!
This seems like a scikit-learn issue :( The function pairwise_distances_argmin_min is called, which throws the error upon encountering the NaN. Unfortunately, there is no way to control this check from the function's arguments.
Similarly, if you set force_all_finite=False but use an estimator which doesn't support NaNs (like the ones in scikit-learn), it won't work, even though modAL allows you to use data with NaNs.
Do you have any suggestions on how to solve this? At the moment, I don't see a proper solution, but this doesn't mean that there isn't one. (I don't want to remove NaNs internally before passing the data to the external functions, because this would remain hidden from the user, possibly causing unintended consequences.)
Hi! As you correctly mentioned, this should only work for models that can handle missing values, such as newer boosting methods (e.g. xgboost).
Alternatively, the nan_euclidean_distances function could solve the issue, at the expense of limiting the distance metric to Euclidean. Thoughts?
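As a rough illustration of that idea, the sketch below uses scikit-learn's `nan_euclidean_distances` to get, for each pool instance, the distance to its nearest training instance, which is the quantity the failing `pairwise_distances_argmin_min` call returns in the traceback above. The array values and variable names are made up for the example, and this is not how modAL implements it.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Toy data (assumed for illustration): two labeled and two pool instances.
X_training = np.array([[0.0, 0.0], [10.0, 10.0]])
X_pool = np.array([[np.nan, 3.0], [9.0, 9.0]])

# Full pool-by-training distance matrix. Missing coordinates are skipped and
# the remaining squared differences are rescaled by
# (number of features) / (number of present features).
distances = nan_euclidean_distances(X_pool, X_training)

# Equivalent of pairwise_distances_argmin_min: nearest training instance
# and the distance to it, for every pool instance.
nearest_index = distances.argmin(axis=1)
distance_scores = distances.min(axis=1)

print(nearest_index)
print(distance_scores)
```

Unlike `pairwise_distances_argmin_min`, this materialises the whole distance matrix, so it may need more memory on large pools, and it only covers the Euclidean metric.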
That is a good idea! I am going to take a shot at this. I can't promise to do it ASAP since I am extremely busy with other work, but I'll try to do it this month.