modAL
.query on large DataFrame yields "None of [Int64Index([...] dtype='int64')] are in the [columns]"
This simple example:
import pandas as pd
from sklearn.linear_model import LogisticRegression
from modAL.models import ActiveLearner
X = pd.DataFrame([[1],[2],[3]])
y = pd.Series([True, False, False])
my_learner = ActiveLearner(estimator=LogisticRegression(), X_training=X, y_training=y)
df = pd.concat([X]*2000)
query_idx, _ = my_learner.query(df, n_instances=100)
yields:
KeyError: "None of [Int64Index([1665, 1662, 5412, 3399, 1758, 4866, 1755, 3402, 1752, 5415, 3405,\n 1749, 1746, 3408, 1743, 5418, 4863, 1740, 3411, 1737, 3414, 1734,\n 5421, 1731, 3417, 1728, 4860, 3420, 1725, 5424, 1722, 3423, 1719,\n 3426, 1716, 5427, 1713, 4857, 3429, 1710, 3432, 1707, 5430, 1704,\n 3435, 1701, 4854, 1698, 5433, 3438, 1695, 3441, 1692, 1689, 5436,\n 3444, 1686, 4851, 1683, 3447, 1680, 5439, 3450, 1677, 1674, 3453,\n 1671, 5442, 4848, 1668, 3456, 1764, 3459, 5469, 1587, 3492, 1608,\n 5463, 3495, 1605, 1602, 3498, 1599, 5466, 4833, 1596, 3501, 1593,\n 3504, 1590, 4836, 1575, 3513, 3519, 4827, 1569, 5475, 1572, 3516,\n 1614],\n dtype='int64')] are in the [columns]"
at:
/databricks/python/lib/python3.7/site-packages/modAL/uncertainty.py in uncertainty_sampling(classifier, X, n_instances, random_tie_break, **uncertainty_measure_kwargs)
157 query_idx = shuffled_argmax(uncertainty, n_instances=n_instances)
158
--> 159 return query_idx, X[query_idx]
It works fine with a smaller input, like:
...
query_idx, _ = my_learner.query(X, n_instances=1)
It seems that query_idx is a plain array for smaller input but a different index representation, Int64Index, when the number of instances or the input is large, and that representation can't be used for indexing rows in X.
Is it possible that this needs to be X.iloc[query_idx]? I don't really know enough pandas to know for sure. Thanks!
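A minimal sketch of the difference, using only pandas and NumPy and three of the positions from the traceback above (modAL itself is not involved here). On a DataFrame, plain `X[idx]` with an array of integers looks the values up among the column labels, while `X.iloc[idx]` selects rows by position. In the three-row example, a queried position of 0 would coincide with the only column label, 0, which may be why that case appears to work.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame([[1], [2], [3]])
df = pd.concat([X] * 2000)  # 6000 rows, a single column labelled 0

# A few of the positions reported in the KeyError above.
query_idx = np.array([1665, 1662, 5412])

# Plain [] with an integer array is a lookup among the column labels,
# so it fails here: the only column label is 0.
try:
    df[query_idx]
    column_lookup_failed = False
except KeyError:
    column_lookup_failed = True

# .iloc selects rows by integer position, like NumPy row indexing.
rows = df.iloc[query_idx]

# The same positions work directly on the underlying NumPy array.
rows_np = df.to_numpy()[query_idx]
```

Passing `df.to_numpy()` to the learner instead of the DataFrame sidesteps the issue as long as the row labels are not needed afterwards.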
Unfortunately, modAL does not currently support pandas, for technical reasons. I have detailed the problem in #20, but the gist is that numpy arrays use row-first indexing, while pandas DataFrames use column-first indexing by default. This led to technical difficulties for which I was unable to implement a proper solution. Of course there are workarounds, but I did not come up with a robust one.
Oh I see, didn't realize as it happened to mostly work fine with DataFrames.
Isn't .iloc how you'd do row indexing in pandas? You could selectively call this if it's a DataFrame.
For my use case numpy arrays would work fine too though.
Yes, .iloc would work, but I didn't want to selectively call X[idx] or X.iloc[idx] based on the datatype. That solution would be very hard to extend for a third datatype. A possible solution would be to introduce a wrapper class modALinput which would handle all the supported datatypes.
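One way such a layer could look, as a rough sketch rather than anything modAL provides: a single row-selection function built on `functools.singledispatch`, so that supporting another datatype means registering one more handler instead of adding another `if` branch. The name `select_rows` is hypothetical and chosen only for illustration.

```python
from functools import singledispatch

import numpy as np
import pandas as pd


@singledispatch
def select_rows(X, idx):
    """Fallback: assume X supports NumPy-style row indexing."""
    return X[idx]


@select_rows.register(pd.DataFrame)
@select_rows.register(pd.Series)
def _select_rows_pandas(X, idx):
    # pandas objects need .iloc for positional row selection.
    return X.iloc[idx]


@select_rows.register(list)
def _select_rows_list(X, idx):
    # Plain Python lists do not accept an array of positions.
    return [X[i] for i in idx]


arr = np.arange(12).reshape(6, 2)
frame = pd.DataFrame(arr, index=list("abcdef"))
idx = np.array([4, 1])

from_array = select_rows(arr, idx)
from_frame = select_rows(frame, idx)
from_list = select_rows([10, 20, 30], [2, 0])
```

A query strategy would then return `select_rows(X, query_idx)` instead of `X[query_idx]`, and the strategy code would not need to know which container it was given.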
If I use .loc, I get the following error: KeyError: "None of [Int64Index([ 160, 161, 162, 163, 164, 165, 166, 167, 168, 169,\n ...\n 1590, 1591, 1592, 1593, 1594, 1595, 1596, 1597, 1598, 1599],\n dtype='int64', name='', length=1440)] are in the [index]"
whereas if I use .iloc, I get the following error: IndexError: positional indexers are out-of-bounds
Could you please let me know how to approach this?
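Without the code that produced these errors the cause can only be guessed at, but both messages are consistent with mixing up positions and labels, or with applying indices to a different frame than the one they were computed on. The sketch below reproduces both errors on a small, made-up pool (`pool`, `feature`, and `query_idx` are illustrative names, with `query_idx` standing in for positions returned by a query strategy) and shows the conversions that avoid them.

```python
import numpy as np
import pandas as pd

# A pool whose index labels are not 0..n-1, e.g. after filtering or a split.
pool = pd.DataFrame({"feature": np.arange(10.0)}, index=np.arange(100, 110))

# Hypothetical positions, standing in for what a query strategy returns.
query_idx = np.array([7, 2])

# Positions -> rows: use .iloc on the same frame that was queried.
picked = pool.iloc[query_idx]

# Positions -> labels, for cases where .loc is needed later.
labels = pool.index[query_idx]
same_rows = pool.loc[labels]

# Using positions directly as labels fails: 7 and 2 are not index labels.
try:
    pool.loc[query_idx]
    loc_failed = False
except KeyError:
    loc_failed = True

# Applying positions from a larger frame to a smaller one is out of bounds.
subset = pool.iloc[:5]
try:
    subset.iloc[query_idx]
    iloc_failed = False
except IndexError:
    iloc_failed = True
```

If the labels are not needed, `pool.reset_index(drop=True)` or `pool.to_numpy()` makes positions and labels coincide, which removes the ambiguity.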