
.query on large DataFrame yields "None of [Int64Index([...] dtype='int64')] are in the [columns]"

Open srowen opened this issue 6 years ago • 4 comments

This simple example:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from modAL.models import ActiveLearner

# Tiny training set: three rows, one feature
X = pd.DataFrame([[1],[2],[3]])
y = pd.Series([True, False, False])
my_learner = ActiveLearner(estimator=LogisticRegression(), X_training=X, y_training=y)
# Pool of 6000 rows; pd.concat keeps the original index labels 0, 1, 2 repeated
df = pd.concat([X]*2000)
query_idx, _ = my_learner.query(df, n_instances=100)

yields:

KeyError: "None of [Int64Index([1665, 1662, 5412, 3399, 1758, 4866, 1755, 3402, 1752, 5415, 3405,\n            1749, 1746, 3408, 1743, 5418, 4863, 1740, 3411, 1737, 3414, 1734,\n            5421, 1731, 3417, 1728, 4860, 3420, 1725, 5424, 1722, 3423, 1719,\n            3426, 1716, 5427, 1713, 4857, 3429, 1710, 3432, 1707, 5430, 1704,\n            3435, 1701, 4854, 1698, 5433, 3438, 1695, 3441, 1692, 1689, 5436,\n            3444, 1686, 4851, 1683, 3447, 1680, 5439, 3450, 1677, 1674, 3453,\n            1671, 5442, 4848, 1668, 3456, 1764, 3459, 5469, 1587, 3492, 1608,\n            5463, 3495, 1605, 1602, 3498, 1599, 5466, 4833, 1596, 3501, 1593,\n            3504, 1590, 4836, 1575, 3513, 3519, 4827, 1569, 5475, 1572, 3516,\n            1614],\n           dtype='int64')] are in the [columns]"

at:

/databricks/python/lib/python3.7/site-packages/modAL/uncertainty.py in uncertainty_sampling(classifier, X, n_instances, random_tie_break, **uncertainty_measure_kwargs)
    157         query_idx = shuffled_argmax(uncertainty, n_instances=n_instances)
    158 
--> 159     return query_idx, X[query_idx]

It works fine with a smaller input, like:

...
query_idx, _ = my_learner.query(X, n_instances=1)

It seems like query_idx is a plain array for smaller input, but a different index representation, Int64Index, when the number of instances or the input is large, and that can't be used for indexing rows in X.

Is it possible that this needs to be X.iloc[query_idx]? I don't really know enough pandas to know for sure. Thanks!

srowen avatar Nov 07 '19 04:11 srowen

Unfortunately, modAL currently does not support pandas, for technical reasons. I have detailed the problem in #20, but the gist is that numpy arrays use row-first indexing, while pandas DataFrames use column-first indexing by default. This led to technical difficulties for which I was unable to implement a proper solution. Of course there would be workarounds, but I did not come up with a robust one.
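To make the row-first versus column-first difference concrete, here is a minimal sketch using only numpy and pandas (no modAL code involved). The variable names are made up for illustration, and `idx` stands in for the kind of positional array a query strategy returns.

```python
import numpy as np
import pandas as pd

X_df = pd.DataFrame([[1], [2], [3]])   # one column labelled 0, three rows
X_np = X_df.to_numpy()
idx = np.array([2, 1])                 # row positions (illustrative values)

rows_np = X_np[idx]                    # NumPy: [] selects rows 2 and 1
rows_df = X_df.iloc[idx]               # pandas: .iloc selects rows by position

# Plain [] on a DataFrame treats the values as column labels instead.
try:
    X_df[idx]
    column_lookup_failed = False
except KeyError:
    column_lookup_failed = True        # there are no columns labelled 2 or 1

# If every value happens to match a column label, [] succeeds silently
# but returns whole columns rather than the requested rows.
whole_column = X_df[np.array([0])]     # shape (3, 1): all of column 0
```

The last line may be why small inputs can appear to work: an index of 0 matches the only column label, so no error is raised even though a column, not a row, comes back.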

cosmic-cortex avatar Nov 07 '19 08:11 cosmic-cortex

Oh I see, I didn't realize, as it happened to mostly work fine with DataFrames. Isn't .iloc how you'd do row indexing in pandas? You could selectively call this if it's a DataFrame. For my use case numpy arrays would work fine too, though.

srowen avatar Nov 07 '19 10:11 srowen

Yes, .iloc would work, but I didn't want to selectively call X[idx] or X.iloc[idx] based on the datatype. That solution would be very hard to extend for a third datatype. A possible solution would be to introduce a wrapper class modALinput which would handle all the supported datatypes.
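For reference, the type-based dispatch discussed here could look roughly like the sketch below. `take_rows` is a hypothetical helper name, not part of modAL's API, and it only covers numpy arrays and pandas objects; each further datatype would need another branch, which is the extensibility concern raised above.

```python
import numpy as np
import pandas as pd

def take_rows(X, idx):
    """Return the rows of X at integer positions idx (hypothetical helper)."""
    if isinstance(X, (pd.DataFrame, pd.Series)):
        # pandas: positional row selection must go through .iloc
        return X.iloc[idx]
    # numpy arrays (and anything else with row-first [] indexing)
    return X[idx]

X = pd.DataFrame([[1], [2], [3]])
pool = pd.concat([X] * 2000)           # 6000 rows, as in the original report
idx = np.array([1665, 5412])

picked_df = take_rows(pool, idx)               # DataFrame with 2 rows
picked_np = take_rows(pool.to_numpy(), idx)    # ndarray with 2 rows
```

Both calls select the same two rows, so a caller would not need to know which container it was given.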

cosmic-cortex avatar Nov 07 '19 11:11 cosmic-cortex

If I use .loc, I get the error below: KeyError: "None of [Int64Index([ 160, 161, 162, 163, 164, 165, 166, 167, 168, 169,\n ...\n 1590, 1591, 1592, 1593, 1594, 1595, 1596, 1597, 1598, 1599],\n dtype='int64', name='', length=1440)] are in the [index]"

whereas if I use .iloc, I get the following error: IndexError: positional indexers are out-of-bounds

Could you please let me know how to approach this?

vidushi-chouksey avatar Sep 28 '21 07:09 vidushi-chouksey
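
The two errors in the last comment match the general difference between .loc (label lookup) and .iloc (position lookup). The sketch below reproduces both error types on the frame from the original report; it is an illustration under that assumption, not a diagnosis of the commenter's data, which is not shown in this thread.

```python
import numpy as np
import pandas as pd

X = pd.DataFrame([[1], [2], [3]])
df = pd.concat([X] * 2000)             # 6000 rows; index labels 0, 1, 2 repeat

positions = np.array([1665, 5412])     # valid positions, but not valid labels

by_position = df.iloc[positions]       # works: both are < len(df)

try:
    df.loc[positions]                  # .loc looks for labels 1665 and 5412
    loc_failed = False
except KeyError:
    loc_failed = True

try:
    df.iloc[np.array([6000])]          # position 6000 is past the last row
    iloc_failed = False
except IndexError:
    iloc_failed = True

# After resetting the index, labels equal positions and both lookups agree.
clean = df.reset_index(drop=True)
same_rows = (clean.loc[positions].to_numpy().tolist()
             == clean.iloc[positions].to_numpy().tolist())
```

If the indices come from a query, they are positions within the object that was passed to the query, so it is worth checking that .iloc is applied to that same object and not to a shorter or filtered frame.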