
Why does the teach method fit the data?

Open Efesencan opened this issue 4 years ago • 4 comments

When you teach your active learner the queried instance and its label, it does not just add the new instance to the training dataset; it also refits the model on the extended dataset. In my case this is unnecessary: when I learn the label of one instance, I automatically learn the labels of about 300 other instances (the number can vary), since they share the same label. I therefore have to teach 300 new instances to the active learner at each query iteration, which takes a lot of time because of the fit method. For this reason, I believe that fitting should be performed only in the query method.

Efesencan avatar Aug 03 '20 20:08 Efesencan

You could argue that one may want to call the predict method right after teaching, and that this is why fit is called inside the teach method. But as I described above, that approach is problematic. At the very least, there should be an option to control whether fitting is performed.

Efesencan avatar Aug 03 '20 20:08 Efesencan

To only add training data without refitting the estimator, you can use the ActiveLearner._add_training_data method. (Here is the implementation: https://github.com/modAL-python/modAL/blob/master/modAL/models/base.py#L68-L92)

This is a "private" method, so I didn't include it in the documentation, but the method itself is documented, so it should be easy to use.
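As a rough sketch of that workflow (not a definitive recipe): append labels with `_add_training_data` as they arrive, then refit once. It assumes the modAL version linked above, where the learner stores its data in `X_training` / `y_training` and the public `fit` refits on whatever is passed in. `StandInLearner` and both helper functions are hypothetical, included only so the snippet runs without modAL installed.

```python
import numpy as np


def add_labels_without_refit(learner, X_new, y_new):
    """Append labeled instances to the learner's training set, skipping the fit."""
    # _add_training_data is the private modAL method mentioned above.
    learner._add_training_data(X_new, y_new)


def refit_on_known(learner):
    """Refit the estimator once on everything collected so far."""
    # Assumption: the learner exposes X_training / y_training and a public fit().
    learner.fit(learner.X_training, learner.y_training)


class StandInLearner:
    """Hypothetical stand-in that mimics only the calls used in this sketch.

    Swap in a real modAL ActiveLearner in practice.
    """

    def __init__(self):
        self.X_training = None
        self.y_training = None
        self.fit_calls = 0

    def _add_training_data(self, X, y):
        if self.X_training is None:
            self.X_training, self.y_training = X, y
        else:
            self.X_training = np.concatenate((self.X_training, X))
            self.y_training = np.concatenate((self.y_training, y))

    def fit(self, X, y):
        self.X_training, self.y_training = X, y
        self.fit_calls += 1  # count the expensive step


learner = StandInLearner()
for label in (0, 1, 0):
    # One queried label reveals a whole group of labels (300 in the issue; 3 here).
    X_group = np.full((3, 2), float(label))
    y_group = np.full(3, label)
    add_labels_without_refit(learner, X_group, y_group)

refit_on_known(learner)  # a single fit covers all three groups
```

With a real ActiveLearner, the refit would need to happen before the next `learner.query` call, since query strategies typically rely on the fitted estimator.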

I don't understand your use case and argument exactly. What is the underlying model you use?

If, by querying a single label, you learn multiple other labels indirectly, then you can manually add these to X_new and y_new before calling the teach method. This is roughly what I mean:

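import numpy as np

query_idx, X_query = learner.query(X_pool)

# ...
# get the label for X_query somehow and store it in y_query
# ...

X_other, y_other = ... # these are the instances and labels you find indirectly after querying a single label

# stack the queried instance and the indirectly labeled ones along the first axis
X_new = np.concatenate((X_query, X_other))
y_new = np.concatenate((y_query, y_other))

# a single teach call, so the estimator is refit only once for the whole batch
learner.teach(X_new, y_new)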

cosmic-cortex avatar Aug 04 '20 07:08 cosmic-cortex

Okay, I get your point. Another question: after I make a query (learn the label) and teach the instances at each query iteration, should I delete the queried instance from X_pool and its corresponding label from y_pool? Or is that unnecessary?

Efesencan avatar Aug 04 '20 10:08 Efesencan

Yes, it should be deleted manually. Otherwise, the query strategy might select data which is already part of your training data, hence possibly leading to model bias in some scenarios.
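A minimal sketch of the manual deletion with NumPy, assuming X_pool and y_pool are arrays and query_idx is the index array returned by learner.query. All values below, including other_idx for the indirectly labeled instances, are hypothetical and hard-coded so the snippet runs without modAL:

```python
import numpy as np

# Toy pool: 6 instances with 2 features each, plus their held-back labels.
X_pool = np.arange(12, dtype=float).reshape(6, 2)
y_pool = np.array([0, 1, 0, 1, 0, 1])

# Assumption: in a real loop this comes from learner.query(X_pool).
query_idx = np.array([2])
# Hypothetical indices of other pool instances known to share the queried label.
other_idx = np.array([0, 4])

# Everything that is about to be taught in this iteration.
taught_idx = np.concatenate((query_idx, other_idx))
X_new, y_new = X_pool[taught_idx], y_pool[taught_idx]
# learner.teach(X_new, y_new) would go here.

# Drop the taught rows so the query strategy cannot select them again.
X_pool = np.delete(X_pool, taught_idx, axis=0)
y_pool = np.delete(y_pool, taught_idx)
```

Note that np.delete returns a new array, so the result has to be assigned back to X_pool and y_pool.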

There is a PR by @talolard, who proposed a data manager class but eventually decided to put it into a completely new package. I don't know the status of this, but it would be very useful for this case.

cosmic-cortex avatar Aug 04 '20 11:08 cosmic-cortex