
Active learning query strategies

rth opened this issue 7 years ago • 0 comments

While implementing query strategies for iterative text categorization (active learning) is probably beyond the scope of FreeDiscovery, this issue aims to ensure that the output of the FreeDiscovery API contains sufficient information to apply active learning query strategies to it.

Here is a brief overview of possible query approaches (adapted from Wikipedia):

  1. Uncertainty sampling: label those points for which the current model is least certain what the correct output should be (for an SVM, this could be the distance to the separating hyperplane).

Since we return the decision_function (typically in [-eps, +eps]), this would mean selecting the points whose decision_function is closest to zero (i.e. with the lowest absolute value).
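As a minimal sketch of this selection step (with made-up random data standing in for the decision_function output that FreeDiscovery would return, and scikit-learn's LinearSVC standing in for the categorization model):

```python
import numpy as np
from sklearn.svm import LinearSVC

# Toy stand-in data: 20 labeled points, a pool of 100 unlabeled ones
rng = np.random.RandomState(42)
X_labeled = rng.randn(20, 5)
y_labeled = (X_labeled[:, 0] > 0).astype(int)
X_pool = rng.randn(100, 5)

clf = LinearSVC(C=1.0).fit(X_labeled, y_labeled)
scores = clf.decision_function(X_pool)  # signed distance to the hyperplane

# Uncertainty sampling: query the points closest to the decision boundary,
# i.e. those with the smallest absolute decision_function value
n_queries = 5
query_idx = np.argsort(np.abs(scores))[:n_queries]
```

The same `argsort` on the absolute value would apply directly to the decision_function values returned by the API, without retraining anything.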

  2. Query by committee: train a variety of models on the current labeled data and have them vote on the output for unlabeled data; label those points on which the "committee" disagrees the most.

Currently this would mean running multiple categorizations with different algorithms and combining the results.

  3. Expected model change: label those points that would most change the current model.
  4. Expected error reduction: label those points that would most reduce the model's generalization error.
  5. Variance reduction: label those points that would minimize output variance, which is one of the components of error.

None of these are implemented. They also sound quite computationally expensive, as they would require a large number of training/scoring iterations.
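To illustrate where that cost comes from, here is a naive sketch of expected error reduction (strategy 4) on toy data: for each candidate point and each possible label, the model is refit and the remaining pool uncertainty is measured, so the number of refits grows as pool size × number of classes. All names and data here are illustrative placeholders, not FreeDiscovery API.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(1)
X_labeled = rng.randn(15, 3)
y_labeled = (X_labeled[:, 0] > 0).astype(int)
X_pool = rng.randn(30, 3)

def pool_uncertainty(X_l, y_l, X_p):
    """Total uncertainty (1 - max class probability) over the pool
    under a model refit on the augmented labeled set."""
    clf = LogisticRegression().fit(X_l, y_l)
    proba = clf.predict_proba(X_p)
    return (1.0 - proba.max(axis=1)).sum()

base = LogisticRegression().fit(X_labeled, y_labeled)
p_pool = base.predict_proba(X_pool)

# For every candidate point, refit once per possible label and average the
# resulting pool uncertainty, weighted by the current model's probabilities.
# This is already 2 * len(X_pool) = 60 full refits for this tiny example.
risks = []
for i in range(len(X_pool)):
    risk = 0.0
    for label in (0, 1):
        X_aug = np.vstack([X_labeled, X_pool[i]])
        y_aug = np.append(y_labeled, label)
        risk += p_pool[i, label] * pool_uncertainty(X_aug, y_aug, X_pool)
    risks.append(risk)

query_idx = int(np.argmin(risks))  # point expected to most reduce pool error
```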

  6. Balance exploration and exploitation: the choice of examples to label is seen as a dilemma between exploration and exploitation over the data space representation. This strategy manages the compromise by modelling the active learning problem as a contextual bandit problem. For example, Bouneffouf et al. propose a sequential algorithm named Active Thompson Sampling (ATS) which, in each round, assigns a sampling distribution on the pool, samples one point from this distribution, and queries the oracle for this sample point's label.
  7. Exponentiated gradient exploration for active learning: a sequential algorithm named exponentiated gradient (EG)-active that can improve any active learning algorithm by an optimal random exploration.

Probably not applicable.

In addition, there are a few existing active learning libraries in Python, namely libact and iitml/AL, and it could be worth considering whether integrating them with FreeDiscovery would be beneficial...

rth · Mar 21 '17 11:03