
[Active Learning] Small text tutorial hits errors when running on local

Open frascuchon opened this issue 3 years ago • 3 comments

The small-text tutorial should be reviewed, since some odd errors are hit when running it on a local machine with some customizations (setting a different initial batch size, no CUDA installed, ...).

frascuchon avatar Aug 22 '22 14:08 frascuchon

Just spotted this. Which errors exactly? Is this something in the scope of tutorial or can this be improved upstream in small-text?

chschroeder avatar Sep 15 '22 10:09 chschroeder

Hi @chschroeder, sorry for the late response and thank you for asking.

I've recently launched the tutorial from a clean Python environment and found several problems:

  • Disabling the CUDA configuration works fine, but querying new records takes far too long. Not sure if we can improve it somehow.
  • The dataset feature names have changed. The label-coarse field is now called coarse_label. We can handle this easily.
  • For the initial batch (batch_id=0), labeling all records with the same label raises the error:

    Updating with batch_id 0 ...
    Traceback (most recent call last):
      File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/rubrix/listeners/listener.py", line 219, in __run_action__
        return self.action(*args, *action_args, **kwargs)
      File "/var/folders/8f/mt_m87_d19q3zcnyr6dmf0pw0000gn/T/ipykernel_85718/568527003.py", line 30, in active_learning_loop
        active_learner.initialize_data(indices, y)
      File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/active_learner.py", line 151, in initialize_data
        self._retrain(indices_validation=indices_validation)
      File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/active_learner.py", line 390, in _retrain
        self._clf.fit(dataset)
      File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/integrations/transformers/classifiers/classification.py", line 330, in fit
        return self._fit_main(sub_train, sub_valid, fit_optimizer, fit_scheduler)
      File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/integrations/transformers/classifiers/classification.py", line 340, in _fit_main
        raise ValueError('Conflicting information about the number of classes: '
    ValueError: Conflicting information about the number of classes: expected: 6, encountered: 4

    This situation does not occur for subsequent batches (batch_id >= 1).

I really don't know what the behaviour should be or how we could avoid it. Any ideas? @dvsrepo @dcfidalgo @chschroeder

frascuchon avatar Sep 21 '22 15:09 frascuchon

No worries, I missed that issue here as well. Feel free to ping me in such cases as you did now :).

  • Disabling the CUDA configuration works fine, but querying new records takes far too long. Not sure if we can improve it somehow.

    Training transformers on the CPU is usually so time-consuming that you don't want to do it.

    Suggestion: fall back to a very small transformer model such as bert-medium or bert-tiny if CUDA is not available. I would still print a warning telling users that this is intended to run on a GPU (maybe provide a Colab link?) and is now running in a CPU-only fallback mode, which is slow (and yields worse results when a smaller model is used). (Check once whether the classification results are still acceptable after that.) If it's still too slow after these changes, you can try lowering the number of epochs (by setting num_epochs in the factory's kwargs argument).
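A minimal sketch of this fallback idea, assuming torch is used for CUDA detection; the model names (bert-base-uncased, prajjwal1/bert-tiny) are illustrative, not necessarily the tutorial's exact choices:

```python
import importlib.util
import warnings

def pick_model_name(preferred="bert-base-uncased", fallback="prajjwal1/bert-tiny"):
    """Return a transformer model name, falling back to a tiny model
    when no CUDA device is available (hypothetical helper)."""
    cuda_available = False
    if importlib.util.find_spec("torch") is not None:
        import torch
        cuda_available = torch.cuda.is_available()
    if not cuda_available:
        warnings.warn(
            f"No CUDA device found; falling back to '{fallback}'. "
            "This tutorial is intended to run on a GPU; CPU-only mode "
            "is slow and a smaller model may yield worse results."
        )
        return fallback
    return preferred
```

The chosen name would then be passed wherever the tutorial configures its transformer model; num_epochs could additionally be lowered via the factory's kwargs if training remains too slow.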

  • For the initial batch (batch_id=0), labeling all records with the same label raises the error: [...]

    First: this problem is caused by a safety check. The intention behind that check was that every class must occur at least once (i.e., the number of encountered classes must match the number of classes of the model). In this case the check was a good thing, since it told you that the initialization could be better. In reality, this might not be achievable every time, especially in multi-label scenarios. Therefore I removed this check in the current dev version of small-text.

    For now, you can switch from random initialization to balanced random initialization:

    from small_text.initialization import random_initialization
    [...]
    # Randomly draw an initial subset from the data pool
    initial_indices = random_initialization(dataset, NUM_SAMPLES)
    

    -->

    from small_text.initialization import random_initialization_balanced
    [...]
    # Randomly draw a *class-balanced* initial subset from the data pool
    initial_indices = random_initialization_balanced(dataset, NUM_SAMPLES)
    

    random_initialization_balanced provides an initial set whose label distribution is balanced over the classes (or as close to balanced as possible). This also means that every class occurs at least once if your initial set size is greater than or equal to the number of classes.

    With small-text 1.1.0 this error will not be raised anymore.
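To illustrate what balanced initialization does, here is a toy re-implementation of the idea (not small-text's actual code), assuming integer class labels:

```python
import numpy as np

rng = np.random.default_rng(42)

def balanced_sample(y, n_samples):
    """Draw n_samples indices so that each class receives an
    (almost) equal share -- a toy version of balanced sampling."""
    classes = np.unique(y)
    per_class = np.full(len(classes), n_samples // len(classes))
    per_class[: n_samples % len(classes)] += 1  # spread the remainder
    chosen = []
    for cls, k in zip(classes, per_class):
        pool = np.flatnonzero(y == cls)
        chosen.append(rng.choice(pool, size=min(k, len(pool)), replace=False))
    return np.concatenate(chosen)

# 6 classes, as in the tutorial's TREC-style setup
y = np.repeat(np.arange(6), 50)
indices = balanced_sample(y, 20)
```

With 20 samples over 6 classes, each class contributes 3 or 4 indices, so every class is guaranteed to appear at least once.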

chschroeder avatar Sep 21 '22 15:09 chschroeder

Thanks a lot @chschroeder

I've tried using the balanced random initialization, but it hits the following error (INITIAL_SAMPLES=20):

RuntimeError                              Traceback (most recent call last)
Cell In [6], line 12
      9 NUM_SAMPLES = 5
     11 # Randomly draw an initial subset from the data pool
---> 12 initial_indices = random_initialization_balanced(dataset, INITIAL_SAMPLES)

File /usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/initialization/strategies.py:84, in random_initialization_balanced(y, n_samples)
     82     raise NotImplementedError()
     83 else:
---> 84     return balanced_sampling(y, n_samples=n_samples)

File /usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/data/sampling.py:120, in balanced_sampling(y, n_samples)
    117     y = np.array(y)
    119 # num classes according to the labels
--> 120 num_classes = np.max(y) + 1
    121 # num classes encountered
    122 num_classes_present = len(np.unique(y))

File <__array_function__ internals>:180, in amax(*args, **kwargs)

File /usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/numpy/core/fromnumeric.py:2793, in amax(a, axis, out, keepdims, initial, where)
   2677 @array_function_dispatch(_amax_dispatcher)
   2678 def amax(a, axis=None, out=None, keepdims=np._NoValue, initial=np._NoValue,
   2679          where=np._NoValue):
   2680     """
   2681     Return the maximum of an array or maximum along an axis.
   2682 
   (...)
   2791     5
   2792     """
-> 2793     return _wrapreduction(a, np.maximum, 'max', axis, None, out,
   2794                           keepdims=keepdims, initial=initial, where=where)

File /usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/numpy/core/fromnumeric.py:86, in _wrapreduction(obj, ufunc, method, axis, dtype, out, **kwargs)
     83         else:
     84             return reduction(axis=axis, out=out, **passkwargs)
---> 86 return ufunc.reduce(obj, axis, dtype, out, **passkwargs)

RuntimeError: Boolean value of Tensor with more than one value is ambiguous

frascuchon avatar Sep 22 '22 07:09 frascuchon

Oh, my bad :). This initialization method takes the labels as its argument (dataset -> dataset.y). I just replaced the function name and forgot to adapt this part.

initial_indices = random_initialization_balanced(dataset.y, NUM_SAMPLES)

chschroeder avatar Sep 22 '22 08:09 chschroeder

Great!

I've changed some code from the tutorial (the training set was initialized with LABEL_UNLABELLED values, so it was not possible to pass its labels to the random_initialization_balanced function).
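For reference, the kind of change described above might look like this sketch, assuming a LABEL_UNLABELLED sentinel of -1 (the tutorial's actual sentinel may differ):

```python
import numpy as np

LABEL_UNLABELLED = -1  # hypothetical sentinel for unlabeled records

# toy label array: -1 marks records that have no label yet
y = np.array([-1, 0, -1, 1, 2, 0, -1, 1])

# keep only labeled records before balanced initialization, and remember
# their positions so sampled indices can be mapped back to the full pool
labeled_indices = np.flatnonzero(y != LABEL_UNLABELLED)
y_labeled = y[labeled_indices]
```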

Now it's working, but for the initial annotated batch, passing only one label (all records in the batch annotated with the same value) hits a similar error:

Updating with batch_id 0 ...
Traceback (most recent call last):
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/rubrix/listeners/listener.py", line 219, in __run_action__
    return self.action(*args, *action_args, **kwargs)
  File "/var/folders/8f/mt_m87_d19q3zcnyr6dmf0pw0000gn/T/ipykernel_11659/568527003.py", line 30, in active_learning_loop
    active_learner.initialize_data(indices, y)
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/active_learner.py", line 151, in initialize_data
    self._retrain(indices_validation=indices_validation)
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/active_learner.py", line 390, in _retrain
    self._clf.fit(dataset)
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/integrations/transformers/classifiers/classification.py", line 330, in fit
    return self._fit_main(sub_train, sub_valid, fit_optimizer, fit_scheduler)
  File "/usr/local/anaconda3/envs/local-test/lib/python3.9/site-packages/small_text/integrations/transformers/classifiers/classification.py", line 340, in _fit_main
    raise ValueError('Conflicting information about the number of classes: '
ValueError: Conflicting information about the number of classes: expected: 6, encountered: 1

Any idea to handle this?

frascuchon avatar Sep 22 '22 09:09 frascuchon

Hm, in this case the required number of classes is of course impossible to find. Using small-text==1.1.0 would make the symptoms go away. (How urgent is this fix? I could prioritize the release depending on that.)

However, this might also give a false sense of security. Under these circumstances the model has not seen all classes, so the uncertainty estimates may be suboptimal, and thus the queried examples may not be as useful as they otherwise could be.

In my own active learning settings with few labels, I always argue that it is reasonable to require 1-2 examples per class from the user before starting the active learning loop. I know that this is not applicable to use cases with thousands of labels. There are so-called cold-start approaches which try to handle this setting, but as far as I know there is no single best approach there either, and each method brings its own advantages/disadvantages.

In the end, this is a decision for your rubrix workflows as well. Possible solutions (non-exhaustive):

a) Do nothing (once the error is fixed) and trust the user.
b) Show a warning.
c) Show a warning and advise better settings dynamically (e.g., request the user to provide examples for all classes, or recommend another query strategy).
d) Describe the problem and give recommendations in the documentation. (It might also be my responsibility to do this in the small-text docs, regardless of what you decide for the workflows here.)
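Option (b) could be as simple as a coverage check before the loop starts; a sketch with a hypothetical helper (not part of rubrix or small-text):

```python
import warnings
import numpy as np

def warn_if_classes_missing(y, num_classes):
    """Warn when the initial labeled batch does not cover every class,
    instead of failing hard (hypothetical helper)."""
    seen = np.unique(y)
    if len(seen) < num_classes:
        missing = sorted(set(range(num_classes)) - set(int(c) for c in seen))
        warnings.warn(
            f"Initial batch covers only {len(seen)}/{num_classes} classes "
            f"(missing: {missing}); uncertainty-based queries may be "
            "unreliable until every class has at least one example."
        )
```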

chschroeder avatar Sep 22 '22 11:09 chschroeder

Hi @chschroeder

Thanks a lot for your responses. It's not an urgent problem. We'll just wait until the new small-text version is released.

Anyway, it would be good practice to include a balanced annotated dataset for the initial batches.

frascuchon avatar Sep 22 '22 12:09 frascuchon