AlpacaTag
Potential mistake in Active Learning (Acquisition.py)
I believe there is a bug in this line in acquisition.py (which is used to rank and fetch samples based on the confidence score of your model).
Let me explain:
- Line 41 generates the batches that the model iterates over. Within each batch, the data is reshuffled by length, putting longer sequences (which have fewer padded tokens) at the top (as shown here).
- In order to map scores back to samples, we need to extract the sorting info so we can undo it. This is done a few lines down via `sort_info = data['sort_info']`, which returns a tuple describing the reshuffling that happened in the step above. For example, a tuple of the form `(3, 0, 2, 1)` tells us that the first element in this batch was in fact the 4th one in the original dataset, the second one was the first, and so on.
- Finally, because we want the scores back in the original order, the line I mentioned at the beginning runs `probscores.extend(list(norm_scores[np.array(sort_info)]))`. The goal is to reshuffle the probability/confidence scores so that they follow the original ordering rather than the length-based ordering used within each batch. (A sketch of this pattern follows below.)
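To make the flow concrete, here is a minimal runnable sketch in Python. `iterate_batches` and `fake_confidences` are simplified stand-ins I wrote for illustration, not the repo's actual code; only the final `probscores.extend(...)` line mirrors the line under discussion:

```python
import numpy as np

def iterate_batches(samples, batch_size):
    # Simplified stand-in for the repo's batch generator: within each
    # batch, samples are re-sorted by descending length, and the forward
    # permutation used to do so is stored under 'sort_info'.
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        order = sorted(range(len(batch)),
                       key=lambda i: len(batch[i]), reverse=True)
        yield {"words": [batch[i] for i in order],
               "sort_info": tuple(order)}

def fake_confidences(batch):
    # Stand-in for the model: score each sentence by its length so the
    # effect of the permutation is easy to see.
    return np.array([len(s) / 10 for s in batch])

sentences = [["Hello", "World"],
             ["This", "is", "a", "big", "sentence"],
             ["Hello", "World", "."]]

probscores = []
for data in iterate_batches(sentences, batch_size=32):
    norm_scores = fake_confidences(data["words"])
    sort_info = data["sort_info"]
    # The line under discussion: it indexes the scores with the *forward*
    # permutation instead of its inverse.
    probscores.extend(list(norm_scores[np.array(sort_info)]))

print(probscores)  # scores come out as [0.3, 0.2, 0.5] -- misaligned with `sentences`
```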
The issue is that (unless I am missing something obvious) `norm_scores[np.array(sort_info)]` is not what we want. Let me explain with the example below:
- You ask the model to rank `sentences = [["Hello", "World"], ["This", "is", "a", "big", "sentence"], ["Hello", "World", "."]]`.
- These get reshuffled (by length) into `ordered_sentences = [["This", "is", "a", "big", "sentence"], ["Hello", "World", "."], ["Hello", "World"]]`, giving us `sort_info = (1, 2, 0)`.
- The model then scores them. Let's say it gives us `norm_scores = [0.1, 0.2, 0.3]`, meaning a score of 0.1 for `["This", "is", "a", "big", "sentence"]`, 0.2 for `["Hello", "World", "."]`, and 0.3 for `["Hello", "World"]`.
- The current code, `list(norm_scores[np.array(sort_info)])`, reshuffles this into `[0.2, 0.3, 0.1]`. Mapped back onto the original dataset, that means we give a score of 0.2 to `["Hello", "World"]`, 0.3 to `["This", "is", "a", "big", "sentence"]`, and 0.1 to `["Hello", "World", "."]`, which is not what the model actually predicted.

The root of the problem is that `sort_info` holds the indices (produced via argsort) that lead to the sorted array; it does not hold the indices required to unshuffle it. In essence, what we need is the inverse permutation. One proposed solution is to instead use `inverse_sorting = [sort_info.index(i) for i in range(len(sort_info))]` and then `list(norm_scores[np.array(inverse_sorting)])`. In the example above, `inverse_sorting = [2, 0, 1]`, which gives scores of `[0.3, 0.1, 0.2]`: exactly what we want in the original dataset (0.3 for `["Hello", "World"]`, 0.1 for `["This", "is", "a", "big", "sentence"]`, and 0.2 for `["Hello", "World", "."]`). A runnable check follows below.
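For completeness, here is a minimal, runnable check of both index operations using the numbers from the example above; note that `np.argsort` applied to a permutation yields its inverse, so it could replace the list comprehension:

```python
import numpy as np

sort_info = np.array([1, 2, 0])           # forward permutation from the example
norm_scores = np.array([0.1, 0.2, 0.3])   # scores in length-sorted order

# Buggy: applying the forward permutation to the scores again.
print(norm_scores[sort_info])             # [0.2 0.3 0.1]

# Fix: invert the permutation first. These two lines are equivalent,
# since np.argsort applied to a permutation yields its inverse.
inverse_sorting = [list(sort_info).index(i) for i in range(len(sort_info))]
print(norm_scores[np.array(inverse_sorting)])  # [0.3 0.1 0.2]
print(norm_scores[np.argsort(sort_info)])      # [0.3 0.1 0.2]
```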
I stumbled on this bug by noticing that sentences that were exactly the same would be given different confidence scores by the model (because of the mistake in undoing the reshuffle). In any case, the example above should suffice.
Here are some screenshots from a code execution. I have 4 unlabeled examples, namely the ones seen below:

Behind the scenes, we see the following:

The terminal output shows the following:
- The first sample before the shuffling is "Hello world" (as seen in the first picture of this post).
- After the reshuffling, the first sample is "This is a long sentence".
- The sort info is `(1, 2, 0, 3)` (based on the sentences' lengths).
- We then get the normalized scores. Notice how the last two scores are the same, because they belong to the exact same sentence ("Hello world").
- `before norm` is what we have before we attempt to undo the reshuffle.
- `after norm` is what we have after we attempt to undo the reshuffle.
Notice how this is not what we want: after attempting to undo the reshuffle, the first and last samples, which are both the same sentence ("Hello world"), end up with different scores (-1.04 and -1.28).
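The same invariant can be checked mechanically for the 4-sample permutation from this trace. The scores below are placeholders, not the actual values from the screenshot; the point is only that identical sentences must receive identical scores after the un-shuffle:

```python
import numpy as np

sort_info = np.array([1, 2, 0, 3])             # permutation from the trace above
norm_scores = np.array([10., 20., 30., 30.])   # placeholder scores; the two
                                               # identical sentences share 30.

buggy = norm_scores[sort_info]               # [20. 30. 10. 30.]
fixed = norm_scores[np.argsort(sort_info)]   # [30. 10. 20. 30.]

# In the original order, samples 0 and 3 are the same sentence, so they
# must end up with the same score:
print(buggy[0] == buggy[3])   # False -- the mismatch seen in the trace
print(fixed[0] == fixed[3])   # True
```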
Finally, you can see this in the UI as well:

and
