AlpacaTag
Potential mistake in Active Learning (Acquisition.py)
I believe there is a bug in this line in acquisition.py (which is used to rank and fetch samples based on the confidence score of your model).
Let me explain:
- Line 41 generates the batches that the model iterates over. Within each batch, the data is reshuffled by length, putting longer sequences (which have fewer padded tokens) at the top (as shown here).
- In order to map scores back to samples, we need to extract the sorting info so we can undo it. This is done a few lines down via `sort_info = data['sort_info']`, which returns a tuple describing the reshuffling that happened in the step above. For example, a tuple of the form `(3, 0, 2, 1)` tells us that the first element in this batch was in fact the 4th one in the original dataset, the second one was the first, and so on.
- Finally, because we want the scores back in the original order, the line I mentioned at the beginning runs `probscores.extend(list(norm_scores[np.array(sort_info)]))`. The goal is to reshuffle the probability/confidence scores so that they follow the original ordering rather than the length-based ordering used within each batch. (A sketch of this pattern follows below.)
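To make the flow concrete, here is a minimal runnable sketch in Python. `iterate_batches` and `fake_confidences` are simplified stand-ins I wrote for illustration, not the repo's actual code; only the final `probscores.extend(...)` line mirrors the line under discussion:

```python
import numpy as np

def iterate_batches(samples, batch_size):
    # Simplified stand-in for the repo's batch generator: within each
    # batch, samples are re-sorted by descending length, and the forward
    # permutation used to do so is stored under 'sort_info'.
    for start in range(0, len(samples), batch_size):
        batch = samples[start:start + batch_size]
        order = sorted(range(len(batch)),
                       key=lambda i: len(batch[i]), reverse=True)
        yield {"words": [batch[i] for i in order],
               "sort_info": tuple(order)}

def fake_confidences(batch):
    # Stand-in for the model: score each sentence by its length so the
    # effect of the permutation is easy to see.
    return np.array([len(s) / 10 for s in batch])

sentences = [["Hello", "World"],
             ["This", "is", "a", "big", "sentence"],
             ["Hello", "World", "."]]

probscores = []
for data in iterate_batches(sentences, batch_size=32):
    norm_scores = fake_confidences(data["words"])
    sort_info = data["sort_info"]
    # The line under discussion: it indexes the scores with the *forward*
    # permutation instead of its inverse.
    probscores.extend(list(norm_scores[np.array(sort_info)]))

print(probscores)  # scores come out as [0.3, 0.2, 0.5] -- misaligned with `sentences`
```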
The issue is that (unless I am missing something obvious) `norm_scores[np.array(sort_info)]` is not what we want. Let me explain with the example below:
- You ask the model to rank `sentences = [["Hello", "World"], ["This", "is", "a", "big", "sentence"], ["Hello", "World", "."]]`.
- These get reshuffled (by length) into `ordered_sentences = [["This", "is", "a", "big", "sentence"], ["Hello", "World", "."], ["Hello", "World"]]`, giving us `sort_info = (1, 2, 0)`.
- The model then scores them. Let's say it gives us `norm_scores = [0.1, 0.2, 0.3]`, meaning a score of 0.1 for `["This", "is", "a", "big", "sentence"]`, 0.2 for `["Hello", "World", "."]`, and 0.3 for `["Hello", "World"]`.
- The current code, `list(norm_scores[np.array(sort_info)])`, reshuffles this into `[0.2, 0.3, 0.1]`. Mapped back onto the original dataset, that means we give a score of 0.2 to `["Hello", "World"]`, 0.3 to `["This", "is", "a", "big", "sentence"]`, and 0.1 to `["Hello", "World", "."]`, which is not what the model actually predicted.

The root of the problem is that `sort_info` holds the indices (produced via argsort) that lead to the sorted array; it does not hold the indices required to unshuffle it. In essence, what we need is the inverse permutation. One proposed solution is to instead use `inverse_sorting = [sort_info.index(i) for i in range(len(sort_info))]` and then `list(norm_scores[np.array(inverse_sorting)])`. In the example above, `inverse_sorting = [2, 0, 1]`, which gives scores of `[0.3, 0.1, 0.2]`: exactly what we want in the original dataset (0.3 for `["Hello", "World"]`, 0.1 for `["This", "is", "a", "big", "sentence"]`, and 0.2 for `["Hello", "World", "."]`). A runnable check follows below.
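For completeness, here is a minimal, runnable check of both index operations using the numbers from the example above; note that `np.argsort` applied to a permutation yields its inverse, so it could replace the list comprehension:

```python
import numpy as np

sort_info = np.array([1, 2, 0])           # forward permutation from the example
norm_scores = np.array([0.1, 0.2, 0.3])   # scores in length-sorted order

# Buggy: applying the forward permutation to the scores again.
print(norm_scores[sort_info])             # [0.2 0.3 0.1]

# Fix: invert the permutation first. These two lines are equivalent,
# since np.argsort applied to a permutation yields its inverse.
inverse_sorting = [list(sort_info).index(i) for i in range(len(sort_info))]
print(norm_scores[np.array(inverse_sorting)])  # [0.3 0.1 0.2]
print(norm_scores[np.argsort(sort_info)])      # [0.3 0.1 0.2]
```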
I stumbled on this bug by noticing that sentences that were exactly the same would be given different confidence scores by the model (because of the mistake in undoing the reshuffle). In any case, the example above should suffice.
Here are some screenshots from a code execution. I have 4 unlabeled examples, namely the ones seen below:

Behind the scenes, we see the following:

The terminal output shows the following:
- The first sample before the shuffling is "Hello world" (as seen in the first picture of this post).
- After the reshuffling, the first sample is "This is a long sentence".
- The sort info is `(1, 2, 0, 3)` (based on the sentences' lengths).
- We then get the normalized scores. Notice how the last two scores are the same, because they belong to the exact same sentence ("Hello world").
- `before norm` is what we have before we attempt to undo the reshuffle.
- `after norm` is what we have after we attempt to undo the reshuffle.
Notice how this is not what we want: after attempting to undo the reshuffle, the first and last samples, which are both the same sentence ("Hello world"), end up with different scores (-1.04 and -1.28).
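The same invariant can be checked mechanically for the 4-sample permutation from this trace. The scores below are placeholders, not the actual values from the screenshot; the point is only that identical sentences must receive identical scores after the un-shuffle:

```python
import numpy as np

sort_info = np.array([1, 2, 0, 3])             # permutation from the trace above
norm_scores = np.array([10., 20., 30., 30.])   # placeholder scores; the two
                                               # identical sentences share 30.

buggy = norm_scores[sort_info]               # [20. 30. 10. 30.]
fixed = norm_scores[np.argsort(sort_info)]   # [30. 10. 20. 30.]

# In the original order, samples 0 and 3 are the same sentence, so they
# must end up with the same score:
print(buggy[0] == buggy[3])   # False -- the mismatch seen in the trace
print(fixed[0] == fixed[3])   # True
```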
Finally, you can see this in the UI as well:

and
