vecalign

error in the make_del_knob function?

Open frankang opened this issue 4 years ago • 1 comments

In the make_del_knob function, when the size product (e_size * f_size) is smaller than sample_size (20000 by default), the script ends up calculating the similarity score for all combinations of the src and tgt sentences, plus the remainder (20000 - e_size * f_size). Is this behavior a mistake or an intended feature? It creates a biased histogram of the "real" distribution by scoring the 0:0 indexed sentence pair multiple times.

if e_size * f_size < sample_size:
    # don't sample, just compute full matrix
    sample_size = e_size * f_size
    x_idxs = np.zeros(sample_size, dtype=np.int32)
    y_idxs = np.zeros(sample_size, dtype=np.int32)
    c = 0
    for ii in range(e_size):
        for jj in range(f_size):
            x_idxs[c] = ii
            y_idxs[c] = jj
            c += 1
else:
    # get random samples
    x_idxs = np.random.choice(range(e_size), size=sample_size, replace=True).astype(np.int32)
    y_idxs = np.random.choice(range(f_size), size=sample_size, replace=True).astype(np.int32)

# output
random_scores = np.empty(sample_size, dtype=np.float32)

score_path(x_idxs, y_idxs,
           e_laser_norms, f_laser_norms,
           e_laser, f_laser,
           random_scores, )

frankang • Sep 14 '20 12:09

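What do you mean by "plus the remainder (20000 - e_size * f_size)"? Which lines are you referring to?

The variable sample_size holds the correct size in both cases. If e_size * f_size < sample_size is true, sample_size is overwritten with e_size * f_size before x_idxs and y_idxs are allocated, so both arrays have exactly e_size * f_size entries and every slot is filled by the nested loop. No zero-filled remainder is left over, and the 0:0 pair is scored only once.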
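To check this claim, here is a minimal sketch of the index-selection branch on its own. The helper name build_sample_indices and its seed parameter are hypothetical and not part of vecalign. The sketch swaps the nested loop for np.repeat/np.tile, which produce the same (ii, jj) order, and uses np.random.default_rng instead of np.random.choice. It only counts index pairs and does not call score_path.

```python
import numpy as np


def build_sample_indices(e_size, f_size, sample_size=20000, seed=0):
    """Hypothetical helper mirroring the index selection quoted in the issue."""
    if e_size * f_size < sample_size:
        # Small input: shrink sample_size first, then list every (ii, jj) pair once.
        sample_size = e_size * f_size
        # Same order as the nested loop: ii is the outer index, jj the inner one.
        x_idxs = np.repeat(np.arange(e_size, dtype=np.int32), f_size)
        y_idxs = np.tile(np.arange(f_size, dtype=np.int32), e_size)
    else:
        # Large input: draw sample_size random pairs with replacement.
        rng = np.random.default_rng(seed)
        x_idxs = rng.integers(0, e_size, size=sample_size).astype(np.int32)
        y_idxs = rng.integers(0, f_size, size=sample_size).astype(np.int32)
    return x_idxs, y_idxs, sample_size


x_idxs, y_idxs, n = build_sample_indices(e_size=3, f_size=4)
pairs = list(zip(x_idxs.tolist(), y_idxs.tolist()))
# sample size, number of pairs, number of distinct pairs, occurrences of (0, 0)
print(n, len(pairs), len(set(pairs)), pairs.count((0, 0)))  # prints: 12 12 12 1
```

With 3 source and 4 target sentences, all 12 pairs appear exactly once and (0, 0) is not repeated, which matches the reading that sample_size is overwritten before the arrays are allocated.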

janisdd • Apr 25 '24 13:04