SLED
Step 6 of Algorithm 1 code clarification
The implementation applies a second torch.topk when computing m_i^(n), whereas Step 6 of Algorithm 1 in the paper defines m_i^(n) for every i_k, i.e. for all of the top-k tokens selected from the final layer. Could you please clarify whether this is intentional or an oversight? The relevant lines are:
layer_dot_results = F.cosine_similarity(candidate_gradients_expanded, layer_divergence_expanded, dim=2)
layer_topk_values, layer_topk_indices = torch.topk(layer_dot_results, evolution_scale)
layer_topk_topk_indices = topk_indices[layer_topk_indices]
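To make the question concrete, here is a minimal plain-Python sketch of how I read the index mapping in those lines. It is not the repository's code: the `topk` helper, the `second_topk` wrapper, and all numeric values are hypothetical, and I am assuming `topk_indices` holds the vocabulary ids returned by the first top-k over the final-layer logits. Under that assumption, a second top-k of the same size as the first only reorders the candidates, while a smaller one drops some of the i_k.

```python
def topk(values, k):
    """Return (top values, their positions), largest first,
    mimicking torch.topk on a 1-D tensor."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)[:k]
    return [values[i] for i in order], order


# Hypothetical final-layer logits over a 6-token vocabulary.
final_logits = [0.1, 2.0, 0.5, 1.5, -1.0, 0.9]
k = 3

# First top-k: vocabulary ids of the k most likely tokens (the i_k).
_, topk_indices = topk(final_logits, k)

# Hypothetical per-candidate similarity scores for one early layer,
# one entry per element of topk_indices (stand-in for layer_dot_results).
layer_dot_results = [0.2, -0.4, 0.7]


def second_topk(evolution_scale):
    """Apply the second top-k and map positions back to vocabulary ids,
    as in topk_indices[layer_topk_indices]."""
    _, layer_topk_indices = topk(layer_dot_results, evolution_scale)
    return [topk_indices[j] for j in layer_topk_indices]


same_size = second_topk(3)  # same size as the first top-k: a reordering
smaller = second_topk(2)    # smaller than the first top-k: a strict subset

print(topk_indices, same_size, smaller)
```

If the two top-k sizes are equal in the actual code, the second call would seem to act only as a sort, which would be consistent with Algorithm 1; if they can differ, some i_k would be excluded from m_i^(n).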