Kaggle_Competition_Treasure
Further questions about the kernel for detecting fake samples in Santander 2019
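First of all, many thanks to 杰少 for sharing so much about Kaggle competitions. As a newcomer to this field, I would like to ask 杰少 about the kernel "List of Fake Samples and Public/Private LB split": could you explain the logic of the two code blocks in the part that finds the sample generators? The code is as follows: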
df_test_real = df_test[real_samples_indexes].copy()

generator_for_each_synthetic_sample = []
# Using 20,000 samples should be enough.
# You can use all of the 100,000 and get the same results (but 5 times slower)
for cur_sample_index in tqdm(synthetic_samples_indexes[:20000]):
    cur_synthetic_sample = df_test[cur_sample_index]
    potential_generators = df_test_real == cur_synthetic_sample

    # A verified generator for a synthetic sample is achieved
    # only if the value of a feature appears only once in the
    # entire real samples set
    features_mask = np.sum(potential_generators, axis=0) == 1
    verified_generators_mask = np.any(potential_generators[:, features_mask], axis=1)
    verified_generators_for_sample = real_samples_indexes[np.argwhere(verified_generators_mask)[:, 0]]
    generator_for_each_synthetic_sample.append(set(verified_generators_for_sample))
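
One possible reading of the loop body above, sketched on made-up toy data rather than the competition files. The arrays `real`, `real_idx`, and `synthetic` are hypothetical stand-ins for `df_test_real`, `real_samples_indexes`, and `cur_synthetic_sample`. The sketch assumes, as the kernel's comments suggest, that a synthetic row is assembled from feature values taken from real rows, so a value that matches exactly one real row points to that row as a "generator".

```python
import numpy as np

# Hypothetical toy data: 3 real rows, 4 features each.
real = np.array([
    [1.0, 5.0, 9.0, 2.0],
    [3.0, 5.0, 7.0, 2.0],
    [4.0, 6.0, 8.0, 2.0],
])
real_idx = np.array([10, 11, 12])            # assumed positions of the real rows in the test set
synthetic = np.array([3.0, 5.0, 8.0, 2.0])   # one hypothetical synthetic row

# Broadcasting compares the synthetic row with every real row, feature by feature.
potential = real == synthetic                # boolean array of shape (3, 4)

# Keep only the features whose value matches exactly one real row.
# Features 1 and 3 match several rows, so they cannot identify a generator.
features_mask = potential.sum(axis=0) == 1   # [True, False, True, False]

# A real row is a verified generator if it supplies at least one such unique match.
verified_mask = np.any(potential[:, features_mask], axis=1)   # [False, True, True]
verified = {int(i) for i in real_idx[np.argwhere(verified_mask)[:, 0]]}
```

In this toy case `verified` contains 11 and 12: row 11 is the only real row with the value 3.0 in feature 0, and row 12 is the only one with 8.0 in feature 2.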

public_LB = generator_for_each_synthetic_sample[0]
for x in tqdm(generator_for_each_synthetic_sample):
    if public_LB.intersection(x):
        public_LB = public_LB.union(x)

private_LB = generator_for_each_synthetic_sample[1]
for x in tqdm(generator_for_each_synthetic_sample):
    if private_LB.intersection(x):
        private_LB = private_LB.union(x)

print(len(public_LB))
print(len(private_LB))
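
The second block appears to grow two groups of real rows by merging every generator set that overlaps the group built so far. Below is a minimal sketch of that merging step on made-up sets. `generator_sets` and `grow_group` are hypothetical names introduced only for illustration. Which group corresponds to the public or the private leaderboard is the kernel's own assumption (it seeds the groups with the first and second synthetic samples) and is not something this sketch shows.

```python
# Hypothetical generator sets for five synthetic samples (integers stand for real-row indexes).
generator_sets = [{1, 2}, {7, 8}, {2, 3}, {8, 9}, {3, 4}]

def grow_group(seed, sets):
    """Single pass: absorb every set that shares a member with the current group."""
    group = set(seed)
    for s in sets:
        if group & s:      # same check as `group.intersection(s)` being non-empty
            group |= s     # same effect as `group = group.union(s)`
    return group

group_a = grow_group(generator_sets[0], generator_sets)   # seeded like public_LB
group_b = grow_group(generator_sets[1], generator_sets)   # seeded like private_LB
```

Here `group_a` ends up as {1, 2, 3, 4} and `group_b` as {7, 8, 9}, with no overlap between them. A single pass like this depends on the order of the sets, so it only finds the complete groups when overlapping sets turn up after the group has already reached them. With 20,000 samples that is plausibly the case, but it is not guaranteed in general.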
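I spent a long time trying to understand these two code blocks but still could not work them out. I also posted my question in the comments under the original kernel but received no answer. So I hope 杰少 can explain them here. Thank you very much.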