Kaggle_Competition_Treasure

Further questions about the kernel for detecting fake samples in Santander 2019

Open BovenPeng opened this issue 5 years ago • 0 comments

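First of all, many thanks to 杰少 for sharing so much about Kaggle competitions. As a newcomer to this field, I would like to ask about the kernel "List of Fake Samples and Public/Private LB split": could you explain the logic of the two code blocks in the part that finds the generators of the samples? The code is as follows: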

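# Imports the snippet relies on. df_test (a NumPy array), real_samples_indexes and
# synthetic_samples_indexes are assumed to be defined earlier in the kernel.
import numpy as np
from tqdm import tqdm

df_test_real = df_test[real_samples_indexes].copy()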

generator_for_each_synthetic_sample = []
# Using 20,000 samples should be enough. 
# You can use all of the 100,000 and get the same results (but 5 times slower)
for cur_sample_index in tqdm(synthetic_samples_indexes[:20000]):
    cur_synthetic_sample = df_test[cur_sample_index]
    potential_generators = df_test_real == cur_synthetic_sample

    # A verified generator for a synthetic sample is achieved
    # only if the value of a feature appears only once in the
    # entire real samples set
    features_mask = np.sum(potential_generators, axis=0) == 1
    verified_generators_mask = np.any(potential_generators[:, features_mask], axis=1)
    verified_generators_for_sample = real_samples_indexes[np.argwhere(verified_generators_mask)[:, 0]]
    generator_for_each_synthetic_sample.append(set(verified_generators_for_sample))

public_LB = generator_for_each_synthetic_sample[0]
for x in tqdm(generator_for_each_synthetic_sample):
    if public_LB.intersection(x):
        public_LB = public_LB.union(x)

private_LB = generator_for_each_synthetic_sample[1]
for x in tqdm(generator_for_each_synthetic_sample):
    if private_LB.intersection(x):
        private_LB = private_LB.union(x)
        
print(len(public_LB))
print(len(private_LB))
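
To make the question concrete, here is a minimal sketch of how I currently read the two blocks, run on a tiny made-up array. The toy data, the function names `verified_generators` and `grow_group`, and the variables `data`, `real_idx` and `synth_idx` are all hypothetical and are not part of the kernel. The sketch assumes that each synthetic row was built by copying feature values from real rows, which is my interpretation of the quoted code rather than something the kernel author has confirmed.

```python
import numpy as np

# Hypothetical toy "test set": rows 0-3 play the real samples,
# rows 4-6 play synthetic samples assembled from real feature values.
data = np.array([
    [1.0, 5.0, 9.0],   # 0 real
    [2.0, 6.0, 9.0],   # 1 real
    [3.0, 7.0, 8.0],   # 2 real
    [4.0, 7.0, 8.5],   # 3 real
    [1.0, 6.0, 9.0],   # 4 synthetic (values taken from rows 0 and 1)
    [3.0, 7.0, 8.5],   # 5 synthetic (values taken from rows 2 and 3)
    [2.0, 5.0, 9.0],   # 6 synthetic (values taken from rows 0 and 1)
])
real_idx = np.array([0, 1, 2, 3])
synth_idx = np.array([4, 5, 6])


def verified_generators(data, real_idx, sample_idx):
    """Real rows that must have contributed a value to one synthetic row."""
    real = data[real_idx]
    # Broadcasting: compare the synthetic row against every real row,
    # giving a boolean matrix of shape (n_real, n_features).
    matches = real == data[sample_idx]
    # Keep only features whose value matches exactly one real row,
    # so the source of that value is unambiguous.
    unambiguous = matches.sum(axis=0) == 1
    # A real row is a verified generator if it supplies at least one
    # of those unambiguous values.
    is_generator = np.any(matches[:, unambiguous], axis=1)
    return set(real_idx[np.flatnonzero(is_generator)].tolist())


def grow_group(seed, generator_sets):
    """One pass that merges every generator set overlapping the group."""
    group = set(seed)
    for gens in generator_sets:
        if group & gens:
            group |= gens
    return group


generator_sets = [verified_generators(data, real_idx, i) for i in synth_idx]
public_group = grow_group(generator_sets[0], generator_sets)
private_group = grow_group(generator_sets[1], generator_sets)

print([sorted(g) for g in generator_sets])
print(sorted(public_group), sorted(private_group))
```

In this toy case the generator sets fall into two disjoint groups of real rows, which is what the second block seems to rely on when it starts one group from element 0 and another from element 1 of the list. Note that a single pass like `grow_group` only merges sets that overlap the group at the moment they are visited, so it depends on the order of the list.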

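I have spent a long time trying to understand these two code blocks, but I still have not figured them out. I also posted my question as a comment under the original kernel and did not get an answer. So I hope 杰少 can explain them a little here. Thank you very much.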

BovenPeng, Apr 15 '19 08:04