Different results on the same arrays
Hi, thanks for your good work.
I would like to try MSID for GAN evaluation, but found that the metric is extremely unstable. I just cloned your repo and performed a simple experiment:
>>> import numpy as np
>>> from msid import msid_score
>>> x0 = np.random.randn(1000, 10)
>>> x1 = np.random.randn(1000, 10)
>>> for _ in range(20):
...     print('MSID(x0, x1)', msid_score(x0, x1))
...
MSID(x0, x1) 11.612343854772956
MSID(x0, x1) 7.671366682093675
MSID(x0, x1) 1.8117880712326395
MSID(x0, x1) 6.205967034975149
MSID(x0, x1) 1.9430385102291492
MSID(x0, x1) 2.467981390832042
MSID(x0, x1) 4.359253678580822
MSID(x0, x1) 5.705092418121339
MSID(x0, x1) 7.084854325912502
MSID(x0, x1) 8.925101261419211
MSID(x0, x1) 2.6563495105769963
MSID(x0, x1) 6.67076587871034
MSID(x0, x1) 0.9609276170219742
MSID(x0, x1) 4.134198699891847
MSID(x0, x1) 2.061919358106404
MSID(x0, x1) 5.4849779235186045
MSID(x0, x1) 2.5738576367295107
MSID(x0, x1) 3.597934029471011
MSID(x0, x1) 1.0966421686877845
MSID(x0, x1) 13.116242604321098
Next, here are the results of MSID(x0, x0):
>>> for _ in range(20):
...     print('MSID(x0, x0)', msid_score(x0, x0))
...
MSID(x0, x0) 1.8842238243396114
MSID(x0, x0) 6.653959832884025
MSID(x0, x0) 2.896296044612713
MSID(x0, x0) 1.7874406866486243
MSID(x0, x0) 2.212118637843133
MSID(x0, x0) 5.352864291155722
MSID(x0, x0) 4.492301054567285
MSID(x0, x0) 1.3662656634830224
MSID(x0, x0) 2.2663591630199416
MSID(x0, x0) 4.5750399290303045
MSID(x0, x0) 4.094359241800621
MSID(x0, x0) 2.4488511702991795
MSID(x0, x0) 5.929584568192836
MSID(x0, x0) 7.591811322838174
MSID(x0, x0) 7.372357733571717
MSID(x0, x0) 5.6968201645123075
MSID(x0, x0) 1.4797792557903116
MSID(x0, x0) 1.1783656760547234
MSID(x0, x0) 7.6904604926511295
MSID(x0, x0) 5.483936755815125
I expect that a metric for GAN evaluation should:
- Give the same (or at least similar) score when computed twice on the same data.
- Satisfy MSID(x0, x0) < MSID(x0, x1) for any x0 and x1 with x0 != x1.
Could you please check your implementation? Thank you.
The two random arrays are drawn from the same distribution, so their shape is (approximately) the same. The ~3-5 scores you see are due to the randomness of the approximation.
@xgfs Thanks for your work, I have the same question.
I am also running into a similar problem. I was trying this metric for evaluating the difference between two layers of CNNs, and I found that even if I give similar/identical arrays to the function msid_score(), I get different results when it is run multiple times. Also, for identical arrays I was expecting the score to be lower than for different arrays, but this is not the case.
Could you please elaborate on this further? Also, when you run it on similar arrays multiple times, the score keeps changing.
The results are based on a random approximation; the default value of niters=100 gives reasonable results for many datasets (as used in the paper). If you want more stable predictions, you can either increase niters to a larger value, or run the function multiple times and take the average/std of the results. These two approaches are mathematically equivalent; the first one should be a bit faster.
Hope that helps.
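For example, a minimal sketch of both options (niters=1000 and the 10 repeats are illustrative values, not recommendations):

import numpy as np
from msid import msid_score

# Two samples from the same distribution; the scores still fluctuate
# because the descriptor is computed via a stochastic approximation.
x0 = np.random.randn(1000, 10)
x1 = np.random.randn(1000, 10)

# Option 1: raise niters above the default of 100 to average more
# stochastic estimates inside a single call.
print(msid_score(x0, x1, niters=1000))

# Option 2: keep the default niters and average several independent runs,
# which also gives an empirical spread of the estimate.
runs = [msid_score(x0, x1) for _ in range(10)]
print('mean:', np.mean(runs), 'std:', np.std(runs))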
Hi, I tried both of your proposed solutions; the results came closer to being stable, but they still fluctuate to different values even for the same arrays.
Do you mean that the metric has a multi-modal distribution? If so, this is not expected, and we haven't observed this effect. If you can publish the arrays in question, I can try to investigate.
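In the meantime, one quick check is to collect many scores on the same input and look at their histogram (a minimal sketch; matplotlib and the run/bin counts are my choices, not part of msid):

import numpy as np
import matplotlib.pyplot as plt
from msid import msid_score

x0 = np.random.randn(1000, 10)

# Repeated scores on identical inputs; only the stochastic
# approximation changes between runs.
scores = np.array([msid_score(x0, x0) for _ in range(100)])
print('mean:', scores.mean(), 'std:', scores.std())

# One roughly bell-shaped cluster suggests plain estimator variance;
# several separated clusters would indicate multi-modality.
plt.hist(scores, bins=20)
plt.show()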
model_resnet_dct_layer1_3_conv2.zip
This is the pickled file of the array I am using. It's a feature map from a layer, so I just transpose it from (1, 128, 56, 56) to (3136, 128) and then calculate the score. I am also pasting the code below for your reference.
act = np.transpose(acts1, (0, 2, 3, 1))
num_datapoints, h, w, channels = act.shape
f_acts = act.reshape((num_datapoints * h * w, channels))

for i in range(20):
    print(msid_score(f_acts, f_acts, niters=1000))
I also find the score too unstable to be used for GAN evaluation. Table 1 in the paper indeed shows that IMD has a larger variance than FID/KID, but the variance I observed is significantly larger even than the variance reported in Table 1.