factoextra

Sampling of the rows of data is not uniform

Open kwstat opened this issue 3 years ago • 0 comments

In this function, k is a vector of row indexes identifying the sampled rows of the data. Currently:

k <- round(runif(n, 1, nrow(data)))

However, this does NOT sample rows with equal probability. For example:

table(round(runif(10000, 1, 10)))
#   1    2    3    4    5    6    7    8    9   10 
# 532 1083 1138 1087 1116 1109 1111 1133 1132  559

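The first and last rows of the data are sampled only half as often as the other rows. The reason is that round() sends only the half-width interval [1, 1.5) to 1 and only (9.5, 10] to 10, while every interior index collects a full interval of width 1.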
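The expected proportions can be worked out directly from the interval widths. The following is a small sketch of that arithmetic, not code from the package; N, lower, upper and p are names chosen here for illustration.

```r
# Probability that round(runif(1, 1, N)) returns each index 1..N.
# N = 10 matches the example above; all names here are illustrative.
N <- 10
idx <- seq_len(N)

# Interval of values that rounds to each index, clipped to the range [1, N]
lower <- pmax(idx - 0.5, 1)
upper <- pmin(idx + 0.5, N)

# runif(1, 1, N) has density 1 / (N - 1) on [1, N]
p <- (upper - lower) / (N - 1)
round(p, 4)
# 0.0556 for indexes 1 and 10, 0.1111 for indexes 2 through 9
```

With 10000 draws this predicts roughly 556 hits for the end rows and 1111 for each interior row, which is consistent with the table above.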

The proposed fix samples all rows with equal probability:

table(sample(1:10, 10000, replace=TRUE))
#    1    2    3    4    5    6    7    8    9   10 
# 1032  975 1020 1021  962 1009 1064  949  962 1006
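Applied to the line in question, the change might look like the sketch below. It assumes that data and n are the function's existing arguments and that sampling with replacement is intended, since the current runif() approach can also return the same row more than once. The example values for data and n are stand-ins for illustration only.

```r
# Stand-in values for illustration; in the package these would be
# the function's own arguments.
data <- iris[, 1:4]
n <- 20

# Draw n row indexes, each row of data having equal probability.
# sample.int() avoids building the 1:nrow(data) vector explicitly.
k <- sample.int(nrow(data), n, replace = TRUE)

# The sampled rows
sampled <- data[k, , drop = FALSE]
```

If the function should never pick the same row twice, replace = FALSE would be the alternative, but that would be a behavior change beyond fixing the unequal probabilities.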

kwstat, Apr 01 '21 21:04