Shakespeare dataset statistics are inconsistent with the paper's description
I used the following command to generate the Shakespeare data:
./preprocess.sh -s niid --sf 1.0 -k 0 -t sample -tf 0.8
The resulting statistics are:
###################################
DATASET: shakespeare
557 users
2177224 samples (total)
3908.84 samples per user (mean)
num_samples (std): 7226.23
num_samples (std/mean): 1.85
num_samples (skewness): 4.38
num_sam num_users
0 336
2000 77
4000 43
6000 17
8000 24
10000 13
12000 14
14000 5
16000 5
18000 2
But the paper reports that the Shakespeare dataset has 2288 users.
Since I am rushing to finish a paper based on the LEAF dataset, could you help fix this problem? Thanks!
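For anyone comparing runs, the summary lines above (users, total, mean, std, std/mean, skewness) can be recomputed from a list of per-user sample counts. The following is a minimal sketch, not LEAF's own stats script: it assumes a population standard deviation and the biased Fisher-Pearson skewness, and the function name `summarize` and the toy counts are hypothetical.

```python
import math


def summarize(counts):
    """Summarize per-user sample counts.

    Assumptions (not confirmed against LEAF's stats script):
    population std (divide by n) and biased Fisher-Pearson skewness.
    """
    n = len(counts)
    if n == 0:
        raise ValueError("need at least one user")
    total = sum(counts)
    mean = total / n
    # Second and third central moments.
    m2 = sum((c - mean) ** 2 for c in counts) / n
    m3 = sum((c - mean) ** 3 for c in counts) / n
    std = math.sqrt(m2)
    skewness = m3 / std ** 3 if std > 0 else 0.0
    return {
        "users": n,
        "samples": total,
        "mean": mean,
        "std": std,
        "std_over_mean": std / mean if mean else float("nan"),
        "skewness": skewness,
    }


if __name__ == "__main__":
    # Toy counts for illustration only; not taken from any dataset.
    for key, value in summarize([120, 80, 4000, 950]).items():
        print(key, value)
```

Running the same function on the counts from two different preprocessing runs makes it easy to see whether a mismatch comes from the number of users, the number of samples, or both.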
After running the command, I got the following results. I wonder what is wrong here?
./preprocess.sh -s iid --sf 1.0 -k 0 -t sample -tf 0.8
DATASET: shakespeare
424 users
1992135 samples (total)
4698.43 samples per user (mean)
num_samples (std): 10122.73
num_samples (std/mean): 2.15
num_samples (skewness): 6.69
num_sam num_users
0 250
2000 66
4000 18
6000 16
8000 18
10000 13
12000 11
14000 2
16000 5
18000 3
The Project Gutenberg EBook we use to extract the Shakespeare data has changed. I just updated the relevant pre-processing script to point to a similar version of the file, but the statistics have indeed changed (they will be updated in a new version of the preprint we are working on). Right now, running the same command as @chaoyanghe, I am getting:
####################################
DATASET: shakespeare
1129 users
4226158 samples (total)
3743.28 samples per user (mean)
num_samples (std): 6212.26
num_samples (std/mean): 1.66
num_samples (skewness): 3.35
num_sam num_users
0 705
2000 126
4000 72
6000 56
8000 38
10000 33
12000 31
14000 16
16000 8
18000 11
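To compare a run against a published figure without reading the output by eye, the header fields can be pulled out of the stats text. This is a rough sketch under the assumption that the output keeps the `DATASET:`, `N users` and `N samples (total)` wording shown in this thread; `parse_stats` and `user_count_gap` are hypothetical helper names, not part of LEAF.

```python
import re


def parse_stats(text):
    """Extract dataset name, user count and total samples from stats output.

    Assumes the wording shown in this thread; works whether the output
    is one field per line or flattened onto a single line.
    """
    name = re.search(r"DATASET:\s*(\S+)", text)
    users = re.search(r"(\d+)\s+users", text)
    total = re.search(r"(\d+)\s+samples \(total\)", text)
    if not (name and users and total):
        raise ValueError("text does not look like stats output")
    return {
        "dataset": name.group(1),
        "users": int(users.group(1)),
        "samples": int(total.group(1)),
    }


def user_count_gap(text, expected_users):
    """Return observed minus expected number of users."""
    return parse_stats(text)["users"] - expected_users


if __name__ == "__main__":
    run = "DATASET: shakespeare 1129 users 4226158 samples (total)"
    print(parse_stats(run))
    # 2288 is the user count quoted from the paper earlier in this thread.
    print(user_count_gap(run, 2288))
```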
@scaldas Hi, thanks for your reply. I had been waiting for it for a long time.
I also found that the FEMNIST statistics do not align with yours:
(venv) (base) chaoyanghe-hostname:femnist chaoyanghe$ sh stats.sh
####################################
DATASET: femnist
3500 users
791913 samples (total)
226.26 samples per user (mean)
num_samples (std): 89.12
num_samples (std/mean): 0.39
num_samples (skewness): 0.77
num_sam num_users
0 1
20 4
40 11
60 5
80 15
100 65
120 122
140 392
160 1237
180 322
200 44
220 52
240 87
260 92
280 116
300 157
320 156
340 181
360 166
380 147
400 87
420 36
440 3
460 1
480 0
Could you also help find the reason? Since I will cite your paper, I need to be able to state that we use the same dataset.
@chaoyanghe I will look into this, but if your work is time-sensitive, consider using the FEMNIST version hosted at TensorFlow Federated (they call it EMNIST). They host their own (slightly different) version and thus don't have the problem of mutating sources (which I believe is the issue here as well).
https://www.tensorflow.org/federated/api_docs/python/tff/simulation/datasets/emnist
We will look into hosting our own version of the datasets in the future.
@scaldas I just tried to get a fresh copy of the FEMNIST dataset and I am only getting 1900 users instead of the previous 3500. Was that dataset changed as well?
@Enehta Unfortunately, at this time we only host preprocessing scripts for data that is hosted elsewhere. If that data mutates, the output of our scripts changes as well. We are actively working on solving this by hosting the datasets ourselves. In the meantime, consider using the FEMNIST version hosted at TensorFlow Federated (they call it EMNIST). They host their own (slightly different) version and thus don't have the problem of mutating sources.
https://www.tensorflow.org/federated/api_docs/python/tff/simulation/datasets/emnist
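Until the data is hosted in one place, one way to notice a mutated source before preprocessing is to pin a checksum of the raw download and compare against it on later runs. A minimal sketch, assuming you recorded the SHA-256 of a known-good copy yourself; the helper names and any file path you pass in are hypothetical, and nothing here is part of the LEAF scripts.

```python
import hashlib


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def source_unchanged(path, expected_hex):
    """True if the file at `path` still matches the pinned digest."""
    return sha256_of_file(path) == expected_hex.strip().lower()


if __name__ == "__main__":
    import os
    import tempfile

    # Stand-in for a raw download; real usage would point at the raw data file.
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(b"raw text of the source file")
        path = tmp.name
    pinned = sha256_of_file(path)           # record this once for a known-good copy
    print(source_unchanged(path, pinned))   # same bytes, so the check passes
    os.remove(path)
```

Reporting the digest alongside the statistics also lets other people confirm they preprocessed the same raw file.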
Interestingly, I found 3500 users and 803267 samples in total in the FEMNIST dataset:
####################################
DATASET: femnist
3500 users
803267 samples (total)
229.50 samples per user (mean)
num_samples (std): 89.03
num_samples (std/mean): 0.39
num_samples (skewness): 0.71