Seemingly random ordering of samples in NMNIST dataset when running on Google Colab

Open verilog-indeed opened this issue 1 year ago • 4 comments

Hello!

While working with the NMNIST dataset in Tonic, we noticed that indexing the dataset object on Colab gives different results from what we used to get there a while back, and different from what we get when running locally. The dataset used to list the samples in ascending order of target, from target '0' to target '9'. However, this order appears to have been completely broken by a recent change in the Colab environment.

I believe I have narrowed down a potential cause: it appears to be tied to a change in the behavior of the os.walk() function, which caused the path enumeration to become unordered. To better explain what I mean, I recreated the portion of the NMNIST constructor that is responsible for organizing the data after it is downloaded:

import os
import tonic

#Downloads the dataset
testset = tonic.datasets.NMNIST(save_to='./', train=False)
#example target
print("Target of zeroth sample:", testset[0][1])

#Approximately recreates a portion of NMNIST.__init__()
#The portion in question is found at line 99 of nmnist.py
file_path = './NMNIST/Test'
data = []
targets = []
for path, dirs, files in os.walk(file_path):
    print(path)
    files.sort()
    for file in files:
        if file.endswith("bin"):
            data.append(path + "/" + file)
            label_number = int(path[-1])
            targets.append(label_number)

I ran this snippet as a cell in a notebook created with VSCode using Python 3.10.11 (Windows 10), which gave the following output:

Target of zeroth sample: 0
./NMNIST/Test
./NMNIST/Test\0
./NMNIST/Test\1
./NMNIST/Test\2
./NMNIST/Test\3
./NMNIST/Test\4
./NMNIST/Test\5
./NMNIST/Test\6
./NMNIST/Test\7
./NMNIST/Test\8
./NMNIST/Test\9

Running the exact same snippet in a fresh Colab notebook using Python 3.10.12 (after installing Tonic and running it once to download the dataset, not shown here) yields the following output:

Target of zeroth sample: 6
./NMNIST/Test
./NMNIST/Test/6
./NMNIST/Test/8
./NMNIST/Test/3
./NMNIST/Test/2
./NMNIST/Test/7
./NMNIST/Test/0
./NMNIST/Test/1
./NMNIST/Test/5
./NMNIST/Test/9
./NMNIST/Test/4

To the best of our knowledge, the dataset elements are still labelled correctly and point to valid .bin files, but we're unfortunately a bit short on time, so further investigation might be necessary. We'd be happy to answer any questions to the best of our abilities if it helps address this issue.
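A quick way to check the claim that samples are still labelled correctly, whatever order they come in, is to compare each target with the name of the folder its file lives in. The following is only a minimal sketch: the helper name mismatched_samples and the example file names are hypothetical, and it assumes the layout shown above, where each .bin file sits in a folder named after its label.

```python
import os

def mismatched_samples(data, targets):
    """Return (path, target) pairs whose parent folder name differs from the target."""
    bad = []
    for path, target in zip(data, targets):
        # The label is assumed to be the name of the folder containing the file.
        folder = os.path.basename(os.path.dirname(path))
        if folder != str(target):
            bad.append((path, target))
    return bad

# Hypothetical file names, listed in a shuffled folder order like the Colab output.
data = ["./NMNIST/Test/6/00001.bin", "./NMNIST/Test/0/00002.bin"]
targets = [6, 0]
print(mismatched_samples(data, targets))  # []
```

An empty list means every target agrees with its folder, so only the ordering is affected, not the labels.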

verilog-indeed avatar May 24 '24 19:05 verilog-indeed

I had this issue of random sample order in the past, which is why I added the files.sort(), but now it seems the order of the folder paths is random as well. I also noticed that your example prints ./NMNIST/Test\0 on Windows, whereas on Google Colab / Linux it prints ./NMNIST/Test/0; note the different path separators. Could you please try adding a dirs.sort() just above files.sort()?
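On the separator difference: os.walk() joins paths with the platform's own separator, so the snippet's path + "/" + file and int(path[-1]) end up depending on how the path happens to be spelled. A separator-independent variant might look like the sketch below. The helper names label_from_dir and sample_path are hypothetical and not part of tonic.

```python
import os

def label_from_dir(path):
    """Return the integer label encoded in the last component of a folder path."""
    # normpath drops any trailing separator, basename takes the last component.
    return int(os.path.basename(os.path.normpath(path)))

def sample_path(directory, filename):
    """Join folder and file name with the platform's own separator."""
    return os.path.join(directory, filename)

print(label_from_dir("./NMNIST/Test/6"))   # 6
print(label_from_dir("./NMNIST/Test/6/"))  # 6
```

Unlike int(path[-1]), this also copes with a trailing separator and with folder names longer than one digit.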

biphasic avatar May 28 '24 19:05 biphasic

I just added the dirs.sort() line as you suggested and it seems to print the directories in the correct order now!

file_path = './NMNIST/Test'
data = []
targets = []
for path, dirs, files in os.walk(file_path):
    print(path)
    dirs.sort()  # in-place sort: os.walk() visits the subfolders in this order
    files.sort()
    for file in files:
        if file.endswith("bin"):
            data.append(path + "/" + file)
            label_number = int(path[-1])
            targets.append(label_number)

Output:

./NMNIST/Test
./NMNIST/Test/0
./NMNIST/Test/1
./NMNIST/Test/2
./NMNIST/Test/3
./NMNIST/Test/4
./NMNIST/Test/5
./NMNIST/Test/6
./NMNIST/Test/7
./NMNIST/Test/8
./NMNIST/Test/9

(I also ran the same snippet locally, just to be safe, and the order was still correct.) Interestingly, re-running the original snippet on Colab today gave me a completely different (still broken) order. I'm not sure what makes it so non-deterministic, but I'm not very experienced with Python OS calls.
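For context on the non-determinism: the Python documentation describes the entries returned by os.scandir() and os.listdir() as being in arbitrary order, and os.walk() builds on these, so the order is whatever the underlying filesystem happens to return. With the default topdown=True, sorting dirs in place is the documented way to impose a visiting order. The sketch below reproduces this on a small temporary folder tree; the helper name walk_sorted and the file name 00001.bin are hypothetical stand-ins, not tonic code.

```python
import os
import tempfile

def walk_sorted(root):
    """Yield (path, files) pairs from os.walk() in a reproducible order."""
    for path, dirs, files in os.walk(root):  # topdown=True by default
        dirs.sort()   # in-place sort decides which subfolder is visited next
        files.sort()  # sorted file names within each folder
        yield path, files

# Small stand-in for the NMNIST/Test layout, created in a shuffled order.
with tempfile.TemporaryDirectory() as root:
    for label in ["6", "8", "3", "0", "9"]:
        os.makedirs(os.path.join(root, label))
        open(os.path.join(root, label, "00001.bin"), "w").close()
    # Skip the first entry, which is the root folder itself.
    visited = [os.path.basename(p) for p, _ in walk_sorted(root)][1:]

print(visited)  # ['0', '3', '6', '8', '9']
```

The sort has to modify the list object that os.walk() yielded (dirs.sort(), not dirs = sorted(dirs)), because os.walk() reads that same list to decide where to descend next.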

verilog-indeed avatar May 29 '24 10:05 verilog-indeed

Great! Can you open a PR with the suggested change in the NMNIST dataset?

biphasic avatar May 29 '24 10:05 biphasic

I'd love to! I'll first try to build tonic with the change and test it, just in case, then I'll open the PR. Thank you for all the hard work you do on tonic, btw!

verilog-indeed avatar May 29 '24 10:05 verilog-indeed