Unsupervised-Classification
Train on own dataset
Hello, thanks for sharing this work. Are there instructions for training on our own dataset? Thanks
+1
I have been working on this for a while. After a huge hiccup I decided to start over and document it. So far I have been able to get pretext working with a custom dataset and the semantic clustering step is currently running. If you want I can share my document once everything is up and running. Hoping to have it done this weekend. @fernandorovai @showkeyjar
Come on! Looking forward to it!
+1
The most important thing you need to do is implement your own dataset.py. You can copy an existing one, like cifar10.py, into the data directory. Then edit configs/env.yml to set the output folder.
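To make the "copy cifar10.py" advice concrete, here is a minimal sketch of what such a dataset class needs to provide, assuming the repo's convention of returning a dict with 'image', 'target' and 'meta' keys. The class name, the injectable loader, and the use of 255 as a dummy "ignore" target (borrowed from the repo's handling of unlabeled STL-10 images) are my assumptions, not the repo's API:

```python
import os

class CustomDataset:
    """Minimal stand-in for a data/cifar10.py-style dataset (names assumed).

    Returns a dict with 'image', 'target' and 'meta'; since the data is
    unlabeled, every target is the dummy ignore index 255."""

    def __init__(self, root, transform=None, loader=None):
        self.transform = transform
        # loader is injectable so this sketch has no hard PIL dependency;
        # in practice pass e.g. lambda p: Image.open(p).convert('RGB')
        self.loader = loader if loader is not None else (lambda p: p)
        exts = ('.png', '.jpg', '.jpeg')
        self.paths = sorted(
            os.path.join(root, f) for f in os.listdir(root)
            if f.lower().endswith(exts))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        img = self.loader(self.paths[index])
        if self.transform is not None:
            img = self.transform(img)
        return {'image': img,
                'target': 255,  # dummy: no real labels exist
                'meta': {'index': index, 'path': self.paths[index]}}
```

You would still need to register the new class in the repo's dataset factory and point configs/env.yml at your data root.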
I see, but those examples all have labels. I'm looking for an example with no labels at all.
It isn't a polished product but this is what I have: https://github.com/hkhailee/GOLDEN_unsupervised/blob/main/TrainingYourOwnDataset.md
@hkhailee Very good, and thank you. I'll give it a try.
I can't open the link
That's the correct path. If you're having issues, try searching directly for the GOLDEN_unsupervised repository under user hkhailee; the file is TrainingYourOwnDataset.md on the main branch.
You can also click on a user's name (such as mine), view their repositories, and open GOLDEN_unsupervised, or manually copy and paste the link into another browser window.
@hkhailee The text of your link is correct, but the embedded URL is not (people who click are directed somewhere else). You can change it via "Add a link."
The dataloader has a target variable. Since I am using unlabeled data, I do not have labels, so I am setting target = 0 to make the code work. The simclr and scan steps run, but selflabel throws the error "Mask in MaskedCrossEntropyLoss is all zeros". How will the model work on a completely unlabeled training set?
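As far as I can tell, that error is not caused by the dummy targets: in the self-label step, the mask marks samples whose maximum softmax probability exceeds a confidence threshold (0.99 in the repo's published configs, if I read them correctly), and the assertion fires when no sample in the batch is that confident yet. A plain-Python sketch of that masking logic (the threshold value is an assumption from the configs):

```python
def confident_mask(softmax_probs, threshold=0.99):
    """Mark samples whose top class probability exceeds the threshold.

    softmax_probs: list of per-sample probability lists."""
    mask = [max(p) > threshold for p in softmax_probs]
    if not any(mask):
        # This is the condition behind "Mask in MaskedCrossEntropyLoss
        # is all zeros": the model is not confident about any sample yet.
        raise ValueError('Mask in MaskedCrossEntropyLoss is all zeros')
    return mask
```

So lowering the confidence threshold in the selflabel config, or training the SCAN step longer so the cluster head becomes more confident, is usually what resolves it.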
I'm having a problem similar to @MotiBaadror's. My dataset is very complex and has no labels. How do I cluster my data like this?
@cilemafacan When I run simclr.py on my dataset, this step saves an npz file. When I open that file, the nearest neighbors are the same for all examples. Did you encounter the same problem?
I am creating a file for my own dataset, similar to the stl10 dataset file in the data folder. My data doesn't have any labels, so I assign the default value 2 to every target instead of real labels. This way I can start simclr training. When I examine the resulting .npy file, the nearest neighbors are different. The .npy file looks like this:
array([[ 265, 2049,  109, 1353, 2028,  532,  395,  144, 2084, 1067,  942, 1343,  830, 1054, 2191,  189, 1239, 1738,  501,  123,  619],
       [ 144, 1414, 1428, 1310, 1064, 1954,  424,   95, 1520,  334, 2145, 1641,  323, 1670, 1543,  538,  920, 1180, 1540, 2050, 1814],
       [ 145,  279,  921, 1939,  179,  713,  861,  720, 1489, 1005, 1283, 1170,  413,  405,  260,  273, 2305, 2198, 1564, 1818,  289],
       [1604,  259, 1300,  532, 1680, 1817, 2184, 1428, 1576,  315,  174, 1983, 1128, 1753, 1733,   40,  893,  889,  748, 1255, 2046], ...
I'm not sure I'm doing it right. What I don't understand is the contrastive_evaluate step on line 120 of simclr.py; I don't understand exactly what is being done there. I'm getting an error at this step because my data is unlabeled.
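As far as I can tell, contrastive_evaluate runs a nearest-neighbour classifier over the feature memory bank and scores it against the ground-truth targets, so it only measures pretext quality and is not needed for training itself. A stripped-down sketch of that idea (1-NN instead of the repo's weighted kNN, plain lists instead of tensors) shows exactly where labels enter:

```python
def knn_accuracy(train_feats, train_labels, val_feats, val_labels):
    """1-NN accuracy with dot-product similarity (features assumed
    L2-normalised, as in the memory bank).

    Labels are only used here, for scoring - with fully unlabeled data
    this evaluation can simply be skipped."""
    correct = 0
    for feat, label in zip(val_feats, val_labels):
        # similarity of this validation feature to every bank feature
        sims = [sum(a * b for a, b in zip(feat, t)) for t in train_feats]
        nearest = sims.index(max(sims))
        correct += int(train_labels[nearest] == label)
    return correct / len(val_feats)
```

So with dummy targets the reported accuracy is meaningless; commenting the call out (or guarding it behind a "has labels" flag) should not affect the learned representation.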
How did you guys set up your models and __getitem__()?
Here is my __getitem__:
def __getitem__(self, index):
    # sample = self.dataset.__getitem__(index)
    # image = sample['image']
    sample = {}
    image = self.dataset.__getitem__(index)[0]
    sample['target'] = 1  # dummy label; the dataset is unlabeled
    sample['image'] = self.image_transform(image)
    sample['image_augmented'] = self.augmentation_transform(image)
    return sample
This is how I am defining my dataset:

dataset = torchvision.datasets.ImageFolder('data/my_data', transform=transform)

I am running simclr.py on my dataset. I do not have labels, so I am setting target = 1 to make the code run.
Not sure, since I only worked with moco.py, but for mine:

Calling the dataset:

elif p['train_db_name'] == 'rico-20':
    from data.rico import RICO20
    subset_file = ''
    dataset = RICO20(subset_file=subset_file, split='train', transform=transform)
RICO20:

class RICO20(datasets.ImageFolder):
    def __init__(self, subset_file, root=MyPath.db_root_dir('rico-20'), split='train', transform=None):
        super(RICO20, self).__init__(root=os.path.join(root, '%s/' % (split)),
                                     transform=None)
        self.transform = transform
        self.split = split
        self.resize = tf.Resize(256)

    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, index):
        path, target = self.imgs[index]
        with open(path, 'rb') as f:
            img = Image.open(f).convert('RGB')
        im_size = img.size
        img = self.resize(img)
        if self.transform is not None:
            img = self.transform(img)
        out = {'image': img, 'target': target, 'meta': {'im_size': im_size, 'index': index, 'path': path}}
        return out

    def get_image(self, index):
        path, target = self.imgs[index]
        with open(path, 'rb') as f:
            img = Image.open(f).convert('RGB')
        img = self.resize(img)
        return img
My model parameters for pretext:

setup: moco  # MoCo is used here
backbone: resnet50
model_kwargs:
  head: mlp
  features_dim: 128
train_db_name: rico-20
val_db_name: rico-20
num_classes: 20
temperature: 0.07
batch_size: 128
num_workers: 8
transformation_kwargs:
  crop_size: 224
  normalize:
    mean: [0.485, 0.456, 0.406]
    std: [0.229, 0.224, 0.225]
I separated my data into 3 groups: train, test and val. You do not have a val set (mine was only 1k images). Could it be overfitting your clusters?
Also, I listed 20 classes, but after clustering and inspecting everything visually, only 19 unique clusters were found; beyond that, clustered images start to repeat (kind of cool).
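That "19 of 20 clusters" observation can be checked programmatically instead of visually. A small sketch (my own helper, not part of the repo) that summarises how many of the available cluster ids actually receive images:

```python
from collections import Counter

def cluster_usage(assignments, num_clusters):
    """Summarise how many of the available clusters are actually used.

    assignments: predicted cluster id per image (e.g. argmax of the
    SCAN head's output). Returns (number used, list of empty ids)."""
    counts = Counter(assignments)
    used = len(counts)
    empty = [c for c in range(num_clusters) if c not in counts]
    return used, empty
```

An empty cluster is a common sign that num_classes in the config is higher than the number of visually distinct groups in the data.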
What will the target be in your __getitem__? Do you have labels?
In the repo you created for RICO, you pass the subset file path in the moco file. Similarly, I would need labels to create a subset file, but my dataset has no labels. Are you assigning a label to the train data in your __getitem__? In the __getitem__ I wrote for myself, I set the label to the integer 2, but only so simclr would run; I don't actually have such a label.
elif p['train_db_name'] == 'rico-20':
    from data.rico import RICO20
    subset_file = '/bsuhome/hkiesecker/scratch/imageClassification/GOLDEN/UnsupervisedClassification/data/rico_subsets/%s.txt' % (p['train_db_name'])
    dataset = RICO20(subset_file=subset_file, split='train', transform=transform)
The subset file in RICO20 is never used; I never call RICO20_sub, which would use one. There is a piece of code in data/stl.py that labels all unlabeled images with 255 and uses that as their target. When I was using moco, I had 66k unlabeled images and 1k labeled images to test my results against.
if self.labels is not None:
    img, target = self.data[index], int(self.labels[index])
    class_name = self.classes[target]
else:
    img, target = self.data[index], 255  # 255 is an ignore index
    class_name = 'unlabeled'
With moco, the target comes from the name of the folder the image is in. My train, val and test images are all in subfolders, e.g. train/1 and test/1; my 1k labeled images are also in subfolders with real names, such as val/bare and val/gallery. These names aren't important: I could have used train/20 and test/20, and labeled my validation folders 0-19.
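This is why the folder names don't matter: torchvision's ImageFolder derives integer targets by sorting the subfolder names and numbering them, so any set of names produces valid (if arbitrary) targets. A stdlib-only sketch of that mapping, as I understand ImageFolder's behaviour:

```python
import os

def folder_targets(root):
    """Mimic torchvision.datasets.ImageFolder's labelling: each
    subfolder of root becomes a class, indexed by sorted folder name."""
    classes = sorted(d for d in os.listdir(root)
                     if os.path.isdir(os.path.join(root, d)))
    return {name: idx for idx, name in enumerate(classes)}
```

So val/bare and val/gallery simply become targets 0 and 1; renaming the folders only permutes the integers.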
So you are using the 1k labeled enRICO images as the val dataset. Did I get that right?
Has anyone created a better repository for training on an unlabeled dataset?
Hi @hkhailee, @cilemafacan and @MotiBaadror,
I got the same error (Mask in MaskedCrossEntropyLoss is all zeros) while trying selflabel. Is there a way to run selflabel without a labeled validation dataset?
Thank you in advance!
For me (at the moment with labeled data), this dataset class works:

import sys, os
from PIL import Image
import cv2
from torch.utils.data import Dataset

sys.path.append(os.getcwd())
class OwnDataset(Dataset):
    def __init__(self, img_paths, transform=None, class_names=['bla'], im_size=128):
        self.img_paths = img_paths
        self.transform = transform
        self.class_names = class_names
        self.im_size = im_size

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, idx):
        img_filepath = self.img_paths[idx]
        img = cv2.imread(img_filepath)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (self.im_size, self.im_size))
        img = Image.fromarray(img)
        if self.transform is not None:
            img = self.transform(img)
        class_name = os.path.basename(os.path.dirname(img_filepath))
        target = self.class_names.index(class_name)
        out = {'image': img, 'target': target, 'meta': {'im_size': self.im_size, 'class_name': self.class_names}}
        return out

    def get_image(self, idx):
        img_filepath = self.img_paths[idx]
        img = cv2.imread(img_filepath)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (self.im_size, self.im_size))
        img = Image.fromarray(img)
        if self.transform is not None:
            img = self.transform(img)
        return img
For training with no class names, adding this works:

if self.class_names is not None:
    class_name = os.path.basename(os.path.dirname(img_filepath))
    target = self.class_names.index(class_name)
else:
    target = 0
    class_name = 'No class'

However, model evaluation without labels is still an issue...
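On evaluating without labels: one common label-free proxy is a silhouette-style score, which only needs the features and the predicted cluster assignments. A simplified plain-Python sketch (the real silhouette score, e.g. sklearn.metrics.silhouette_score, averages per-point scores the same way; this version is just for illustration):

```python
def mean_silhouette(points, labels):
    """Simplified silhouette score: for each point, a = mean distance to
    its own cluster, b = mean distance to the closest other cluster;
    the point's score is (b - a) / max(a, b). Higher is better, and no
    ground-truth labels are required - only cluster assignments."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)

    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q is not p]
        if not own:
            continue  # singleton clusters are skipped
        a = sum(dist(p, q) for q in own) / len(own)
        b = min(sum(dist(p, q) for q in c) / len(c)
                for k, c in clusters.items() if k != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Scores near 1 mean tight, well-separated clusters; scores near 0 or below suggest the clustering (or the chosen number of clusters) is poor. It doesn't tell you whether the clusters are semantically meaningful, but it lets you compare runs without any labels.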
In case anyone comes here looking for a simple script to create an unsupervised visualization from a collection of images, I just published this: https://github.com/brunovianna/collectionview