Unsupervised-Classification
Train on own dataset
Hello, thanks for sharing this work. Are there instructions for training on our own dataset? Thanks
+1
I have been working on this for a while. After a huge hiccup I decided to start over and document it. So far I have been able to get pretext working with a custom dataset and the semantic clustering step is currently running. If you want I can share my document once everything is up and running. Hoping to have it done this weekend. @fernandorovai @showkeyjar
Come on! Looking forward to it!
+1
The most important thing you need to do is implement your own dataset.py. You can copy an existing one, like cifar10.py, into the data directory. Then edit configs/env.yml to set the output folder.
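To make the "copy cifar10.py" advice concrete, here is a minimal sketch of what such a dataset class needs to provide, assuming the repo's convention of returning a dict with 'image', 'target' and 'meta' keys. The class name, the injectable loader, and the use of 255 as a dummy "ignore" target (borrowed from the repo's handling of unlabeled STL-10 images) are my assumptions, not the repo's API:

```python
import os

class CustomDataset:
    """Minimal stand-in for a data/cifar10.py-style dataset (names assumed).

    Returns a dict with 'image', 'target' and 'meta'; since the data is
    unlabeled, every target is the dummy ignore index 255."""

    def __init__(self, root, transform=None, loader=None):
        self.transform = transform
        # loader is injectable so this sketch has no hard PIL dependency;
        # in practice pass e.g. lambda p: Image.open(p).convert('RGB')
        self.loader = loader if loader is not None else (lambda p: p)
        exts = ('.png', '.jpg', '.jpeg')
        self.paths = sorted(
            os.path.join(root, f) for f in os.listdir(root)
            if f.lower().endswith(exts))

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, index):
        img = self.loader(self.paths[index])
        if self.transform is not None:
            img = self.transform(img)
        return {'image': img,
                'target': 255,  # dummy: no real labels exist
                'meta': {'index': index, 'path': self.paths[index]}}
```

You would still need to register the new class in the repo's dataset factory and point configs/env.yml at your data root.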
I see, but those examples all have labels. I'm looking for an example with no labels at all.
It isn't a polished product but this is what I have: https://github.com/hkhailee/GOLDEN_unsupervised/blob/main/TrainingYourOwnDataset.md
@hkhailee Very good, and thank you. I'll give it a try.
I can't open the link
That's the correct path. If you're having issues, try searching directly for the GOLDEN_unsupervised repository under user hkhailee; the file is TrainingYourOwnDataset.md on the main branch.
You can also click on a user's name (such as mine), view their repositories, and open GOLDEN_unsupervised, or manually copy and paste the link into another browser window.
@hkhailee The text of your link is correct, but the embedded URL is not (people who click are directed somewhere else). You can change it via "Add a link."
The dataloader has a target variable. Since I am using unlabeled data, I do not have labels, so I am setting target = 0 to make the code work. The simclr and scan steps run, but selflabel throws the error "Mask in MaskedCrossEntropyLoss is all zeros". How will the model work on a completely unlabeled training set?
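As far as I can tell, that error is not caused by the dummy targets: in the self-label step, the mask marks samples whose maximum softmax probability exceeds a confidence threshold (0.99 in the repo's published configs, if I read them correctly), and the assertion fires when no sample in the batch is that confident yet. A plain-Python sketch of that masking logic (the threshold value is an assumption from the configs):

```python
def confident_mask(softmax_probs, threshold=0.99):
    """Mark samples whose top class probability exceeds the threshold.

    softmax_probs: list of per-sample probability lists."""
    mask = [max(p) > threshold for p in softmax_probs]
    if not any(mask):
        # This is the condition behind "Mask in MaskedCrossEntropyLoss
        # is all zeros": the model is not confident about any sample yet.
        raise ValueError('Mask in MaskedCrossEntropyLoss is all zeros')
    return mask
```

So lowering the confidence threshold in the selflabel config, or training the SCAN step longer so the cluster head becomes more confident, is usually what resolves it.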
I'm having a problem similar to @MotiBaadror's. My dataset is very complex and has no labels. How do I cluster my data like this?
@cilemafacan When I run simclr.py on my dataset, this step saves an npz file. When I open that file, the nearest neighbors are the same for all examples. Did you encounter the same problem?
I am creating a file for my own dataset, similar to the stl10 dataset file in the data folder. My data doesn't have any labels, so I assign the default value 2 to every target instead of real labels. This way I can start simclr training. When I examine the resulting .npy file, the nearest neighbors are different. The .npy file looks like this:
array([[ 265, 2049,  109, 1353, 2028,  532,  395,  144, 2084, 1067,  942, 1343,  830, 1054, 2191,  189, 1239, 1738,  501,  123,  619],
       [ 144, 1414, 1428, 1310, 1064, 1954,  424,   95, 1520,  334, 2145, 1641,  323, 1670, 1543,  538,  920, 1180, 1540, 2050, 1814],
       [ 145,  279,  921, 1939,  179,  713,  861,  720, 1489, 1005, 1283, 1170,  413,  405,  260,  273, 2305, 2198, 1564, 1818,  289],
       [1604,  259, 1300,  532, 1680, 1817, 2184, 1428, 1576,  315,  174, 1983, 1128, 1753, 1733,   40,  893,  889,  748, 1255, 2046], ...
I'm not sure I'm doing it right. What I don't understand is the contrastive_evaluate step on line 120 of simclr.py; I don't understand exactly what is being done there. I'm getting an error at this step because my data is unlabeled.
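As far as I can tell, contrastive_evaluate runs a nearest-neighbour classifier over the feature memory bank and scores it against the ground-truth targets, so it only measures pretext quality and is not needed for training itself. A stripped-down sketch of that idea (1-NN instead of the repo's weighted kNN, plain lists instead of tensors) shows exactly where labels enter:

```python
def knn_accuracy(train_feats, train_labels, val_feats, val_labels):
    """1-NN accuracy with dot-product similarity (features assumed
    L2-normalised, as in the memory bank).

    Labels are only used here, for scoring - with fully unlabeled data
    this evaluation can simply be skipped."""
    correct = 0
    for feat, label in zip(val_feats, val_labels):
        # similarity of this validation feature to every bank feature
        sims = [sum(a * b for a, b in zip(feat, t)) for t in train_feats]
        nearest = sims.index(max(sims))
        correct += int(train_labels[nearest] == label)
    return correct / len(val_feats)
```

So with dummy targets the reported accuracy is meaningless; commenting the call out (or guarding it behind a "has labels" flag) should not affect the learned representation.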
How did you guys set up your models and __getitem__()?
Here is my __getitem__:
def __getitem__(self, index):
    # sample = self.dataset.__getitem__(index)
    # image = sample['image']
    sample = {}
    image = self.dataset.__getitem__(index)[0]
    sample['target'] = 1  # dummy label; the dataset is unlabeled
    sample['image'] = self.image_transform(image)
    sample['image_augmented'] = self.augmentation_transform(image)
    return sample
This is how I am defining my dataset:

dataset = torchvision.datasets.ImageFolder('data/my_data', transform=transform)

I am running simclr.py on my dataset. I do not have labels, so I am setting target = 1 to make the code run.
Not sure, since I only worked with moco.py, but for mine:

Calling the dataset:

elif p['train_db_name'] == 'rico-20':
    from data.rico import RICO20
    subset_file = ''
    dataset = RICO20(subset_file=subset_file, split='train', transform=transform)
RICO20:

class RICO20(datasets.ImageFolder):
    def __init__(self, subset_file, root=MyPath.db_root_dir('rico-20'), split='train', transform=None):
        super(RICO20, self).__init__(root=os.path.join(root, '%s/' % (split)),
                                     transform=None)
        self.transform = transform
        self.split = split
        self.resize = tf.Resize(256)

    def __len__(self):
        return len(self.imgs)

    def __getitem__(self, index):
        path, target = self.imgs[index]
        with open(path, 'rb') as f:
            img = Image.open(f).convert('RGB')
        im_size = img.size
        img = self.resize(img)
        if self.transform is not None:
            img = self.transform(img)
        out = {'image': img, 'target': target, 'meta': {'im_size': im_size, 'index': index, 'path': path}}
        return out

    def get_image(self, index):
        path, target = self.imgs[index]
        with open(path, 'rb') as f:
            img = Image.open(f).convert('RGB')
        img = self.resize(img)
        return img
My model parameters for pretext:

setup: moco  # MoCo is used here
backbone: resnet50
model_kwargs:
  head: mlp
  features_dim: 128
train_db_name: rico-20
val_db_name: rico-20
num_classes: 20
temperature: 0.07
batch_size: 128
num_workers: 8
transformation_kwargs:
  crop_size: 224
  normalize:
    mean: [0.485, 0.456, 0.406]
    std: [0.229, 0.224, 0.225]
I separated my data into 3 groups: train, test and val. You do not have a val set (mine was only 1k images). Could it be overfitting your clusters?
Also, I listed 20 classes, but after clustering and inspecting everything visually, only 19 unique clusters were found; beyond that, clustered images start to repeat (kind of cool).
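That "19 of 20 clusters" observation can be checked programmatically instead of visually. A small sketch (my own helper, not part of the repo) that summarises how many of the available cluster ids actually receive images:

```python
from collections import Counter

def cluster_usage(assignments, num_clusters):
    """Summarise how many of the available clusters are actually used.

    assignments: predicted cluster id per image (e.g. argmax of the
    SCAN head's output). Returns (number used, list of empty ids)."""
    counts = Counter(assignments)
    used = len(counts)
    empty = [c for c in range(num_clusters) if c not in counts]
    return used, empty
```

An empty cluster is a common sign that num_classes in the config is higher than the number of visually distinct groups in the data.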
What will the target be in your __getitem__? Do you have labels?
In the repo you created for RICO, you pass the subset file path in the moco file. Similarly, I would need labels to create a subset file, but my dataset has no labels. Are you assigning a label to the train data in your __getitem__? In the __getitem__ I wrote for myself, I set the label to the integer 2, but only so simclr would run; I don't actually have such a label.
elif p['train_db_name'] == 'rico-20':
    from data.rico import RICO20
    subset_file = '/bsuhome/hkiesecker/scratch/imageClassification/GOLDEN/UnsupervisedClassification/data/rico_subsets/%s.txt' % (p['train_db_name'])
    dataset = RICO20(subset_file=subset_file, split='train', transform=transform)
The subset file in RICO20 is never used; I never call RICO20_sub, which would use one. There is a piece of code in data/stl.py that labels all unlabeled images with 255 and uses that as their target. When I was using moco, I had 66k unlabeled images and 1k labeled images to test my results against.
if self.labels is not None:
    img, target = self.data[index], int(self.labels[index])
    class_name = self.classes[target]
else:
    img, target = self.data[index], 255  # 255 is an ignore index
    class_name = 'unlabeled'
With moco, the target comes from the name of the folder the image is in. My train, val and test images are all in subfolders, e.g. train/1 and test/1; my 1k labeled images are also in subfolders with real names, such as val/bare and val/gallery. These names aren't important: I could have used train/20 and test/20, and labeled my validation folders 0-19.
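This is why the folder names don't matter: torchvision's ImageFolder derives integer targets by sorting the subfolder names and numbering them, so any set of names produces valid (if arbitrary) targets. A stdlib-only sketch of that mapping, as I understand ImageFolder's behaviour:

```python
import os

def folder_targets(root):
    """Mimic torchvision.datasets.ImageFolder's labelling: each
    subfolder of root becomes a class, indexed by sorted folder name."""
    classes = sorted(d for d in os.listdir(root)
                     if os.path.isdir(os.path.join(root, d)))
    return {name: idx for idx, name in enumerate(classes)}
```

So val/bare and val/gallery simply become targets 0 and 1; renaming the folders only permutes the integers.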
So you are using the 1k labeled enRICO images as the val dataset. Did I get that right?
Has anyone created a better repository for training on an unlabeled dataset?
Hi @hkhailee, @cilemafacan and @MotiBaadror,
I got the same error (Mask in MaskedCrossEntropyLoss is all zeros) while trying selflabel. Is there a way to run selflabel without a labeled validation dataset?
Thank you in advance!
For me (at the moment with labeled data), this dataset class works:

import sys, os
from PIL import Image
import cv2
from torch.utils.data import Dataset

sys.path.append(os.getcwd())
class OwnDataset(Dataset):
    def __init__(self, img_paths, transform=None, class_names=['bla'], im_size=128):
        self.img_paths = img_paths
        self.transform = transform
        self.class_names = class_names
        self.im_size = im_size

    def __len__(self):
        return len(self.img_paths)

    def __getitem__(self, idx):
        img_filepath = self.img_paths[idx]
        img = cv2.imread(img_filepath)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (self.im_size, self.im_size))
        img = Image.fromarray(img)
        if self.transform is not None:
            img = self.transform(img)
        class_name = os.path.basename(os.path.dirname(img_filepath))
        target = self.class_names.index(class_name)
        out = {'image': img, 'target': target, 'meta': {'im_size': self.im_size, 'class_name': self.class_names}}
        return out

    def get_image(self, idx):
        img_filepath = self.img_paths[idx]
        img = cv2.imread(img_filepath)
        img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
        img = cv2.resize(img, (self.im_size, self.im_size))
        img = Image.fromarray(img)
        if self.transform is not None:
            img = self.transform(img)
        return img
For training with no class names, adding this works:

if self.class_names is not None:
    class_name = os.path.basename(os.path.dirname(img_filepath))
    target = self.class_names.index(class_name)
else:
    target = 0
    class_name = 'No class'

However, model evaluation without labels is still an issue...
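On evaluating without labels: one common label-free proxy is a silhouette-style score, which only needs the features and the predicted cluster assignments. A simplified plain-Python sketch (the real silhouette score, e.g. sklearn.metrics.silhouette_score, averages per-point scores the same way; this version is just for illustration):

```python
def mean_silhouette(points, labels):
    """Simplified silhouette score: for each point, a = mean distance to
    its own cluster, b = mean distance to the closest other cluster;
    the point's score is (b - a) / max(a, b). Higher is better, and no
    ground-truth labels are required - only cluster assignments."""
    def dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)

    scores = []
    for p, l in zip(points, labels):
        own = [q for q in clusters[l] if q is not p]
        if not own:
            continue  # singleton clusters are skipped
        a = sum(dist(p, q) for q in own) / len(own)
        b = min(sum(dist(p, q) for q in c) / len(c)
                for k, c in clusters.items() if k != l)
        scores.append((b - a) / max(a, b))
    return sum(scores) / len(scores)
```

Scores near 1 mean tight, well-separated clusters; scores near 0 or below suggest the clustering (or the chosen number of clusters) is poor. It doesn't tell you whether the clusters are semantically meaningful, but it lets you compare runs without any labels.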
In case anyone comes here looking for a simple script to create an unsupervised visualization from a collection of images, I just published this: https://github.com/brunovianna/collectionview