MetaLearning-TF2.0
Training with custom data gets stuck on the first iteration.
When I try to train a model with my custom data, it gets stuck with the following output:
No previous checkpoint found!
0it [00:00, ?it/s]
It stays like this until I interrupt the training. I can see that the GPU is in use, but nothing else seems to happen. Is this normal? As far as I know, I should see logs such as accuracy, loss, epoch count, etc.
Yes, you should be able to see some logs here. Can you please check if a folder is created in the maml directory for your logs? Also, would it be possible for you to share the code, perhaps on a fork or another branch here? What about the dataset?
A folder is actually created in the master folder, as I moved my script into the master folder before running it. Of course, I would love to share my code and dataset.
I have created a fork and uploaded my code, and here are links to the dataset and log files. Thank you again for your help.
Dataset: https://drive.google.com/file/d/1l37mCGycof3qI58gvjfDJ5tCE_cISEGz/view?usp=sharing
log file: https://drive.google.com/file/d/1yMrQlwn9AqaVWvOBuBt-jyCYPPAEsn27/view?usp=sharing
Thanks! Please share the link to the fork with me. I will look into it soon, however, I am a little busy with some other projects now so it might take a couple of days before I get back to you.
Here is the link to the repo: https://github.com/SamiurRahman1/MetaLearning-TF2.0
Can you please point me to the python files you added?
Sorry, I am not very familiar with GitHub in general. Here are the script links: https://drive.google.com/file/d/1183N5iO-fZugJ6_lXetZuno53W-uxrLS/view?usp=sharing, https://drive.google.com/file/d/1XelBlei1WLE3o5kOnzho4jupmpQmGng9/view?usp=sharing
Hello, have you had some time to look at the code?
Hi, I am still waiting for a reply on this, if possible.
Hello,
I have been a little busy. Unfortunately, I cannot check this from the Python files you shared on Google Drive, since it is hard to track the changes there. Please put everything on GitHub so I can check out that particular branch and debug it.
Thank you very much, Siavash
Hi, uploading to or creating a branch on your repo is disabled, for obvious reasons. I created a pull request and uploaded the files. Maybe you can find them there? If not, could you please tell me exactly how I should upload them?
Sorry for my late reply; I have been busy. I looked at the code. It seems to me that your dataset has only 6 classes, is that correct? In that case, do you want to do meta-learning on it, or do you want to use it just for testing? If you want to do meta-learning, you need to have different tasks: your meta-batch-size is 4 and n is 5, which means you need at least 20 classes. However, I think the program should check this before running and give an appropriate error message; please let me know if this is the case. One thing you can try is setting meta-batch-size=1 and seeing if the program still gets stuck. Thanks again for using this repo.
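For what it's worth, a sanity check of this kind could be sketched as below. The function name and signature are hypothetical, purely for illustration, and are not part of the repo:

```python
def check_enough_classes(num_classes, n, meta_batch_size):
    # Hypothetical helper (not in the repo): each task samples n distinct
    # classes, so a meta-batch of meta_batch_size non-overlapping tasks
    # needs at least n * meta_batch_size classes in the dataset.
    required = n * meta_batch_size
    if num_classes < required:
        raise ValueError(
            f"n={n} with meta_batch_size={meta_batch_size} requires at least "
            f"{required} classes, but the dataset only has {num_classes}."
        )

# With the settings above (n=5, meta-batch-size=4), a 6-class dataset
# would fail this check instead of hanging silently.
```

Raising early like this would turn the silent hang into an actionable error message.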
Hi, thanks for your reply. I'm trying to train a model with the dataset. I'll try your suggestion and get back to you.
Hi, so I tried running the training with meta-batch-size=1, but unfortunately it still gets stuck. Is there anything else I need to change if I want to train the model with only 6 classes?
Okay, I see that in the dataset class there are 4 classes for training and 2 classes for validation. In this case, there is no way to generate 5-way tasks during training because there are only 4 training classes. Can you please try using all 6 classes for training, and all 6 for validation and test as well, just to see if that is the problem? You can also set n=4 instead of 5 with meta-batch-size=1, but since your validation set has just two classes, I think you might hit the same problem.
When I set n=4, num_train_classes=6, and num_val_classes=6, it throws the following error:
ValueError: in user code:

    /data/yali/sam/Project/MetaLearning-TF2.0-master/models/base_model.py:254 meta_train_loop  *
        task_final_acc, task_final_loss = self.get_losses_of_tasks_batch(method='train')(
    /data/yali/sam/Project/MetaLearning-TF2.0-master/models/maml/maml.py:237 inner_train_loop  *
        self.create_meta_model(self.updated_models[0], self.model, gradients)
    /data/yali/sam/Project/MetaLearning-TF2.0-master/models/maml/maml.py:133 create_meta_model  *
        model_layer = model_layer.get_layer(layer_name)
    /home/lili/miniconda3/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:2398 get_layer  **
        raise ValueError('No such layer: ' + name + '.')

    ValueError: No such layer: simple_model.
I do not think you should set num_train_classes and num_val_classes to 6, because that implies you have at least 12 classes, with the rest reserved for testing. Can you please make sure your function def get_train_val_test_folders(self) -> Tuple: returns the same classes for all three splits? Let me try to write it here:
def get_train_val_test_folders(self) -> Tuple:
    # Collect one folder per class from the dataset directory.
    myDir = "/data/yali/sam/Project/MetaLearning-TF2.0-master/data/Family/"
    damageTypes = list()
    for item in os.listdir(myDir):
        damageTypes.append(item)

    damageImg = list()
    for damage in damageTypes:
        damageImg.append(os.path.join(myDir, damage))
    damageImg.sort()

    num_train_classes = self.num_train_classes  # unused in this debug version
    num_val_classes = self.num_val_classes

    random.shuffle(damageImg)
    # Use the same classes for train, val, and test, just for debugging.
    train_chars = damageImg
    val_chars = damageImg
    test_chars = damageImg

    train_classes = {char: [os.path.join(char, instance) for instance in os.listdir(char)] for char in train_chars}
    val_classes = {char: [os.path.join(char, instance) for instance in os.listdir(char)] for char in val_chars}
    test_classes = {char: [os.path.join(char, instance) for instance in os.listdir(char)] for char in test_chars}
    return train_classes, val_classes, test_classes
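For reference, the {class_folder: [image_paths]} shape this function returns can be checked on a toy directory. The folder names below are made up, only the layout matters:

```python
import os
import tempfile

# Build a throwaway directory mimicking the dataset layout:
# one sub-folder per class, a few image files inside each.
root = tempfile.mkdtemp()
for cls in ("damage_a", "damage_b", "damage_c"):
    os.makedirs(os.path.join(root, cls))
    for i in range(2):
        open(os.path.join(root, cls, f"img_{i}.jpg"), "w").close()

class_dirs = sorted(os.path.join(root, d) for d in os.listdir(root))
classes = {c: [os.path.join(c, f) for f in os.listdir(c)] for c in class_dirs}

# Same dict of lists for all three splits, as suggested above.
train_classes = val_classes = test_classes = classes
```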
I made the changes to the function as you suggested and set n=4, num_train_classes=3, num_val_classes=3. Unfortunately, I am still getting the same error.
Can you run maml_omniglot?
With the Omniglot dataset, you mean?
Hmm, interestingly, I get the same error when I run maml_omniglot.py.
It seems to be something related to the TF version. What is your TensorFlow version? I can run maml_omniglot.py with TF 2.2.0-rc2.
My TF version is 2.3.1.
I encounter the same problem when the TF version is 2.3.1. Has this error been solved?
The code currently works with TF version 2.2.0-rc2. I would be glad to get a merge request updating it to a newer version if you are interested.
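Until such an update lands, a stdlib-only guard could fail fast instead of hitting the "No such layer" error above. The helpers below are a hypothetical sketch, not part of the repo, and assume TF version strings look like "2.3.1" or "2.2.0-rc2":

```python
def version_tuple(v):
    # "2.3.1" -> (2, 3, 1); "2.2.0-rc2" -> (2, 2, 0) (pre-release tag dropped)
    return tuple(int(part) for part in v.split("-")[0].split("."))

def is_compatible(tf_version, known_good="2.2.0-rc2"):
    # The repo is only known to work up to 2.2.0-rc2; newer releases
    # (e.g. 2.3.1) changed Keras internals and trigger the error above.
    return version_tuple(tf_version) <= version_tuple(known_good)

# Typical use at startup (tf.__version__ would be passed in):
# if not is_compatible(tf.__version__):
#     raise RuntimeError("Please install tensorflow==2.2.0-rc2")
```

Note this comparison deliberately ignores pre-release suffixes, which is good enough for a coarse compatibility gate.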