MetaLearning-TF2.0
Training with custom data gets stuck on the first iteration.
When I try to train a model with my custom data, it gets stuck with the following output:
No previous checkpoint found!
0it [00:00, ?it/s]
It stays like this until I interrupt the training. I can see that the GPU is in use, but nothing else seems to happen. Is this normal? As far as I know, I should see logs such as accuracy, loss, epoch count, etc.
Yes, you should be able to see some logs here. Can you please check if a folder is created in the maml directory for your logs? Also, would it be possible for you to share the code, perhaps on a fork or another branch here? What about the dataset?
A folder is actually created in the master folder, as I moved my script into the master folder before running it. Of course, I would love to share my code and dataset.
I have created a fork and uploaded my code, and here are links to the dataset and log files. Thank you again for your help.
Dataset: https://drive.google.com/file/d/1l37mCGycof3qI58gvjfDJ5tCE_cISEGz/view?usp=sharing
log file: https://drive.google.com/file/d/1yMrQlwn9AqaVWvOBuBt-jyCYPPAEsn27/view?usp=sharing
Thanks! Please share the link to the fork with me. I will look into it soon, however, I am a little busy with some other projects now so it might take a couple of days before I get back to you.
Here is the link to the repo: https://github.com/SamiurRahman1/MetaLearning-TF2.0
Can you please point me to the python files you added?
Sorry, I am not very familiar with GitHub in general. Here are the script links: https://drive.google.com/file/d/1183N5iO-fZugJ6_lXetZuno53W-uxrLS/view?usp=sharing, https://drive.google.com/file/d/1XelBlei1WLE3o5kOnzho4jupmpQmGng9/view?usp=sharing
Hello, have you had some time to look at the code?
Hi, I am still waiting for a reply on this, if possible.
Hello,
I have been a little busy. Unfortunately, I cannot check this from the Python files you shared on Google Drive, since it is hard to track the changes there. Please put everything on GitHub so I can check out that particular branch and debug it.
Thank you very much, Siavash
Hi, uploading to or creating a branch on your repo is disabled, for obvious reasons. I created a pull request and uploaded the files. Maybe you can find them there? If not, could you please tell me exactly how I should upload them?
Sorry for my late reply; I have been busy. I looked at the code. It seems to me that your dataset has only 6 classes, is that correct? In that case, do you want to do meta-learning on it, or do you want to use it just for testing? If you want to do meta-learning, you need to have different tasks: your meta-batch-size is 4 and n is 5, which means you need at least 20 classes. However, I think the program should check this before running and give an appropriate error message; please let me know if this is the case. One thing you can try is setting meta-batch-size=1 and seeing if the program still gets stuck. Thanks again for using this repo.
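For what it's worth, a sanity check of this kind could be sketched as below. The function name and signature are hypothetical, purely for illustration, and are not part of the repo:

```python
def check_enough_classes(num_classes, n, meta_batch_size):
    # Hypothetical helper (not in the repo): each task samples n distinct
    # classes, so a meta-batch of meta_batch_size non-overlapping tasks
    # needs at least n * meta_batch_size classes in the dataset.
    required = n * meta_batch_size
    if num_classes < required:
        raise ValueError(
            f"n={n} with meta_batch_size={meta_batch_size} requires at least "
            f"{required} classes, but the dataset only has {num_classes}."
        )

# With the settings above (n=5, meta-batch-size=4), a 6-class dataset
# would fail this check instead of hanging silently.
```

Raising early like this would turn the silent hang into an actionable error message.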
Hi, thanks for your reply. I'm trying to train a model with the dataset. I'll try your suggestion and get back to you.
Hi, so I tried running the training with meta-batch-size=1, but unfortunately it still gets stuck. Is there anything else I need to change if I want to train the model with only 6 classes?
Okay, I see that in the dataset class there are 4 classes for training and 2 classes for validation. In this case, there is no way to generate 5-way tasks during training because there are only 4 training classes. Can you please try using all 6 classes for training, and all 6 for validation and test as well, just to see if that is the problem? You can also set n=4 instead of 5 with meta-batch-size=1, but since your validation set has just two classes, I think you might hit the same problem.
When I set n=4, num_train_classes=6, and num_val_classes=6, it throws the following error:
ValueError: in user code:

    /data/yali/sam/Project/MetaLearning-TF2.0-master/models/base_model.py:254 meta_train_loop  *
        task_final_acc, task_final_loss = self.get_losses_of_tasks_batch(method='train')(
    /data/yali/sam/Project/MetaLearning-TF2.0-master/models/maml/maml.py:237 inner_train_loop  *
        self.create_meta_model(self.updated_models[0], self.model, gradients)
    /data/yali/sam/Project/MetaLearning-TF2.0-master/models/maml/maml.py:133 create_meta_model  *
        model_layer = model_layer.get_layer(layer_name)
    /home/lili/miniconda3/lib/python3.8/site-packages/tensorflow/python/keras/engine/training.py:2398 get_layer  **
        raise ValueError('No such layer: ' + name + '.')

    ValueError: No such layer: simple_model.
I do not think you should set num_train_classes and num_val_classes to 6, because that implies you have at least 12 classes, with the rest reserved for testing. Can you please make sure your function def get_train_val_test_folders(self) -> Tuple: returns the same classes for all three splits? Let me try to write it here:
def get_train_val_test_folders(self) -> Tuple:
    # Collect one folder per class from the dataset directory.
    myDir = "/data/yali/sam/Project/MetaLearning-TF2.0-master/data/Family/"
    damageTypes = list()
    for item in os.listdir(myDir):
        damageTypes.append(item)

    damageImg = list()
    for damage in damageTypes:
        damageImg.append(os.path.join(myDir, damage))
    damageImg.sort()

    num_train_classes = self.num_train_classes  # unused in this debug version
    num_val_classes = self.num_val_classes

    random.shuffle(damageImg)
    # Use the same classes for train, val, and test, just for debugging.
    train_chars = damageImg
    val_chars = damageImg
    test_chars = damageImg

    train_classes = {char: [os.path.join(char, instance) for instance in os.listdir(char)] for char in train_chars}
    val_classes = {char: [os.path.join(char, instance) for instance in os.listdir(char)] for char in val_chars}
    test_classes = {char: [os.path.join(char, instance) for instance in os.listdir(char)] for char in test_chars}
    return train_classes, val_classes, test_classes
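For reference, the {class_folder: [image_paths]} shape this function returns can be checked on a toy directory. The folder names below are made up, only the layout matters:

```python
import os
import tempfile

# Build a throwaway directory mimicking the dataset layout:
# one sub-folder per class, a few image files inside each.
root = tempfile.mkdtemp()
for cls in ("damage_a", "damage_b", "damage_c"):
    os.makedirs(os.path.join(root, cls))
    for i in range(2):
        open(os.path.join(root, cls, f"img_{i}.jpg"), "w").close()

class_dirs = sorted(os.path.join(root, d) for d in os.listdir(root))
classes = {c: [os.path.join(c, f) for f in os.listdir(c)] for c in class_dirs}

# Same dict of lists for all three splits, as suggested above.
train_classes = val_classes = test_classes = classes
```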
I made the changes to the function as you suggested and set n=4, num_train_classes=3, num_val_classes=3. Unfortunately, I am still getting the same error.
Can you run maml_omniglot?
With the Omniglot dataset, you mean?
Hmm, interestingly, I get the same error when I run maml_omniglot.py.
It seems to be something related to the TF version. What is your TensorFlow version? I can run maml_omniglot.py with TF 2.2.0-rc2.
My TF version is 2.3.1.
I encounter the same problem when the TF version is 2.3.1. Has this error been solved?
The code currently works with TF version 2.2.0-rc2. I would be glad to get a merge request updating it to a newer version if you are interested.
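Until such an update lands, a stdlib-only guard could fail fast instead of hitting the "No such layer" error above. The helpers below are a hypothetical sketch, not part of the repo, and assume TF version strings look like "2.3.1" or "2.2.0-rc2":

```python
def version_tuple(v):
    # "2.3.1" -> (2, 3, 1); "2.2.0-rc2" -> (2, 2, 0) (pre-release tag dropped)
    return tuple(int(part) for part in v.split("-")[0].split("."))

def is_compatible(tf_version, known_good="2.2.0-rc2"):
    # The repo is only known to work up to 2.2.0-rc2; newer releases
    # (e.g. 2.3.1) changed Keras internals and trigger the error above.
    return version_tuple(tf_version) <= version_tuple(known_good)

# Typical use at startup (tf.__version__ would be passed in):
# if not is_compatible(tf.__version__):
#     raise RuntimeError("Please install tensorflow==2.2.0-rc2")
```

Note this comparison deliberately ignores pre-release suffixes, which is good enough for a coarse compatibility gate.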