facenet-pytorch-glint360k

triplet_loss_dataloader.py

Open YoonSeongGyeol opened this issue 3 years ago • 3 comments

Hello, I'm Daniel. While running your project, one question came up.

In dataloader/triplet_loss_dataloader, each process is allocated a number of triplets to generate and then randomly picks a (pos, neg) class pair and randomly selects images. However, when np.random.choice is used, I confirmed that the same random values are output by every process. After switching to np.random.RandomState(), each process produced different random values.
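A minimal sketch of the behaviour being described, assuming a multiprocessing `Pool` (the function and variable names here are illustrative, not the project's actual code): each worker creates its own `RandomState(seed=None)` instead of relying on NumPy's shared global state.

```python
import numpy as np
from multiprocessing import Pool


def generate_pairs(args):
    num_triplets, classes = args
    # seed=None draws fresh entropy in each process, so workers do not
    # replicate each other's "random" choices the way the copied global
    # NumPy state does after fork.
    rng = np.random.RandomState(seed=None)
    pairs = []
    for _ in range(num_triplets):
        pos_class, neg_class = rng.choice(classes, size=2, replace=False)
        pairs.append((int(pos_class), int(neg_class)))
    return pairs


if __name__ == "__main__":
    classes = np.arange(1000)
    with Pool(processes=4) as pool:
        # Each worker now produces a different set of (pos, neg) class pairs.
        print(pool.map(generate_pairs, [(3, classes)] * 4))
```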

Please let me know whether my understanding of this process is correct.

Thank you. Daniel

YoonSeongGyeol avatar Oct 27 '20 04:10 YoonSeongGyeol

Hi Daniel,

Thank you very much for catching this one. The intention was only to speed up the triplet generation process, not to replicate the same generated triplets across the spawned processes, hehe. I have edited the dataloader as you described: the RandomState() object is now initialized with seed=None, so each process gets a fresh random seed and then randomly chooses the required elements for triplet creation.

To be clear, the current pre-trained model was trained on 10 million triplets that were generated without the multiprocessing method.

The reason I am using the "triplet generation" method is to have some kind of naive reproducibility when changing training parameters. The intention is to conduct future experiments with a set number of human identities per triplet batch, where the dataloader would generate and yield a set number of triplets per training iteration instead of using a pre-generated list of triplets as in the current version.
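A hedged sketch of what such a per-iteration dataloader could look like (the class name, arguments, and structure are hypothetical, not the project's implementation): sample a fixed number of identities, then yield freshly generated triplets every iteration.

```python
import numpy as np
from torch.utils.data import IterableDataset


class OnlineTripletSampler(IterableDataset):
    def __init__(self, class_to_indices, identities_per_batch, triplets_per_iter):
        # class_to_indices: dict mapping class id -> np.array of image indices
        # (each class is assumed to contain at least two images)
        self.class_to_indices = class_to_indices
        self.identities_per_batch = identities_per_batch
        self.triplets_per_iter = triplets_per_iter

    def __iter__(self):
        # Fresh seed per worker/process, as discussed above.
        rng = np.random.RandomState(seed=None)
        classes = np.array(list(self.class_to_indices.keys()))
        while True:
            # Restrict each "batch" of triplets to a fixed set of identities.
            chosen = rng.choice(classes, size=self.identities_per_batch, replace=False)
            for _ in range(self.triplets_per_iter):
                pos_class, neg_class = rng.choice(chosen, size=2, replace=False)
                anchor, positive = rng.choice(self.class_to_indices[pos_class], size=2, replace=False)
                negative = rng.choice(self.class_to_indices[neg_class])
                # A full implementation would load and transform the images
                # here; this sketch only yields their indices.
                yield anchor, positive, negative
```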

However, there are two current issues I am dealing with that you should be aware of before using this project:

1- After some training epochs, the BatchNorm2D operation requires more VRAM and eventually causes a CUDA out-of-memory exception. I was training one epoch per day, since one epoch took around 11 hours on my PC and I would shut the process down when it finished so I could use the machine for other things. That way I managed to get training with a batch size of 256 to work, but it would hit an OOM if left running for several epochs. I would therefore recommend using a lower batch size that initially allocates around 40-60% of your GPU VRAM (see the sketch after this list).

2- I tried switching to the CPU for the iterations that caused the OOM in order to continue training. Unfortunately, switching to the CPU has a negative impact on the model's performance metrics, and I still don't know why that is the case.
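Two hypothetical helpers related to the points above (a sketch only, not code from the repository): the first reports what fraction of the GPU's memory is currently allocated, which can help keep the initial allocation in the 40-60% range mentioned in point 1; the second retries an out-of-memory iteration on the CPU, as described in point 2.

```python
import torch


def vram_fraction(device=0):
    """Fraction of the GPU's total memory currently allocated by tensors."""
    total = torch.cuda.get_device_properties(device).total_memory
    return torch.cuda.memory_allocated(device) / total


def forward_with_cpu_fallback(model, batch, device="cuda"):
    """Run a forward pass on the GPU, retrying on the CPU if it OOMs."""
    try:
        return model(batch.to(device))
    except RuntimeError as e:
        # Older PyTorch versions raise a plain RuntimeError whose message
        # contains "out of memory" when the allocator runs out of VRAM.
        if "out of memory" not in str(e):
            raise
        torch.cuda.empty_cache()
        model.to("cpu")
        # The caller is responsible for moving the model back to the GPU
        # after this iteration's optimizer step.
        return model(batch.to("cpu"))
```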

Again, thank you very much for catching the issue.

tamerthamoqa avatar Oct 27 '20 05:10 tamerthamoqa

Hello.

Thank you for answering my question. My PC has four TITAN GPUs (12 GB each), so I used multi-GPU training (data parallel); each replica therefore gets a batch of 256 / 4 = 64. I finished one epoch (10,000,000 triplets) in approximately 3 hours.

There is no problem at present. The one small difference is that performance is a bit lower, but I mostly call torch.cuda.empty_cache() to avoid OOM, and training now runs without any problems.
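A rough sketch of the setup described in this comment (the training-loop structure and `loss_fn`, e.g. `nn.TripletMarginLoss()`, are assumptions, not the project's actual code): `nn.DataParallel` splits each batch across the visible GPUs, and `torch.cuda.empty_cache()` is called periodically.

```python
import torch
import torch.nn as nn


def train_one_epoch(model, dataloader, optimizer, loss_fn, device="cuda"):
    # DataParallel splits each 256-sample batch across the visible GPUs,
    # so with 4 GPUs every replica sees 256 / 4 = 64 samples.
    model = nn.DataParallel(model).to(device)
    model.train()
    for step, (anchor, positive, negative) in enumerate(dataloader):
        optimizer.zero_grad()
        loss = loss_fn(model(anchor.to(device)),
                       model(positive.to(device)),
                       model(negative.to(device)))
        loss.backward()
        optimizer.step()
        if step % 100 == 0:
            # Periodically hand cached-but-unused blocks back to the driver
            # to reduce the chance of fragmentation-related OOM.
            torch.cuda.empty_cache()
```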

YoonSeongGyeol avatar Oct 27 '20 09:10 YoonSeongGyeol

We may work on this as well. I noticed that triplet generation is not a very fast process; DataFrames are probably not that fast for this kind of usage.
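One possible speed-up along these lines (an illustrative sketch, not code from the repository; the label column name is hypothetical): group image indices by class once with NumPy instead of filtering the pandas DataFrame for every generated triplet.

```python
import numpy as np
import pandas as pd


def build_class_index(df, label_column="class"):
    """Map each class id to a NumPy array of row indices, computed once."""
    labels = df[label_column].to_numpy()
    return {c: np.flatnonzero(labels == c) for c in np.unique(labels)}


# Hypothetical usage: the dict lookup replaces per-triplet filters such as
#   df.loc[df["class"] == pos_class]
# which rescan the whole DataFrame every time a triplet is generated.
# class_index = build_class_index(dataframe)
# image_rows = class_index[pos_class]
```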

AGenchev avatar Jan 23 '21 20:01 AGenchev