
Why is the "val" dataset a subset of the "train" dataset?

Open bhooshan-supe-gmail opened this issue 4 years ago • 12 comments

Hi Xiaodong Yang, Zhedong Zheng,

I am planning to use your model in one of our experimental projects as a base model for transfer learning. While studying it, I noticed that your "val" (validation) dataset is a subset of the "train" (training) dataset. (Refer to https://github.com/NVlabs/DG-Net/blob/master/prepare-market.py#L111)

This goes against my understanding, so could you kindly explain why you decided to make the validation dataset a subset of the training dataset?

bhooshan-supe-gmail avatar Mar 12 '20 20:03 bhooshan-supe-gmail

BTW, I am a software engineer at LG Electronics US.

bhooshan-supe-gmail avatar Mar 12 '20 20:03 bhooshan-supe-gmail

Hi @bhooshan-supe-gmail Yes. Since the original dataset does not provide a validation set, we split a validation set from the training set.

layumi avatar Mar 12 '20 23:03 layumi

@layumi I am sorry to be nitpicky, but you have not split the dataset; rather, part of the training dataset is duplicated as the validation dataset. In my own dataset, on the other hand, I have made sure that the training and validation sets are completely disjoint, and the side effect is that my training and validation curves are not converging. Please refer to the following image. [image: training and validation curves]

So I am wondering: is this OK? Is this training reliable?

bhooshan-supe-gmail avatar Mar 13 '20 03:03 bhooshan-supe-gmail

Hi @bhooshan-supe-gmail

  1. Please check this line: https://github.com/NVlabs/DG-Net/blob/master/prepare-market.py#L111. There are no overlapping images between the training and validation sets. If you use train-all, there will be overlapping images.

  2. I do not know how you split the dataset. Actually, there are two ways to split it (both are sketched in code after this list).

  • One easy way is as shown above: we select the first image of every class in the training set as the validation set and evaluate performance in a classification style.

  • Another way is retrieval style. Given the 751 classes in the Market-1501 dataset, we use the first 651 classes as the training set and leave out the remaining 100 classes as the validation set. We can then use the images of those 100 classes as query and gallery to evaluate retrieval performance. However, since the 100 classes have never been seen by the model, it cannot classify images of those classes.
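For concreteness, here is a minimal sketch of both splits, assuming the standard Market-1501 layout where bounding_box_train/ holds files named like 0002_c1s1_000451_03.jpg (the first field is the person ID). The helper names are illustrative, not DG-Net's actual code:

```python
import os

# Group training images by person ID (Market-1501 file names start with the ID).
train_dir = 'Market-1501/bounding_box_train'  # assumed dataset location
by_id = {}
for name in sorted(os.listdir(train_dir)):
    if name.endswith('.jpg'):
        pid = name.split('_')[0]
        by_id.setdefault(pid, []).append(os.path.join(train_dir, name))

def classification_split(by_id):
    """Classification style: the first image of every ID goes to val, the rest to train."""
    train = {pid: paths[1:] for pid, paths in by_id.items()}
    val = {pid: paths[:1] for pid, paths in by_id.items()}
    return train, val

def retrieval_split(by_id, num_train_ids=651):
    """Retrieval style: whole IDs are held out (first 651 train, last 100 val)."""
    pids = sorted(by_id)
    train = {pid: by_id[pid] for pid in pids[:num_train_ids]}
    val = {pid: by_id[pid] for pid in pids[num_train_ids:]}
    return train, val
```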

layumi avatar Mar 13 '20 03:03 layumi

Hi @layumi

To be honest, I am quite new to computer vision and machine learning. Thanks a lot for your guidance!

bhooshan-supe-gmail avatar Mar 13 '20 17:03 bhooshan-supe-gmail

Hi @layumi ,

We have our own but very small dataset (about 21 person IDs, but roughly 1500 images), and I am fine-tuning your model on it. Basically, we are looking into how to re-identify a person from an almost top-down view (a very steep angle) instead of a side and/or front view.
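For concreteness, a minimal transfer-learning sketch for a dataset this small, assuming a generic ImageNet ResNet-50 backbone rather than DG-Net's exact appearance encoder (whose checkpoint format differs); all names here are illustrative:

```python
import torch
import torch.nn as nn
from torchvision import models

num_ids = 21  # our small dataset has ~21 person IDs

model = models.resnet50(pretrained=True)
model.fc = nn.Linear(model.fc.in_features, num_ids)  # new 21-way ID classifier

# With only ~1500 images it usually helps to freeze most of the backbone and
# train just the last block plus the new head at a modest learning rate.
for name, p in model.named_parameters():
    p.requires_grad = name.startswith(('layer4', 'fc'))

optimizer = torch.optim.SGD(
    (p for p in model.parameters() if p.requires_grad),
    lr=0.01, momentum=0.9, weight_decay=5e-4)
```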

bhooshan-supe-gmail avatar Mar 13 '20 17:03 bhooshan-supe-gmail

@bhooshan-supe-gmail You may start from my tutorial, which is more straightforward: https://github.com/layumi/Person_reID_baseline_pytorch/tree/master/tutorial

Also, I recently released a dataset and code for satellite-view, drone-view, and ground-view geo-localization. You are welcome to check it out: https://github.com/layumi/University1652-Baseline

layumi avatar Mar 13 '20 22:03 layumi

> Another way is retrieval style. Given the 751 classes in the Market-1501 dataset, we use the first 651 classes as the training set and leave out the remaining 100 classes as the validation set. We can then use the images of those 100 classes as query and gallery to evaluate retrieval performance. However, since the 100 classes have never been seen by the model, it cannot classify images of those classes.

How would you go about adding this retrieval-style evaluation? Does it make sense here to add retrieval-style evaluation in addition to the classification evaluation, which makes the model classify images into person/object IDs?
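For readers following along, retrieval-style evaluation would look roughly like the sketch below, assuming a feature extractor `model` (e.g. the appearance encoder) and loaders over the held-out IDs. All names are illustrative; DG-Net's own test.py implements the full protocol, and Market-1501's standard same-camera junk filtering is omitted here for brevity:

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def extract_features(model, loader, device='cuda'):
    """Embed every image and L2-normalize so dot products are cosine similarities."""
    feats, pids = [], []
    for imgs, labels in loader:
        f = model(imgs.to(device))                 # (batch, dim) embeddings
        feats.append(F.normalize(f, dim=1).cpu())
        pids.append(labels)
    return torch.cat(feats), torch.cat(pids)

def rank1_and_map(q_feats, q_pids, g_feats, g_pids):
    """Rank-1 accuracy and mean average precision over all queries."""
    scores = q_feats @ g_feats.t()                 # query-gallery similarity matrix
    rank1_hits, aps = 0, []
    for i in range(len(q_pids)):
        order = scores[i].argsort(descending=True)
        matches = (g_pids[order] == q_pids[i]).float()
        rank1_hits += int(matches[0].item())
        hits = matches.cumsum(0)                   # running count of true matches
        precision = hits / torch.arange(1, len(matches) + 1).float()
        aps.append((precision * matches).sum() / matches.sum().clamp(min=1))
    return rank1_hits / len(q_pids), torch.stack(aps).mean().item()
```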

nikky4D avatar Apr 20 '22 14:04 nikky4D

Hi @nikky4D Sorry, what is "00 classes"? Could you provide more description?

layumi avatar Apr 20 '22 14:04 layumi

Sorry, I quoted it incorrectly; please see the edited comment above.

nikky4D avatar Apr 20 '22 15:04 nikky4D

Hi @nikky4D

  1. Validation (classification setting): I wrote it into the training code, so you do not need to modify the split.

  2. Validation (retrieval setting): if you want to evaluate on the 651/100 split (751 IDs in total), you need to modify the data-preparation script to split it. Since the IDs are random, I simply use the first 651 IDs as train and the last 100 IDs as val (see the sketch below). For validation in the retrieval setting, you need to use test.py to evaluate the validation set in the same way as the test set. (The validation result reported during training is not correct.)
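A hypothetical sketch of that data-preparation change, moving the last 100 ID folders out of the training split into a retrieval-style validation set; the directory names are assumptions based on this thread, not DG-Net's exact layout:

```python
import os
import shutil

root = 'Market-1501/pytorch'                  # assumed prepared-data location
train_all = os.path.join(root, 'train_all')

ids = sorted(os.listdir(train_all))           # 751 per-ID folders
train_ids, val_ids = ids[:651], ids[651:]     # first 651 train, last 100 val

for pid in val_ids:
    src = os.path.join(train_all, pid)
    images = sorted(os.listdir(src))
    os.makedirs(os.path.join(root, 'val_query', pid), exist_ok=True)
    os.makedirs(os.path.join(root, 'val_gallery', pid), exist_ok=True)
    # First image of the held-out ID becomes the query; the rest form the gallery.
    shutil.move(os.path.join(src, images[0]),
                os.path.join(root, 'val_query', pid, images[0]))
    for img in images[1:]:
        shutil.move(os.path.join(src, img),
                    os.path.join(root, 'val_gallery', pid, img))
    os.rmdir(src)  # the held-out ID no longer participates in training
```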

layumi avatar Apr 21 '22 02:04 layumi

Thank you. Then, for the teacher training, is it better to use the retrieval split or the classification setting for a more robust DG-Net setup, or does the dataset setup not matter for the final model?

nikky4D avatar Apr 21 '22 03:04 nikky4D