
Reproduce deepdanbooru_v3 in PyTorch

Open tellurion-kanata opened this issue 2 years ago • 8 comments

Hi, I'm deeply impressed by your project and want to use your model in my PyTorch project. But when I reproduced your project in PyTorch, my model performed much worse than yours after training. It is only able to predict high-frequency tags such as "solo, 1girl, highres", and misses tags such as "blue eyes, green hair". It seems to be caused by the long-tailed distribution of the Danbooru dataset, but I cannot fix it.

Have you met similar problems before? Did you solve them with improvements to the loss function or some training tricks?

I'm using the custom deepdanbooru_v3 model and the danbooru2020 dataset with 7000 one-hot labels. The optimizer, loss function, and other settings are the same as your defaults.

Thank you!

tellurion-kanata avatar Oct 23 '21 10:10 tellurion-kanata

Hi. I originally implemented DeepDanbooru using Microsoft's CNTK. TensorFlow has some behaviours that differ from CNTK.

So I tested some parameter changes for TensorFlow:

  1. Increased learning rate: ×1000–5000
  2. Changed learning algorithm: Adam -> SGD

I don't have experience with PyTorch, but I think you should check PyTorch's default behaviour for internal network layers such as conv, pooling, initializers, loss, and so on. They all have different default parameters depending on the library.
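Not from the original discussion, but a minimal PyTorch sketch of the kind of changes and default checks being described might look like this; the resnet50 stand-in backbone, learning rate, and 7000-output head are all assumptions, not DeepDanbooru's actual configuration.

```python
import torch
import torch.nn as nn
from torchvision.models import resnet50

# Hypothetical sketch: resnet50 stands in for the custom deepdanbooru_v3 backbone.
model = resnet50(weights=None)
model.fc = nn.Linear(model.fc.in_features, 7000)   # one output per tag

# 1. Much larger learning rate than a typical Adam default.
# 2. Plain SGD with momentum instead of Adam.
optimizer = torch.optim.SGD(model.parameters(), lr=0.1, momentum=0.9)

# Multi-label training uses sigmoid + binary cross-entropy, not softmax cross-entropy.
criterion = nn.BCEWithLogitsLoss()

# Library defaults worth comparing explicitly: weight initialization
# (PyTorch conv layers default to Kaiming-uniform), BatchNorm eps/momentum,
# and pooling/padding behaviour.
for m in model.modules():
    if isinstance(m, nn.Conv2d):
        nn.init.kaiming_normal_(m.weight, mode="fan_out", nonlinearity="relu")
```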

Also, I noticed that you said "one-hot labels", but DeepDanbooru needs multi-label targets, not multi-class. DeepDanbooru's target vector should contain multiple ones (if the image has multiple tags).
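For illustration, a multi-hot target for one image could be built like the sketch below; the tag names and the shortened tag list are placeholders for the real ~7000-tag vocabulary.

```python
import torch

# Placeholder vocabulary; imagine ~7000 entries in practice.
all_tags = ["1girl", "solo", "red_hair", "black_hair"]
tag_to_index = {t: i for i, t in enumerate(all_tags)}

def make_target(image_tags, num_tags=len(all_tags)):
    """Multi-label ("multi-hot") target: one 1.0 per tag present on the image."""
    target = torch.zeros(num_tags)
    for tag in image_tags:
        if tag in tag_to_index:
            target[tag_to_index[tag]] = 1.0
    return target

print(make_target(["1girl", "red_hair"]))  # tensor([1., 0., 1., 0.])
```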

KichangKim avatar Oct 23 '21 23:10 KichangKim

Hi, thanks for your timely reply. I think there was a little misunderstanding, sorry for my ambiguous description. I label each image with a 7000-channel multi-hot vector, setting 1 for each tag that is present and 0 for the others. I think we do the same thing on this point, according to your Reddit post from two years ago. ex: 1girl, red_hair, black_hair,...1000 other tags...,blah 1,1,0,...,0

So you didn't make specific improvements to the network or training strategy, like assigning larger weights to low-frequency tags, adopting loss functions that suppress the easy negative samples (such as focal loss), or resampling the dataset to make the distribution more balanced?
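To make the question concrete (this is not something the replies above say DeepDanbooru uses), a minimal multi-label focal-loss sketch in PyTorch could look like this; `gamma` and `alpha` are common default values, not values tuned for Danbooru.

```python
import torch
import torch.nn.functional as F

def binary_focal_loss(logits, targets, gamma=2.0, alpha=0.25):
    """Illustrative multi-label focal loss; down-weights easy (mostly negative) examples."""
    bce = F.binary_cross_entropy_with_logits(logits, targets, reduction="none")
    p = torch.sigmoid(logits)
    p_t = p * targets + (1 - p) * (1 - targets)            # prob. of the true label
    alpha_t = alpha * targets + (1 - alpha) * (1 - targets)
    return (alpha_t * (1 - p_t) ** gamma * bce).mean()
```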

I will check the differences in the default settings, thank you!

tellurion-kanata avatar Oct 24 '21 07:10 tellurion-kanata

1girl, red_hair, black_hair,...1000 other tags...,blah 1,1,0,...,0

Oh, it looks fine.

Additionally, I filtered the training dataset, using only images which have 20 or more general tags. Images with fewer than 20 general tags were simply ignored.
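A rough sketch of that filtering step might look like the following; the `general_tags` field name and metadata layout are assumptions about how the danbooru2020 dump is stored, not DeepDanbooru's actual loader.

```python
MIN_GENERAL_TAGS = 20

def filter_posts(posts):
    """Keep only posts with 20 or more general tags; drop the rest."""
    return [p for p in posts if len(p.get("general_tags", [])) >= MIN_GENERAL_TAGS]

# posts = load_danbooru_metadata(...)   # however the metadata is actually loaded
# train_posts = filter_posts(posts)
```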

KichangKim avatar Oct 24 '21 08:10 KichangKim

Thank you for your kind help! I will do the same filtering on my dataset before my next try.

tellurion-kanata avatar Oct 24 '21 09:10 tellurion-kanata

@ydk-tellurion you may want to check out this:

Code & datasets (prepared for release): ShuhongChen/bizarre-pose-estimator

A ResNet-50 trained on this subset classifies better than RF5's.

It proposes a cleaner Danbooru multi-label task focused on anime characters, with a processing guide for danbooru2019:

  - Uninformative target tags are cleaned up (by number of positive tags, under-tagged images, lack of contextual relevance, etc.)
  - Severe class imbalance is handled with a weighting trick, to reduce class-imbalance/long-tailed issues during training
  - A data augmentation strategy avoids certain losses
  - Related image tasks are solved using this backbone's features
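Purely for illustration (not necessarily the weighting used in bizarre-pose-estimator), one common form of such a weighting trick is a per-tag positive weight derived from tag frequency and passed to PyTorch's BCEWithLogitsLoss; the counts below are made up.

```python
import torch
import torch.nn as nn

# Placeholder statistics: number of positive examples per tag, and total images.
tag_counts = torch.tensor([50000.0, 120.0, 8.0])   # e.g. "1girl", "green_hair", a rare tag
num_images = 100000.0

# Rare tags get larger positive weights so their gradients aren't drowned out;
# the weights are capped to avoid exploding losses for extremely rare tags.
pos_weight = (num_images - tag_counts) / tag_counts.clamp(min=1.0)
criterion = nn.BCEWithLogitsLoss(pos_weight=pos_weight.clamp(max=100.0))
```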

koke2c95 avatar Oct 24 '21 10:10 koke2c95

@YHJ2c95 Hi, thanks for the information! I will test it in the following days.

tellurion-kanata avatar Oct 24 '21 10:10 tellurion-kanata

@KichangKim @ydk-tellurion

hi

After surveying Danbooru's tags, I think multi-label classification is not a good fit.

Tags themselves carry semantics, but they are written for humans; as a dataset, this is just an image bucket/collection.

Concepts that cannot be described, or are simply not represented, have a serious effect: they lead to poorly trained models, few usable downstream tasks, or even nothing being learned at all. Perhaps adding some pseudo-labels from unsupervised clustering could give a huge improvement.
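As a rough sketch of that pseudo-label idea (assuming pre-computed backbone features, scikit-learn, and an arbitrary cluster count; none of this is from the thread):

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical: features extracted from some pretrained backbone, one row per image.
features = np.load("image_features.npy")        # shape: (num_images, feature_dim)

kmeans = KMeans(n_clusters=256, n_init=10, random_state=0).fit(features)
pseudo_labels = kmeans.labels_                   # one cluster id per image

# These cluster ids could be appended to the tag vocabulary as extra
# "pseudo tags" for otherwise under-described images.
```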

  - There are tags for non-original characters that are biased, i.e. character traits
  - There are very few valid tags (so why limit Danbooru to 2? there are many "images" and two tags are enough to search)
  - There is no information about the components of a tag, for example
  - The tag set is not strongly delineated and there is repetition of meaning
  - New tags are difficult to synchronise with earlier images

web image classification

I think the Danbooru tagging task is web image classification, which is very different from ImageNet/CAPTCHA. First, ImageNet/CAPTCHA is very close to concrete objects, whereas web images are not like that. Second, if you take away the tagging literally, the whole dataset is just a bucket/collection, with sets of different themes at different intensities.

Extend this further to groups, which can be subsetted and merged. Then apply contrastive learning again: instead of contrasting single image views, turn it into group-to-group contrast.

The inspiration is the "give a bunch of images and then guess the tag" game, and a paper.

koke2c95 avatar Nov 11 '21 16:11 koke2c95

Yes, simple multi-label classification has many limitations for semantic recognition. That is why I removed "copyright tags" from the training data.

KichangKim avatar Nov 12 '21 01:11 KichangKim