Zero-Shot-Detection-via-Vision-and-Language-Knowledge-Distillation
about training time
Hi @llrtt, it seems that you have implemented ViLD image distillation by cropping proposals from the original image and forwarding them to the CLIP image encoder. Since every proposal is resized to 224x224, this could be costly in terms of training time. How did you deal with that, and how long did full training take?
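
To make the concern concrete, here is a minimal sketch of the crop-and-resize step as I understand it. It is not taken from this repository: `crop_and_resize`, the `(x1, y1, x2, y2)` box format, and the NumPy nearest-neighbour resize are my own assumptions, and the actual CLIP forward pass is left out.

```python
import numpy as np

CLIP_INPUT_SIZE = 224  # input resolution mentioned in the question


def crop_and_resize(image, boxes, size=CLIP_INPUT_SIZE):
    """Crop each (x1, y1, x2, y2) box from an HxWxC image and resize it
    to size x size with nearest-neighbour sampling.

    Hypothetical helper for illustration only; assumes at least one box.
    """
    height, width = image.shape[:2]
    crops = []
    for x1, y1, x2, y2 in boxes:
        # Clamp the box to the image and keep at least one pixel per side.
        x1 = int(np.clip(x1, 0, width - 1))
        y1 = int(np.clip(y1, 0, height - 1))
        x2 = int(np.clip(x2, x1 + 1, width))
        y2 = int(np.clip(y2, y1 + 1, height))
        # Source pixel index for every target row and column.
        rows = y1 + (np.arange(size) * (y2 - y1)) // size
        cols = x1 + (np.arange(size) * (x2 - x1)) // size
        crops.append(image[rows[:, None], cols[None, :]])
    # One full-resolution encoder input per proposal.
    return np.stack(crops)


image = np.zeros((480, 640, 3), dtype=np.uint8)
boxes = [(10, 20, 110, 220), (300, 100, 640, 480)]
batch = crop_and_resize(image, boxes)
print(batch.shape)  # (2, 224, 224, 3)
```

Every proposal becomes its own 224x224 input regardless of how small the original box was, so the number of encoder inputs per image equals the number of proposals. That is the cost I am asking about.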