Xiongkun Linghu
I ran the default scripts, but generating 100 samples takes about 15 hours. Is there any way to accelerate the process?
I have tried the FCN pyramid structure 5,2,1 from the paper, but the accuracy on miniImageNet is still 65.5%. Besides, accuracy on tieredImageNet is 70.6% (2% lower), 72.8% (2% lower) on CIFAR-FS,...
I pretrained the model and then used DeepEMD with the default settings, but the 5-way 1-shot accuracy was only 65.5%, which is 1% lower than in the paper.
I trained with the EDL loss on the mini-ImageNet dataset with 64 classes, but the loss does not converge and the accuracy is very low.
I want to calculate the perceptual distance for specific input images. Could you please provide the detailed implementation of this part?
The work is interesting. I want to train my model with your datasets. Could you please provide a more detailed description of the datasets used in Table 1 in the...
I want to generate samples interpolated between two images. Is this covered by the codebase?
Thanks for sharing the work. I notice that the model can output the coordinates of the 3D bounding boxes through numerical values. How can I access this data related to 3D grounding...
Thanks for your interesting work. I visualized the grounded scene caption data and noticed there is a key called 'all_phrases_positions'. What does it mean? I guess the numerical values represent...
I notice referent tokens are interleaved in the output. Can multiple referent tokens appear in a single text prompt, such as "Describe the table and the chair ."?