Jonas Wu

17 comments by Jonas Wu

Hi, we run the code on a V100 with 32G memory. We find it generally needs around 24G, while for some videos containing many objects, it will reach...

For the Transformer decoder, the decoder embedding is the pooled language feature, and the learnable queries serve as the positional embeddings. Please refer to [here](https://github.com/wjn922/ReferFormer/blob/93c8ff5b14d35ab91a4894d0783d2964fd9072f7/models/deformable_transformer.py#L191).
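
A minimal sketch of that query setup, assuming standard DETR-style decoder inputs (all tensor names below are placeholders, not the repo's identifiers):

```python
import torch
import torch.nn as nn

# Sketch: the pooled sentence feature is repeated as the *content* embedding
# for every query, while the learnable query embeddings contribute only the
# positional part.
num_queries, d_model, batch = 5, 256, 2

query_pos = nn.Embedding(num_queries, d_model)     # learnable queries = pos embedding
pooled_text = torch.randn(batch, d_model)          # pooled language feature

tgt = pooled_text.unsqueeze(1).expand(-1, num_queries, -1)   # (B, Nq, C) content
pos = query_pos.weight.unsqueeze(0).expand(batch, -1, -1)    # (B, Nq, C) position
# each decoder layer then forms its queries as tgt + pos (DETR convention)
```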

Hi, I suppose you mean the GPU memory. We also use GPUs with 32G memory, so we think there won't be a problem.

Hi, the pretrained model is indeed different from the joint training model. The pretrained models are trained only on the Ref-COCO/+/g datasets at the image level (setting num_frames=1).

We do not use the pretrained model for joint training. We do not adopt balanced sampling of RefCOCO/+/g and RefYTVOS, though their scales are different.

@zhenghao977 We use 32 V100 GPUs for the pretrained models. The total number of epochs is 12, and the lr drops at the 8th and 10th epochs. The learning rate keeps the same...
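
That schedule maps onto a standard PyTorch step scheduler; a minimal sketch, assuming a 10x decay (the model, optimizer, and base lr below are placeholders, not the repo's settings):

```python
import torch

model = torch.nn.Linear(10, 10)                             # placeholder model
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)  # placeholder base lr
# lr drops after the 8th and 10th epochs, 12 epochs total
scheduler = torch.optim.lr_scheduler.MultiStepLR(optimizer, milestones=[8, 10], gamma=0.1)

for epoch in range(12):
    # ... train one epoch ...
    scheduler.step()
```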

Hi, the official website seems to have removed the test meta_expression recently. We have uploaded the previous version of meta_expression [here](https://drive.google.com/file/d/1xjAwiPZColmGCKUYtMXO-Tc5Zzm1a-sJ/view?usp=sharing).

Inference needs around 24G of memory. All the frames of a video are used during inference, while training uses a clip of only 5 frames, so it is...
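
If that peak is a problem, one generic workaround (not the author's code; `run_in_clips` and `clip_len` are hypothetical names) is to run the model over fixed-length clips and concatenate the per-clip outputs:

```python
import torch

def run_in_clips(model, frames, clip_len=36):
    """Run `model` over a (T, ...) frame tensor in fixed-size clips.

    Assumes per-clip outputs can be concatenated along the time axis, which
    does not hold for models that need full-video temporal context.
    """
    outputs = []
    with torch.no_grad():
        for i in range(0, frames.shape[0], clip_len):
            outputs.append(model(frames[i:i + clip_len]))
    return torch.cat(outputs, dim=0)
```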

Sorry for the very late reply. We have uploaded the JHMDB dataset [here](https://drive.google.com/drive/folders/10EcgRQXQs-ZdBfDDuHLR-zcZo7f5hXbe?usp=sharing).

I have the same problem. In some cases, the prediction becomes 1-d instead of 2-d. How can this problem be solved?
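
One generic guard, assuming the downstream code expects an (N, C) array (a sketch, not a confirmed fix for this repo), is to promote 1-d predictions back to 2-d before indexing:

```python
import numpy as np

def ensure_2d(pred):
    """Promote a 1-d prediction to shape (1, C) so downstream indexing is uniform."""
    pred = np.asarray(pred)
    return pred[None, :] if pred.ndim == 1 else pred
```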