Baifeng Shi comments

Results: 34 comments by Baifeng Shi

Hi, have you compared S2 with [384, 768] scales versus interpolating to 768x768?

Hi, yeah, the paper only compares on segmentation. S2 uses average pooling to resize the large-scale feature map to the regular size. To further reduce the number of tokens, you can...

Hi, have you compared S2 with [384, 768] scales versus interpolating to 768x768?

Hi, `mlp_downsample` will concatenate each adjacent 2x2 group of tokens into a single token. The average pooling is implemented inside S2. S2 will pool the feature map of a large-scale image...
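To make the two operations concrete, here is a minimal PyTorch sketch. It is not the actual `mlp_downsample` or S2 code: the function names `concat_2x2_tokens` and `avg_pool_to_regular`, the square token grid, and the tensor layouts are all assumptions made for illustration.

```python
import math

import torch
import torch.nn.functional as F


def concat_2x2_tokens(tokens: torch.Tensor) -> torch.Tensor:
    """Hypothetical helper: merge each adjacent 2x2 group of tokens into one.

    tokens: (B, H*W, C), assumed to be a square grid in row-major order.
    returns: (B, (H/2)*(W/2), 4*C), i.e. 4x fewer tokens, each 4x wider.
    """
    b, n, c = tokens.shape
    h = w = math.isqrt(n)
    assert h * w == n and h % 2 == 0, "expects a square grid with even side"
    # Split rows and columns into (block index, offset inside the block).
    x = tokens.reshape(b, h // 2, 2, w // 2, 2, c)
    # Bring the two in-block offsets next to the channel axis.
    x = x.permute(0, 1, 3, 2, 4, 5)
    # Flatten each 2x2 block into the channel dimension.
    return x.reshape(b, (h // 2) * (w // 2), 4 * c)


def avg_pool_to_regular(feature_map: torch.Tensor, out_size: int) -> torch.Tensor:
    """Hypothetical helper: average-pool a large-scale feature map
    (B, C, H, W) down to (B, C, out_size, out_size)."""
    return F.adaptive_avg_pool2d(feature_map, out_size)


if __name__ == "__main__":
    tokens = torch.arange(16, dtype=torch.float32).reshape(1, 16, 1)
    print(concat_2x2_tokens(tokens).shape)
    print(avg_pool_to_regular(torch.ones(1, 3, 4, 4), 2).shape)
```

The difference in this sketch is that concatenation keeps every value and grows the channel width, while average pooling keeps the channel width and discards detail inside each window.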

Is there any news about the serving scripts updates?

Hi! @Lyken17 will follow up on this

Dropbox / Google Drive are not available

Here's a full copy of FGVC data (including the train/test split json files) prepared according to the original instructions by the authors, if it helps: https://berkeley.box.com/shared/static/kt2sla80lmrmldiylltwva82tddr2jc3.tar

Different versions of NVILA

NVILA-Lite is designed to optimize efficiency over NVILA while maintaining competitive performance. The main differences between NVILA-Lite and NVILA include that NVILA-Lite uses 3x3 downsample instead of...

Stack error

Hi @Charlesliu77, could you point to which line of the code this issue happens at?

Stack error

Hi, can you try replacing this line with
```
if all([feature.shape[0] == image_features[0].shape[0] for feature in image_features]):
    image_features = torch.stack(image_features, dim=0)
```
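For context, `torch.stack` requires every tensor in the list to have the same shape, which is why the suggested guard helps. The sketch below wraps the same check in a hypothetical helper, `maybe_stack`, which is not part of the repository. The per-image shapes are made up, and, like the suggestion above, the check compares only the first dimension.

```python
import torch


def maybe_stack(image_features):
    """Hypothetical helper: stack only when all first dimensions match.

    Returns a single tensor when stacking is possible, otherwise the
    original list unchanged.
    """
    first_len = image_features[0].shape[0]
    if all(feature.shape[0] == first_len for feature in image_features):
        return torch.stack(image_features, dim=0)
    return image_features


if __name__ == "__main__":
    same = [torch.zeros(4, 8), torch.zeros(4, 8)]
    mixed = [torch.zeros(4, 8), torch.zeros(6, 8)]
    print(type(maybe_stack(same)))   # stacked into one tensor
    print(type(maybe_stack(mixed)))  # left as a list
```

Any caller of such a helper would then need to handle both return types, a tensor or a list.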

Stack error

What is the difference between the NVILA and NVILA-Lite versions?

Hi! NVILA-Lite is designed to optimize efficiency over NVILA while maintaining competitive performance. The main differences between NVILA-Lite and NVILA include that NVILA-Lite uses 3x3 downsample instead...

what is the difference between "NVILA-Lite", "NVILA" and "NVILA-video"?

Hi, please refer to #167 for details and we will update this in our next version of the paper.