Baifeng Shi

Results 34 comments of Baifeng Shi

Hi, yeah the paper only compares on segmentation. S2 uses avg pooling to resize the large-scale feature map to the regular size. To further reduce the number of tokens, you can...
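To make the pooling step concrete, here is a minimal pure-Python sketch of average pooling a larger feature map down to a smaller one. It is not the S2 code: the function name `avg_pool_2d`, the single-channel grid, the non-overlapping windows, and the `factor` parameter are all assumptions for illustration.

```python
def avg_pool_2d(feature_map, factor):
    """Average-pool a 2D grid (list of rows) over non-overlapping
    factor x factor windows. Hypothetical helper, single channel only."""
    height, width = len(feature_map), len(feature_map[0])
    # Assumption: the grid divides evenly by the pooling factor.
    assert height % factor == 0 and width % factor == 0
    pooled = []
    for top in range(0, height, factor):
        row = []
        for left in range(0, width, factor):
            window = [
                feature_map[top + dy][left + dx]
                for dy in range(factor)
                for dx in range(factor)
            ]
            row.append(sum(window) / len(window))
        pooled.append(row)
    return pooled


# A 4x4 "large-scale" map pooled down to a 2x2 "regular-size" map.
large_scale = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
regular = avg_pool_2d(large_scale, factor=2)
```

Each output value is the mean of one 2x2 window, so the number of spatial positions drops by a factor of four here.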

Hi, `mlp_downsample` will concat the adjacent 2x2 tokens into a single token. For the avg pooling, it's implemented inside S2. S2 will pool the feature map of a large-scale image...
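As a rough illustration of concatenating adjacent 2x2 tokens into one token, here is a pure-Python sketch. It is not the actual `mlp_downsample` code: the helper name `concat_2x2_tokens`, the nested-list layout, and the order in which the four tokens are joined are assumptions, and the MLP projection that would follow is not shown.

```python
def concat_2x2_tokens(grid):
    """grid: H x W list of tokens, each token a list of C values.
    Returns an (H/2) x (W/2) grid whose tokens have 4*C values.
    Hypothetical helper; the concatenation order is an assumption."""
    height, width = len(grid), len(grid[0])
    # Assumption: both spatial dimensions are even.
    assert height % 2 == 0 and width % 2 == 0
    merged = []
    for top in range(0, height, 2):
        row = []
        for left in range(0, width, 2):
            # Join the four neighbouring tokens along the channel axis.
            token = (
                grid[top][left]
                + grid[top][left + 1]
                + grid[top + 1][left]
                + grid[top + 1][left + 1]
            )
            row.append(token)
        merged.append(row)
    return merged


# 4x4 grid of 2-channel tokens; each token stores its own (row, col).
grid = [[[r, c] for c in range(4)] for r in range(4)]
merged = concat_2x2_tokens(grid)
```

In this sketch 16 tokens with 2 channels become 4 tokens with 8 channels, so the token count shrinks by 4x while the per-token width grows by 4x.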

Hi! @Lyken17 will follow up on this

Here's a full copy of FGVC data (including the train/test split json files) prepared according to the original instructions by the authors, if it helps: https://berkeley.box.com/shared/static/kt2sla80lmrmldiylltwva82tddr2jc3.tar

NVILA-Lite is designed to optimize efficiency over NVILA while maintaining competitive performance. The main differences between NVILA-Lite and NVILA include that NVILA-Lite uses 3x3 downsample instead of...

Hi @Charlesliu77, could you point to which line of the code this issue happens at?

Hi, can you try replacing this line with
```
if all([feature.shape[0] == image_features[0].shape[0] for feature in image_features]):
    image_features = torch.stack(image_features, dim=0)
```

Hi! NVILA-Lite is designed to optimize efficiency over NVILA while maintaining competitive performance. The main differences between NVILA-Lite and NVILA include that NVILA-Lite uses 3x3 downsample instead...

Hi, please refer to #167 for details and we will update this in our next version of the paper.