Baifeng Shi
Hi, yeah, the paper only compares on segmentation. S2 uses average pooling to resize the large-scale feature map to the regular size. To further reduce the number of tokens, you can...
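For readers who want to see the pooling step concretely, here is a minimal sketch. It is not taken from the S2 code: the function name `pool_to_regular_size` and the tensor shapes are assumptions for illustration, and `F.adaptive_avg_pool2d` is used as a stand-in for whatever pooling call S2 actually makes.

```python
import torch
import torch.nn.functional as F


def pool_to_regular_size(large_feat: torch.Tensor, regular_hw: int) -> torch.Tensor:
    """Average-pool a (B, C, H, W) feature map down to (B, C, regular_hw, regular_hw).

    Hypothetical helper for illustration only; not part of S2.
    """
    return F.adaptive_avg_pool2d(large_feat, output_size=(regular_hw, regular_hw))


# Assumed shapes: the large-scale image gives a 32x32 map, the regular scale gives 16x16.
large = torch.randn(1, 8, 32, 32)
pooled = pool_to_regular_size(large, regular_hw=16)
print(pooled.shape)
```

When the large side is an integer multiple of the regular side, as here, each output location is simply the mean of a non-overlapping window (2x2 in this example), so the token count drops by that factor squared.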
Hi, `mlp_downsample` will concatenate each adjacent 2x2 group of tokens into a single token. The average pooling is implemented inside S2. S2 will pool the feature map of a large-scale image...
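A rough sketch of the 2x2 token concatenation described above, assuming tokens are laid out row-major on a square grid with an even side length. The helper name `concat_2x2_tokens` is hypothetical, and the ordering of the four tokens inside the merged channel dimension is an assumption; the actual `mlp_downsample` projector may order them differently and applies an MLP afterwards.

```python
import math

import torch


def concat_2x2_tokens(tokens: torch.Tensor) -> torch.Tensor:
    """Merge each 2x2 neighborhood of tokens into one token.

    (B, H*W, C) -> (B, (H/2)*(W/2), 4*C). Hypothetical helper, not the VILA code.
    """
    b, n, c = tokens.shape
    h = w = math.isqrt(n)
    assert h * w == n and h % 2 == 0, "expects a square grid with an even side"
    # Split rows and columns into (block index, offset inside the 2x2 block).
    x = tokens.reshape(b, h // 2, 2, w // 2, 2, c)
    # Bring the two block indices together: (B, H/2, W/2, 2, 2, C).
    x = x.permute(0, 1, 3, 2, 4, 5)
    # Flatten each 2x2 block and its channels into a single token.
    return x.reshape(b, (h // 2) * (w // 2), 4 * c)


# A 4x4 grid of one-channel tokens numbered 0..15 in row-major order.
grid = torch.arange(16, dtype=torch.float32).reshape(1, 16, 1)
merged = concat_2x2_tokens(grid)
print(merged.shape)
print(merged[0, 0])
```

The token count drops by 4x while the channel width grows by 4x, so no information is discarded at this step; any compression happens in the MLP that follows.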
Hi! @Lyken17 will follow up on this
Here's a full copy of FGVC data (including the train/test split json files) prepared according to the original instructions by the authors, if it helps: https://berkeley.box.com/shared/static/kt2sla80lmrmldiylltwva82tddr2jc3.tar
NVILA-Lite is designed to be more efficient than NVILA while maintaining competitive performance. The main differences between NVILA-Lite and NVILA include that NVILA-Lite uses 3x3 downsample instead of...
Hi @Charlesliu77, could you point to which line of the code this issue happens at?
Hi, can you try replacing this line with
```
if all(feature.shape[0] == image_features[0].shape[0] for feature in image_features):
    image_features = torch.stack(image_features, dim=0)
```
Hi! NVILA-Lite is designed to be more efficient than NVILA while maintaining competitive performance. The main differences between NVILA-Lite and NVILA include that NVILA-Lite uses 3x3 downsample instead...
Hi, please refer to #167 for details and we will update this in our next version of the paper.