Object Recognition as Next Token Prediction (CVPR 2024)
arXiv | Colab | Documentation | Hugging Face
Top 30 predictions with probabilities from our model on the image of "The Legend of Zelda: Tears of the Kingdom" [^1].
Introduction
This is the official PyTorch implementation of the paper Object Recognition as Next Token Prediction, accepted at CVPR 2024 (Highlight). If you find this work useful, please cite:
```bibtex
@inproceedings{nxtp,
  title     = {{Object Recognition as Next Token Prediction}},
  author    = {Kaiyu Yue and Bor-Chun Chen and Jonas Geiping and Hengduo Li and Tom Goldstein and Ser-Nam Lim},
  booktitle = {Computer Vision and Pattern Recognition Conference (CVPR)},
  year      = {2024}
}
```
Updates
May 26, 2024
- add ImageNet experiments: see src/imagenet
- visualize attention maps in decoder layers during inference: see examples
Mar 17, 2024
- release the best 1.78B model trained on G70M
- export onnx models: docs/onnx-export
Mar 03, 2024
- add examples with top-20 predictions to this readme
- add CLIP ViT-L/14 as the textual embedding model in the evaluation metric (Table A.8 of the paper)
Method
This project tackles a fundamental problem in computer vision: object recognition, i.e., translating an image into object labels.
Linear models (such as ResNet) and contrastive models (such as CLIP) require a predefined label set before inference, which limits their flexibility in real-world applications.
We extend the classifier weight matrix W to cover the entire textual space by using a language model's token embeddings, e.g., LLaMA's 32K-token vocabulary, so that the model predicts labels in a genuinely open manner through auto-regressive decoding.
Additionally, our one-shot sampling technique enables efficient large-scale discriminative prediction, such as producing the top-100 labels in parallel.
The released models have 1.78B parameters. A truncated 0.77B-parameter variant, which keeps only one transformer block in the decoder, still achieves competitive performance (Table 3 in the paper).
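As a rough illustration of the one-shot idea (a minimal sketch with illustrative names, not the repo's implementation): a single decoder forward pass conditioned on the image yields one next-token distribution over the whole vocabulary, whose top-k tokens seed k label candidates at once instead of being generated one label at a time.

```python
import torch

# Conceptual sketch of one-shot sampling (illustrative names only, not
# the repo's API): the top-k tokens of a single next-token distribution
# seed k label candidates in parallel.
def one_shot_topk(next_token_logits: torch.Tensor, k: int = 100):
    # next_token_logits: (vocab_size,) logits for the first label token,
    # conditioned on the image
    probs = torch.softmax(next_token_logits, dim=-1)
    return torch.topk(probs, k)  # .values = probs, .indices = token ids

# usage with dummy logits over a LLaMA-sized 32K-token vocabulary
top = one_shot_topk(torch.randn(32_000), k=100)
print(top.values[:5], top.indices[:5])
```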
Examples
Each row below lists the model's top-20 predictions for one example image (images omitted here; see the footnotes for credits), together with the decoder layer and head whose attention map is visualized.

| Image | Top-20 predictions (label, prob) | Attention map |
|---|---|---|
| [^1] | legend 0.13949, sky 0.12399, cloud 0.04723, game 0.04642, screenshot 0.04500, top 0.03189, mountain 0.03024, cliff 0.02262, world 0.01790, wii 0.01483, video 0.01440, breath 0.01310, zeo 0.01087, zelda 0.00982, character 0.00959, rock 0.00865, link 0.00816, island 0.00788, adventure 0.00624, woman 0.00591 | decoder layer 0, head 25 |
| [^2] | rocket 0.23237, launch 0.10435, soyuz 0.06144, space 0.04314, smoke 0.03541, sky 0.03249, shuttle 0.01971, tower 0.01566, paris 0.01551, cloud 0.01229, pad 0.01067, cape 0.01050, falcon 0.00983, photo 0.00956, lift 0.00834, air 0.00814, mission 0.00779, station 0.00710, july 0.00688, satellite 0.00647 | decoder layer 0, head 0 |
| [^3] | dog 0.30731, sweater 0.13647, hat 0.11870, scarf 0.06812, brick 0.04131, wall 0.03114, shirt 0.01796, cute 0.01471, cap 0.01156, neck 0.00982, top 0.00929, head 0.00797, beanie 0.00777, man 0.00658, sits 0.00588, coat 0.00582, jacket 0.00524, collar 0.00476, face 0.00460, bone 0.00119 | decoder layer 0, head 25 |
| [^6] | coffee 0.14861, shop 0.10409, counter 0.08065, bar 0.04603, restaurant 0.04055, inside 0.03691, area 0.03468, store 0.02638, table 0.02219, interior 0.01930, lot 0.01347, food 0.01156, customer 0.01058, room 0.01001, starbucks 0.00923, bakery 0.00853, view 0.00738, floor 0.00738, cafe 0.00733, shelf 0.00633 | decoder layer 0, head 8 |
| [^3] | monster 0.47652, cartoon 0.09664, character 0.03812, group 0.03724, creature 0.03312, cute 0.02111, vector 0.01929, animal 0.01481, art 0.00955, alien 0.00924, pose 0.00837, bubble 0.00604, eye 0.00553, color 0.00533, hand 0.00528, design 0.00477, wallpaper 0.00474, child 0.00462, people 0.00445, family 0.00445 | decoder layer 2, head 7 |
| [^3] | cloud 0.54375, word 0.09932, sky 0.07571, letter 0.03153, sora 0.01862, logo 0.01380, text 0.00995, top 0.00715, blue 0.00715, title 0.00677, photo 0.00608, picture 0.00427, sonora 0.00288, middle 0.00269, storm 0.00257, cloudscape 0.00202, sun 0.00190, art 0.00189, soar 0.00156, icy 0.00041 | decoder layer 1, head 13 |
| [^3] | building 0.15317, wave 0.13619, room 0.04782, middle 0.03498, hall 0.03188, people 0.02367, ocean 0.02135, floor 0.02087, world 0.01867, inside 0.01773, man 0.01548, water 0.01380, view 0.01205, surfer 0.01200, photo 0.01109, hotel 0.00798, city 0.00734, pool 0.00662, art 0.00566, mural 0.00319 | decoder layer 1, head 16 |
| [^3] | bird 0.25673, feather 0.21676, peacock 0.18550, head 0.04251, blue 0.03240, pigeon 0.02507, tail 0.02183, hair 0.01339, top 0.01187, face 0.00677, camera 0.00631, beak 0.00463, eye 0.00451, fence 0.00419, sits 0.00370, perch 0.00333, photo 0.00330, wall 0.00318, animal 0.00269, jay 0.00106 | decoder layer 1, head 25 |
| [^5] | tablet 0.07247, coffee 0.06770, window 0.06562, controller 0.05829, game 0.05668, switch 0.04802, wii 0.04043, console 0.03798, cup 0.03563, top 0.02570, mug 0.02067, screen 0.01808, video 0.01344, star 0.01105, nintendo 0.01092, computer 0.01055, mario 0.00819, remote 0.00815, control 0.00736, sill 0.00393 | decoder layer 0, head 12 |
| [^4] | airplane 0.36523, cargo 0.09151, plane 0.07531, ship 0.05538, container 0.04223, water 0.03105, view 0.03040, dock 0.02277, port 0.01685, sky 0.01434, shipping 0.01328, middle 0.00788, body 0.00751, photo 0.00717, jet 0.00715, city 0.00714, ocean 0.00621, freight 0.00615, boat 0.00609, transportation 0.00320 | decoder layer 2, head 14 |
| [^4] | candy 0.15236, sweater 0.12271, glass 0.11457, dog 0.10593, chair 0.08311, cane 0.07111, sunglass 0.04701, christmas 0.04589, costume 0.02361, wearing 0.02085, hat 0.01870, head 0.00734, top 0.00636, outfit 0.00577, chocolate 0.00520, holi 0.00437, suit 0.00362, shirt 0.00344, strawberry 0.00322, wig 0.00211 | decoder layer 1, head 16 |
| [^4] | living 0.19960, room 0.16291, sofa 0.11353, couch 0.06036, rug 0.04741, coffee 0.04704, dog 0.03795, wall 0.03659, table 0.02980, floor 0.01611, grey 0.01594, wood 0.01472, furniture 0.01353, plant 0.01314, fireplace 0.01274, pillow 0.01161, chair 0.00941, home 0.00512, blanket 0.00434, art 0.00351 | decoder layer 1, head 16 |
Models
The following table shows the reproduced recall (the R column in Table 1 of the paper) on the validation splits with top-10 predictions.
| # params | training group | checkpoint | md5 | CC3M | COCO | OpenImages |
|---|---|---|---|---|---|---|
| 1.78B | G3M | Hugging Face | b2a69b | 0.740 | 0.703 | 0.616 |
| 1.78B | G70M | Hugging Face | e177c7 | 0.721 | 0.765 | 0.662 |
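To sanity-check a download against the md5 column above (assuming that column is a prefix of the checkpoint file's md5 hash; the file path below is a placeholder), a streaming hash in Python:

```python
import hashlib

# stream the file so a multi-GB checkpoint is never fully loaded into memory
def md5_of(path: str, chunk: int = 1 << 20) -> str:
    h = hashlib.md5()
    with open(path, "rb") as f:
        while block := f.read(chunk):
            h.update(block)
    return h.hexdigest()

# placeholder path; compare the first 6 hex digits against the table above
print(md5_of("path/to/model/checkpoint")[:6])  # expect e.g. e177c7 for G70M
```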
Downloading
The checkpoints can be downloaded from the links in the table above. For downloading from Hugging Face, one option is to use git-lfs:
```bash
# install git lfs
git lfs install

# download the checkpoint in terminal
git clone https://huggingface.co/kaiyuyue/nxtp
```
Alternatively, the checkpoints can be downloaded directly from the model page in a web browser.
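As another option, the huggingface_hub Python client can fetch the repository programmatically; this is a minimal sketch using the standard Hub API, not something the repo itself documents:

```python
from huggingface_hub import snapshot_download

# fetch the full model repo into the local Hugging Face cache;
# the repo id matches the git clone URL above
local_dir = snapshot_download(repo_id="kaiyuyue/nxtp")
print(local_dir)
```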
Inference
A sample image assets/starbux.jpg is provided for a quick test. First, follow the instructions in Dependencies to prepare the environment.
To infer an image, please run
```bash
python src/infer.py \
  --ckpt-path path/to/model/checkpoint \
  --img-path assets/starbux.jpg \
  --num-labels 20
```
The output from the model trained on G3M will be:
```
top-20 predictions:
| prob: 0.05742 - coffee
| prob: 0.05525 - restaurant
| prob: 0.04402 - shop
| prob: 0.02528 - room
| prob: 0.02468 - store
| prob: 0.02381 - interior
| prob: 0.01732 - area
| prob: 0.01640 - building
| prob: 0.01616 - food
| prob: 0.01408 - bar
| prob: 0.01247 - customer
| prob: 0.01134 - view
| prob: 0.01059 - floor
| prob: 0.01045 - table
| prob: 0.00933 - kitchen
| prob: 0.00926 - home
| prob: 0.00872 - look
| prob: 0.00841 - people
| prob: 0.00693 - cup
| prob: 0.00665 - counter
```
The output from the model trained on G70M is:
```
top-20 predictions:
| prob: 0.15203 - coffee
| prob: 0.09728 - shop
| prob: 0.09182 - counter
| prob: 0.03848 - interior
| prob: 0.03389 - bar
| prob: 0.03215 - restaurant
| prob: 0.02440 - table
| prob: 0.02245 - store
| prob: 0.01950 - area
| prob: 0.01905 - inside
| prob: 0.01590 - starbucks
| prob: 0.01313 - cafe
| prob: 0.01220 - chair
| prob: 0.01172 - floor
| prob: 0.01020 - cup
| prob: 0.00879 - drink
| prob: 0.00794 - room
| prob: 0.00746 - customer
| prob: 0.00635 - wood
| prob: 0.00345 - bakery
```
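To label a whole folder of images, a small wrapper can shell out to the same CLI. This is a sketch that uses only the flags shown above; both paths are placeholders:

```python
import subprocess
from pathlib import Path

# run src/infer.py on every image in a folder, using only the
# documented flags; paths below are placeholders
ckpt = "path/to/model/checkpoint"
for img in sorted(Path("path/to/images").glob("*.jpg")):
    print(f"== {img} ==")
    subprocess.run(
        ["python", "src/infer.py",
         "--ckpt-path", ckpt,
         "--img-path", str(img),
         "--num-labels", "20"],
        check=True,
    )
```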
License
This project is under the CC-BY-NC 4.0 license. See LICENSE for details.
[^1]: Image credit: The Legend of Zelda: Tears of the Kingdom.
[^2]: Image credit: SpaceX.
[^3]: Image credit: OpenAI Sora.
[^4]: Image credit: Demo in Segment Anything | Meta AI.
[^5]: Image credit: Super Mario Bros. Wonder.
[^6]: Image credit: Photo taken by the author at a Starbucks store.