RS-TransCLIP
Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification - [ICASSP 2025]
Welcome to the GitHub repository for Enhancing Remote Sensing Vision-Language Models for Zero-Shot Scene Classification.
Authors:
K. El Khoury*, M. Zanella*, B. Gérin*, T. Godelaine*, B. Macq, S. Mahmoudi, C. De Vleeschouwer, I. Ben Ayed
*Denotes equal contribution
Updates
- Paper accepted to ICASSP 2025. [December 20, 2024]
- Paper uploaded on arXiv. [September 1, 2024]
We introduce RS-TransCLIP, a transductive approach inspired by TransCLIP that enhances Remote Sensing Vision-Language Models without requiring any labels, adding only a negligible computational cost to the overall inference time.
Figure 1: Top-1 accuracy of RS-TransCLIP, on ViT-L/14 Remote Sensing Vision-Language Models, for zero-shot scene classification across 10 benchmark datasets.
Contents
- Setup
- Datasets
- User Manual
- Citations
- Contributing
- Coming Soon
Setup
NB: the Python version used is 3.10.12.
Create a virtual environment and activate it:
# Example using the virtualenv package on linux
python3 -m pip install --user virtualenv
python3 -m virtualenv RS-TransCLIP-venv
source RS-TransCLIP-venv/bin/activate  # or bin/activate.csh if your shell is csh/tcsh
Install PyTorch:
pip3 install torch==2.2.2 torchaudio==2.2.2 torchvision==0.17.2
Clone the GitHub repository and move into it:
git clone https://github.com/elkhouryk/RS-TransCLIP
cd RS-TransCLIP
Install the remaining Python package requirements:
pip3 install -r requirements.txt
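Optionally, you can sanity-check the environment before moving on. A minimal check, only assuming the PyTorch install above:
# Quick environment sanity check (optional)
import torch

print(f"PyTorch version: {torch.__version__}")          # expected: 2.2.2
print(f"CUDA available:  {torch.cuda.is_available()}")  # feature generation is much faster on a GPU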
You are ready to start!
Datasets
10 Remote Sensing Scene Classification datasets are already available for evaluation:
- The WHURS19 dataset is already uploaded to the repository for reference and can be used directly.
- The following 6 datasets (EuroSAT, OPTIMAL31, PatternNet, RESISC45, RSC11, RSICB256) are automatically downloaded and formatted from Hugging Face using the run_dataset_download.py script (a convenience loop over all six is sketched after the command below).
# <dataset_name> can take the following values: EuroSAT, OPTIMAL31, PatternNet, RESISC45, RSC11, RSICB256
python3 run_dataset_download.py --dataset_name <dataset_name>
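If you want to fetch all six Hugging Face datasets in one go, a small wrapper loop around the script works; this is only a convenience sketch, assumed to be run from the repository root:
# download_all.py -- hypothetical convenience wrapper around run_dataset_download.py
import subprocess
import sys

HF_DATASETS = ["EuroSAT", "OPTIMAL31", "PatternNet", "RESISC45", "RSC11", "RSICB256"]

for name in HF_DATASETS:
    print(f"Downloading and formatting {name}...")
    subprocess.run(
        [sys.executable, "run_dataset_download.py", "--dataset_name", name],
        check=True,  # stop immediately if one dataset fails
    )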
Dataset directory structure should be as follows:
$datasets/
└── <dataset_name>/
    ├── classes.txt
    ├── class_changes.txt
    └── images/
        ├── <classname>_<id>.jpg
        └── ...
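To check that a formatted dataset matches this layout, here is a short verification sketch (the dataset path and the <classname>_<id>.jpg parsing below are illustrative; exact file naming may differ per dataset):
# verify_dataset.py -- hypothetical layout check for a formatted dataset
from collections import Counter
from pathlib import Path

dataset = Path("datasets/WHURS19")  # any formatted dataset folder

classes = [c.strip() for c in (dataset / "classes.txt").read_text().splitlines() if c.strip()]
images = sorted((dataset / "images").glob("*.jpg"))

# Image files are expected to be named <classname>_<id>.jpg
counts = Counter(img.stem.rsplit("_", 1)[0] for img in images)

print(f"{len(classes)} classes, {len(images)} images")
for cls in classes:
    print(f"  {cls:30s} {counts.get(cls, 0):5d} images")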
- The AID, MLRSNet and RSICB128 datasets must be downloaded manually from Kaggle and placed in the 'datasets/' directory. You can either format them manually to follow the directory structure listed above and use them for evaluation, OR place the .zip files from Kaggle in the 'datasets/' directory and use the run_dataset_formatting.py script.
# <dataset_name> can take the following values: AID, MLRSNet, RSICB128
python3 run_dataset_formatting.py --dataset_name <dataset_name>
- Download links: AID | RSICB128 | MLRSNet --- NB: On the Kaggle website, click on the download arrow in the center of the page instead of the Download button to preserve the data structure needed by the run_dataset_formatting.py script (see the figure below).
Notes:
- The class_changes.txt file inserts a space between combined class names. For example, the class name "railwaystation" becomes "railway station". This change is applied consistently across all datasets (a purely illustrative reading sketch follows these notes).
- The WHURS19 dataset is already uploaded to the repository for reference.
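The exact format of class_changes.txt is defined by the repository scripts; purely as an illustration of the idea, assuming one "original_name corrected_name" pair per line, the renaming could be applied like this:
# Hypothetical illustration only -- the real file format is defined by the repository scripts
from pathlib import Path

dataset = Path("datasets/WHURS19")

changes = {}
for line in (dataset / "class_changes.txt").read_text().splitlines():
    parts = line.split(maxsplit=1)  # e.g. "railwaystation railway station"
    if len(parts) == 2:
        changes[parts[0]] = parts[1]

classnames = [c.strip() for c in (dataset / "classes.txt").read_text().splitlines() if c.strip()]
readable = [changes.get(name, name) for name in classnames]
print(readable)  # class names with spaces restored, e.g. "railway station"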
User Manual
Running RS-TransCLIP consists of three major steps:
- Generating Image and Text Embeddings
- Generating the Average Text Embedding
- Running Transductive Zero-Shot Classification
We consider 10 scene classification datasets (AID, EuroSAT, MLRSNet, OPTIMAL31, PatternNet, RESISC45, RSC11, RSICB128, RSICB256, WHURS19), 4 VLMs (CLIP, GeoRSCLIP, RemoteCLIP, SkyCLIP50) and 4 model architectures (RN50, ViT-B-32, ViT-L-14, ViT-H-14) for our experiments.
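The full experiment grid can be expressed as three lists, which is handy if you want to script over every dataset/VLM/architecture trio yourself (note that not every VLM is necessarily released in all four architectures; the names below are exactly those listed above):
# Experiment grid, as listed above
DATASETS = ["AID", "EuroSAT", "MLRSNet", "OPTIMAL31", "PatternNet",
            "RESISC45", "RSC11", "RSICB128", "RSICB256", "WHURS19"]
MODELS = ["CLIP", "GeoRSCLIP", "RemoteCLIP", "SkyCLIP50"]
ARCHITECTURES = ["RN50", "ViT-B-32", "ViT-L-14", "ViT-H-14"]

for dataset in DATASETS:
    for model in MODELS:
        for arch in ARCHITECTURES:
            print(dataset, model, arch)  # one dataset/VLM/architecture trio per line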
Generating Image and Text Embeddings
To generate Image embeddings for each dataset/VLM/architecture trio:
python3 run_featuregeneration.py --image_fg
To generate Text embeddings for each dataset/VLM/architecture trio:
python3 run_featuregeneration.py --text_fg
All results for each dataset/VLM/architecture trio will be stored as follows:
$results/
└── <dataset_name>/
    └── <model_name>/
        └── <model_architecture>/
            ├── images.pt
            ├── classes.pt
            ├── texts_<prompt1>.pt
            ├── ...
            └── texts_<prompt106>.pt
Notes:
- Generating Text embeddings produces 106 individual text embeddings for each VLM/dataset combination; the exhaustive list of all text prompts can be found in run_featuregeneration.py.
- When generating Image embeddings, the run_featuregeneration.py script will also generate the ground truth labels and store them in "classes.pt". These labels will be used for evaluation.
- Please refer to run_featuregeneration.py to control all the respective arguments.
- The embeddings for the WHURS19 dataset are already uploaded to the repository for reference.
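Under the hood, feature generation amounts to encoding every image and every text prompt with the chosen VLM and L2-normalising the embeddings. A minimal sketch of this idea using the open_clip API (the model, prompt template, image path and output filenames here are illustrative, not the repository's exact code):
# Illustrative embedding generation with open_clip (not the repository's exact pipeline)
import torch
import open_clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _, preprocess = open_clip.create_model_and_transforms("ViT-L-14", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-L-14")
model = model.to(device).eval()

classnames = ["airport", "beach", "forest"]                      # read from classes.txt in practice
prompts = [f"a satellite photo of a {c}." for c in classnames]   # one of many prompt templates

with torch.no_grad():
    image = preprocess(Image.open("path/to/some_image.jpg").convert("RGB")).unsqueeze(0).to(device)
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(tokenizer(prompts).to(device))

# L2-normalise (standard CLIP practice) and save as .pt, mirroring images.pt / texts_<prompt>.pt
image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
torch.save(image_feat.cpu(), "images_example.pt")
torch.save(text_feat.cpu(), "texts_example.pt")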
Generating the Average Text Embedding
To generate the Average Text embedding for each dataset/VLM/architecture trio:
python3 run_averageprompt.py
Notes:
- The run_averageprompt.py script averages all embeddings matching the name pattern "texts_*.pt" for each dataset/VLM/architecture trio and creates a file called "texts_averageprompt.pt" (a conceptual sketch follows these notes).
- The Average Text embeddings for the WHURS19 dataset are already uploaded to the repository for reference.
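Conceptually, the averaging step loads every texts_*.pt file of a trio, averages them, and re-normalises. A minimal sketch of that operation, assuming each file holds a (num_classes, dim) tensor and using an illustrative trio path (see run_averageprompt.py for the actual implementation):
# Illustrative prompt averaging (see run_averageprompt.py for the actual implementation)
from pathlib import Path
import torch

trio_dir = Path("results/WHURS19/GeoRSCLIP/ViT-L-14")  # one dataset/VLM/architecture trio

prompt_files = [p for p in sorted(trio_dir.glob("texts_*.pt"))
                if p.name != "texts_averageprompt.pt"]  # skip a previously generated average

# Stack the per-prompt class embeddings, average them, then re-normalise
stacked = torch.stack([torch.load(p) for p in prompt_files])  # (num_prompts, num_classes, dim)
average = stacked.mean(dim=0)
average = average / average.norm(dim=-1, keepdim=True)

torch.save(average, trio_dir / "texts_averageprompt.pt")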
Running Transductive Zero-Shot Classification
To run Transductive zero-shot classification using RS-TransCLIP:
python3 run_TransCLIP.py
Notes:
- The run_TransCLIP.py script will use the Image embeddings "images.pt", the Average Text embedding "texts_averageprompt.pt" and the class ground truth labels "classes.pt" to run Transductive zero-shot classification using RS-TransCLIP.
- The run_TransCLIP.py script will also generate the Inductive zero-shot classification results for performance comparison (the inductive baseline is sketched after these notes).
- Both Inductive and Transductive results will be stored in "results/results_averageprompt.csv".
- The results for the WHURS19 dataset are already uploaded to the repository for reference.
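For intuition, the inductive baseline reported by run_TransCLIP.py is plain nearest-class assignment between the image embeddings and the average text embedding; RS-TransCLIP then refines these predictions transductively over the whole unlabeled test set. A minimal sketch of the baseline only, assuming the .pt files produced by the previous steps and integer class labels in classes.pt (the transductive update itself lives in run_TransCLIP.py):
# Illustrative inductive zero-shot baseline (the transductive refinement is in run_TransCLIP.py)
import torch

trio_dir = "results/WHURS19/GeoRSCLIP/ViT-L-14"  # one dataset/VLM/architecture trio

image_feats = torch.load(f"{trio_dir}/images.pt")              # (num_images, dim), L2-normalised
text_feats = torch.load(f"{trio_dir}/texts_averageprompt.pt")  # (num_classes, dim), L2-normalised
labels = torch.load(f"{trio_dir}/classes.pt")                  # (num_images,) ground-truth indices

# Cosine similarity between every image and every class text embedding
similarity = image_feats @ text_feats.T                        # (num_images, num_classes)
predictions = similarity.argmax(dim=-1)

accuracy = (predictions == labels).float().mean().item()
print(f"Inductive top-1 accuracy: {100 * accuracy:.2f}%")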
Table 1: Top-1 accuracy for zero-shot scene classification without (white) and with (blue) RS-TransCLIP on 10 RS datasets.
Citations
Support our work by citing our paper if you use this repository:
@inproceedings{el2025enhancing,
title={Enhancing remote sensing vision-language models for zero-shot scene classification},
author={El Khoury, Karim and Zanella, Maxime and G{\'e}rin, Beno{\^\i}t and Godelaine, Tiffanie and Macq, Beno{\^\i}t and Mahmoudi, Sa{\"\i}d and De Vleeschouwer, Christophe and Ben Ayed, Ismail},
booktitle={ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
pages={1--5},
year={2025},
organization={IEEE}
}
Please also consider citing the original TransCLIP paper:
@article{zanella2024boosting,
title={Boosting vision-language models with transduction},
author={Zanella, Maxime and G{\'e}rin, Beno{\^\i}t and Ben Ayed, Ismail},
journal={Advances in Neural Information Processing Systems},
volume={37},
pages={62223--62256},
year={2024}
}
For more details on transductive inference in VLMs, visit the comprehensive TransCLIP repository.
Contributing
Feel free to open an issue or pull request if you have any questions or suggestions.
You can also contact us by Email:
[email protected]
[email protected]
[email protected]
[email protected]