
PyTorch implementation of the ICASSP 2024 paper: "Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation"

Audio Captioning with BEATs, Conformer & BART

Winning model of DCASE Challenge 2023 Task 6A, with the follow-up publication:

  • Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation
    Shih-Lun Wu, Xuankai Chang, Gordon Wichern, Jee-weon Jung, François Germain, Jonathan Le Roux, and Shinji Watanabe
    Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP) 2024
    [arXiv page] [DCASE results]
  • BibTeX citation
    @inproceedings{wu2024improving,
      title={Improving Audio Captioning Models with Fine-grained Audio Features, Text Embedding Supervision, and LLM Mix-up Augmentation},
      author={Wu, Shih-Lun and Chang, Xuankai and Wichern, Gordon and Jung, Jee-weon and Germain, Fran{\c{c}}ois and Le Roux, Jonathan and Watanabe, Shinji},
      booktitle={Proc. Int. Conf. on Acoustics, Speech and Signal Processing (ICASSP)},
      year={2024}
    }
    

Install Packages

  • (Recommended) Create a Conda environment with Python 3.9
  • Install PyTorch with the CUDA version matching your system
  • Install the dependencies for the SPICE metric
    cd caption_evaluation_tools/coco_caption
    bash get_stanford_models.sh
    cd ../../
    
  • Install other dependencies
    pip install -r requirements.txt
    

Download Dataset & Pretrained Model

  • Install p7zip (required to unpack the dataset)
    # if using conda
    conda install bioconda::p7zip
    # if installing to system
    # sudo apt-get install p7zip-full
    
  • Download the Clotho dataset
    bash download_clotho.sh
    
  • Install Git-LFS
    # if using conda
    conda install conda-forge::git-lfs
    git-lfs install
    
    # if installing to system
    # curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | sudo bash
    # sudo apt-get install git-lfs
    # git-lfs install
    
  • Download the pretrained model (hosted on Hugging Face)
    bash download_model.sh
    

Reproduce Best Model Results

  • Run inference & evaluation code
    bash run_sampling_reranking.sh
    
    • Metrics are then written to ./exp/inference_evaluation_nucleus_t0.5_p95/inference_metrics.json
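The directory name reflects the decoding setup: nucleus (top-p) sampling with temperature 0.5 and p = 0.95, followed by reranking. As a rough illustration of how top-p filtering restricts the sampling pool (a self-contained sketch, not the repository's actual decoding code), consider:

```python
import math

def top_p_filter(logits, temperature=0.5, p=0.95):
    """Return indices of the smallest set of tokens whose cumulative
    (temperature-scaled) probability reaches p. Illustrative only."""
    # Temperature-scaled softmax over the logits
    scaled = [l / temperature for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Accumulate tokens in descending-probability order until mass >= p
    order = sorted(range(len(probs)), key=lambda i: -probs[i])
    kept, cum = [], 0.0
    for i in order:
        kept.append(i)
        cum += probs[i]
        if cum >= p:
            break
    return kept  # candidate token indices to sample from

# Sharply peaked logits -> only the top two tokens survive the filter
print(top_p_filter([2.0, 1.0, 0.1, -1.0]))  # → [0, 1]
```

Lowering the temperature sharpens the distribution, so fewer tokens survive the p = 0.95 cutoff and sampling stays closer to greedy decoding.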

(Bonus) Augmented Dataset

Our 50K ChatGPT-generated mix-up caption augmentations (see Section 2.3 of the paper for details) are available at:

  • https://huggingface.co/datasets/slseanwu/clotho-chatgpt-mixup-50K
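The mix-up augmentation prompts ChatGPT to merge two Clotho captions into a single new caption. The exact prompt used in the paper is not reproduced here; the sketch below only illustrates the idea with a hypothetical prompt template:

```python
def build_mixup_prompt(caption_a, caption_b):
    """Build an LLM prompt asking for a merged caption.
    Hypothetical wording -- not the paper's actual prompt."""
    return (
        "Combine the two audio captions below into a single plausible "
        "caption describing one audio clip:\n"
        f"1. {caption_a}\n"
        f"2. {caption_b}"
    )

print(build_mixup_prompt("rain falls on a tin roof",
                         "a dog barks in the distance"))
```

The released dataset contains 50K such merged captions, so you can use it directly without re-querying an LLM.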

Acknowledgements

Our model and repository would not have been possible without the following great open-source works. Thank you so much!

  • Clotho dataset: https://zenodo.org/records/4783391
  • BEATs audio encoder: https://github.com/microsoft/unilm/tree/master/beats
  • INSTRUCTOR LM embeddings: https://github.com/xlang-ai/instructor-embedding
  • Evaluation tools
    • coco-caption: https://github.com/tylin/coco-caption
    • caption-evaluation-tools: https://github.com/audio-captioning/caption-evaluation-tools
    • fense: https://github.com/felixgontier/dcase-2023-baseline/tree/main/fense