
Multimodal Analytics for Real-world News using Measures of Cross-modal Entity Consistency

This is the official GitHub page for the papers:

Eric Müller-Budack, Jonas Theiner, Sebastian Diering, Maximilian Idahl, and Ralph Ewerth. "Multimodal Analytics for Real-world News using Measures of Cross-modal Entity Consistency". In: International Conference on Multimedia Retrieval, ICMR 2020, Dublin, Ireland, June 8-11, 2020. ACM, 2020, pp. 16–25. DOI: https://doi.org/10.1145/3372278.3390670

Eric Müller-Budack, Jonas Theiner, Sebastian Diering, Maximilian Idahl, Sherzod Hakimov, and Ralph Ewerth. "Multimodal news analytics using measures of cross-modal entity and context consistency". In: International Journal of Multimedia Information Retrieval 10.2 (2021), Springer, pp. 111–125. DOI: https://doi.org/10.1007/s13735-021-00207-4

Supplemental Material

You can find the supplemental material here: supplemental_material

News

3rd June 2020:

  • Pre-release of the TamperedNews and News400 datasets with links to news texts, news images, untampered and tampered entity sets, and reference images for all entities.
  • Splits for validation and testing
  • Download script to crawl news texts

4th June 2020:

  • Full release of the TamperedNews and News400 datasets, including the visual and textual features used in the paper.
  • Inference scripts and config files, including the parameters used in the paper, to reproduce the results for context and entity verification.

5th June 2020:

  • Download script that automatically generates the whole dataset with the intended project structure
  • Docker container
  • Source code for textual feature extraction

22nd June 2020:

  • Image crawler to obtain the news and reference images.

24th June 2020:

  • Source code for visual feature extraction

13th January 2021:

  • Added instructions to run docker with GPU support

16th July 2021:

  • Added crawler to download reference images from Bing
  • Added functions for named entity recognition and linking
  • Added inference script and examples (Link)

17th May 2022:

  • Datasets published at https://doi.org/10.25835/0084897

Content

This repository contains links to the TamperedNews (Link) and News400 (Link) datasets. For both datasets, we provide the following:

  • <dataset>.tar.gz containing the <dataset>.jsonl with:
    • Web links to the news texts
    • Web links to the news image
    • Outputs of the named entity recognition and disambiguation (NERD) approach
    • Untampered and tampered entities for each document
  • <dataset>_features.tar.gz with visual features for events, locations, and persons
  • <dataset>_wordembeddings.tar.gz: word embeddings of all nouns in the news texts
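
As a minimal sketch of how to read one of these JSONL files in Python (the path is a placeholder; the exact field names are best checked by printing the keys of one document):

import json

documents = []
with open("<PATH/TO/dataset.jsonl>", "r", encoding="utf-8") as f:
    for line in f:
        documents.append(json.loads(line))

# Inspect the first document to see the available fields
print(sorted(documents[0].keys()))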

For all entities detected in both datasets, we provide:

  • entities.tar.gz containing an <entity_type>.jsonl for all entity types (events, locations, and persons) with:
    • Wikidata ID
    • Wikidata label
    • Meta information used for tampering
    • Web links to all reference images crawled from Google, Bing, and Wikidata
  • entities_features.tar.gz containing the visual features of the reference images for all entities
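
To illustrate how these features can be used, the following sketch compares one news-image feature against the reference-image features of a single entity via cosine similarity. This is only a toy example with random vectors standing in for the provided features; the papers aggregate such similarities over the reference images and compare untampered against tampered entities, so please refer to the publications for the exact measures.

import numpy as np

def cosine_similarities(news_feature, reference_features):
    """Cosine similarity between one news-image feature (D,)
    and N reference-image features (N, D)."""
    news = news_feature / np.linalg.norm(news_feature)
    refs = reference_features / np.linalg.norm(
        reference_features, axis=1, keepdims=True
    )
    return refs @ news

# Random toy data standing in for features from the *_features.tar.gz files
rng = np.random.default_rng(0)
news_feature = rng.normal(size=512)              # feature of the news image
reference_features = rng.normal(size=(20, 512))  # 20 reference images

print(cosine_similarities(news_feature, reference_features).max())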

The datasets can also be found here: https://doi.org/10.25835/0084897

We also provide source code and config files to reproduce our results:

Installation

We provide a Docker container to execute our code. You can build the container with:

docker build <PATH/TO/REPOSITORY> -t <DOCKER_NAME>

To run the container please use:

docker run \
  --volume <PATH/TO/REPOSITORY>:/src \
  -u $(id -u):$(id -g) \
  -it <DOCKER_NAME> bash

cd /src

Add the flag --gpus all to the docker run command to run the code on your GPUs. For detailed instructions please follow: https://wiki.archlinux.org/index.php/Docker#Run_GPU_accelerated_Docker_containers_with_NVIDIA_GPUs
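
For example, the full command with GPU support reads:

docker run \
  --gpus all \
  --volume <PATH/TO/REPOSITORY>:/src \
  -u $(id -u):$(id -g) \
  -it <DOCKER_NAME> bash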

Inference

Please download (and unpack) the utilized deep learning models from the following links and place them in the respective directories of the project:

  • resources
    • event_classification model download
      • models
        • VisE_CO_cos.pt
        • VisE_CO_cos.yml
      • VisE-O_refined.graphml
    • facenet model download
      • 20180402-114759.pb
      • model-20180402-114759.ckpt-275.data-00000-of-00001
      • model-20180402-114759.ckpt-275.index
      • model-20180402-114759.meta
    • geolocation_estimation model download
      • cfg.json
      • model.ckpt.data-00000-of-00001
      • model.ckpt.index
      • model.ckpt.meta
    • scene_classification model download
      • resnet50_places365.pth.tar

Please run the following command to apply the approach to a self-defined image-text pair:

python infer.py \
  --config <PATH/TO/config.yml> \
  --text <PATH/TO/textfile.txt> \
  --image <PATH/TO/imagefile.jpg> \
  --wikifier_key <YOUR_WIKIFIER_API_KEY>

A Wikifier API key can be obtained by registering at http://www.wikifier.org/register.html.

You can specify the language with: --language [en, de] (en is the default)

Two examples for testing, along with configs for the ICMR'20 and IJMIR'21 publications, are provided in examples.

An example of how to run the code can be found below:

python infer.py \
  --config examples/config_ijmir21.yml \
  --text examples/Second_inauguration_of_Barack_Obama.txt \
  --image examples/Second_inauguration_of_Barack_Obama.jpg \
  --wikifier_key <YOUR_WIKIFIER_API_KEY>

We recommend using the config examples/config_ijmir21.yml of our latest approach presented in IJMIR'21.

Please note that the icrawler used to download reference images from the web does not currently support crawling via Google Images. Instead, 20 reference images are retrieved from Bing.

Build Dataset

You can use the script provided in this repository to download and build the dataset:

python build_dataset.py

This will automatically create a folder resources in the project containing the required data to execute the following steps.

Reproduce Paper Results

We provide all necessary meta information and features to reproduce the results reported in the paper. This step requires downloading the whole dataset as described in Build Dataset. If you have modified the dataset paths, please specify the correct paths to the features, splits, etc. in the corresponding config files.

To reproduce the paper results, please run:

python eval_benchmark.py --config test_yml/<config>.yml

or use the scripts provided in the experiments folder.

The number of parallel threads can be defined with: --threads <#THREADS>
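
For example, all provided configs can be run sequentially with a small loop (a sketch assuming a bash shell and the default test_yml location):

for config in test_yml/*.yml; do
  python eval_benchmark.py --config "$config"
done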

Build your own Models

We provide code to download news texts, news images, and reference images so that you can build your own system based on our datasets. In addition, we provide the source code to extract the textual and visual features used in our paper.

Download News Texts

The following command automatically downloads the text of the news articles:

python download_news_text.py \
  --input <PATH/TO/dataset.jsonl> \
  --output <PATH/TO/OUTPUT/DIRECTORY> \
  --dataset <DATASET=[TamperedNews, News400]>

Additional parameters: Run the script with --debug to enable debugging console outputs. The number of parallel threads can be defined with: --threads <#THREADS>

Outputs: This step stores a variety of meta information for each article with ID document_ID in a file: <document_ID>.json. In addition, the news texts are stored along with all other dataset information in a new file: dataset_with_text.jsonl.

Tip: The script checks whether an article has already been crawled. We recommend running the script several times as some documents might be missing due to timeouts in earlier iterations.
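
For example, a simple way to rerun the crawler several times is a small loop (a sketch assuming a bash shell; the paths are placeholders):

for i in 1 2 3; do
  python download_news_text.py \
    --input <PATH/TO/dataset.jsonl> \
    --output <PATH/TO/OUTPUT/DIRECTORY> \
    --dataset TamperedNews
done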

Known Issues: We are aware that some websites have changed their news content or overall template. For this reason, the texts can differ from our dataset. Please contact us ([email protected]) for further information.

Download Images

The following command automatically downloads the images of news articles or reference images for the entities found in the dataset:

python download_images.py \
  --input <PATH/TO/INPUT.jsonl> \
  --output <PATH/TO/OUTPUT/DIRECTORY> \
  --type <TYPE=[news, entity]>

Additional parameters:

Run the script with --debug to enable debugging console outputs. You can limit the smaller image dimension to a maximum size using --size <SIZE>. The number of parallel threads can be defined with: --threads <#THREADS>

To download the news images provide the path to the dataset.jsonl and run the script with --type news.

To download the reference images of the entities found in the dataset, please provide the path to the respective <entity_type>.jsonl and run the script with --type entity.
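
For example, a call to download entity reference images could look like this (the input path, output directory, and size value are placeholders):

python download_images.py \
  --input <PATH/TO/persons.jsonl> \
  --output <PATH/TO/OUTPUT/DIRECTORY> \
  --type entity \
  --size 512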

Extraction of Textual Features

If you have not executed build_dataset.py, you need to download the fastText models for English (TamperedNews) and German (News400). Put both models in the same folder (referred to as fasttext_folder below); the default folder is resources/fasttext.

You can extract the textual features of the news text using:

python calculate_word_embeddings.py \
  --dataset <PATH/TO/dataset_with_text.jsonl> \
  --fasttext <PATH/TO/fasttext_folder> \
  --output <PATH/TO/OUTPUTFILE.h5>
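
The resulting HDF5 file can be inspected with h5py, for example (a minimal sketch; the internal key layout is not documented here, so we only list the top-level entries):

import h5py

# Open the word-embedding file written by calculate_word_embeddings.py
# (the path is a placeholder) and list its top-level keys.
with h5py.File("<PATH/TO/OUTPUTFILE.h5>", "r") as f:
    keys = list(f.keys())
    print(len(keys), "entries, e.g.:", keys[:5])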

Extraction of Visual Features

Please download (and unpack) the models as described in Inference.

You can extract the visual features of the images downloaded according to Download Images using:

python calculate_image_embeddings.py \
  --input <PATH/TO/INPUT.jsonl> \
  --directory <PATH/TO/DOWNLOAD/FOLDER> \
  --model <PATH/TO/MODEL/FOLDER> \
  --type <TYPE=[news, entity]> \
  --output <PATH/TO/OUTPUTFILE.h5>

Please note that the path provided with --directory needs to match the output directory specified in Download Images.

To generate the scene probabilities for all 365 Places2 categories, set the flag --logits.

Additional parameters: Run the script with --debug to enable debugging console outputs. Set the flag --cpu to generate the embeddings using a CPU.

Credits: We thank all the original authors for their work. The corresponding GitHub repositories are linked here:

  • https://github.com/CSAILVision/places365
  • https://github.com/davidsandberg/facenet
  • https://github.com/TIBHannover/GeoEstimation
  • https://github.com/TIBHannover/VisE

LICENSE

This work is published under the GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007. For details please check the LICENSE file in the repository.