MINER
MINER: Mutual Information based Named Entity Recognition
This repository contains the code for our paper MINER: Improving Out-of-Vocabulary Named Entity Recognition from an Information Theoretic Perspective (ACL 2022).
Table of Contents
- Overview
- Installation
- Preparation
- Training
- Results
- Citation
Overview
NER models underperform when predicting entities that were not seen during training, which is referred to as the Out-of-Vocabulary (OOV) problem. We propose MINER, a novel NER learning framework, to remedy this issue from an information-theoretic perspective.

Installation
Requires Python >= 3.7. We recommend installing the dependencies with pip:
pip install -r requirements.txt
Preparation
Data Preprocessing
You need to put your data (train.txt, dev.txt, test.txt, labels.txt) in a single folder. The data format follows the CoNLL2003 dataset.
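For reference, CoNLL-style files carry one token per line with its NER tag in the final column and a blank line between sentences. A minimal two-column sketch (the official CoNLL2003 release also includes POS and chunk columns, so match whatever your copy of the data uses):
EU B-ORG
rejects O
German B-MISC
call O
to O
boycott O
British B-MISC
lamb O
. O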
To generate new samples, run the two preprocessing scripts below.
python ./scripts/cal_vocab.py --data_dir data_dir_path --tokenizer tokenizer_name
The script outputs pmi.json to your data_dir; it contains important substrings that must be preserved when entities are replaced.
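The exact contents of pmi.json are determined by cal_vocab.py; purely as an illustration, such a file might map substrings to scores along these lines (the keys and values here are assumptions, not the script's real output):
{
  "York": 5.8,
  "##ville": 6.2
}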
python ./scripts/collect_entity.py --data_dir data_dir_path
The script outputs entity.json to your data_dir; it contains entities and their neighbors.
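Likewise, the concrete structure of entity.json comes from collect_entity.py; a hypothetical sketch, assuming it maps each entity to neighboring entities used as replacement candidates:
{
  "New York": ["Boston", "Chicago"],
  "John Smith": ["Jane Doe"]
}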
Prepare Models
BERT, ALBERT, and RoBERTa can be applied directly as our encoders.
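If the encoders are loaded through the HuggingFace transformers library (an assumption; check the repository's model-loading code for the exact mechanism), swapping backbones is a one-line change:
from transformers import AutoModel, AutoTokenizer

# Any of these checkpoints can serve as the encoder backbone.
model_name = "bert-base-uncased"  # or "albert-base-v2", "roberta-base"
tokenizer = AutoTokenizer.from_pretrained(model_name)
encoder = AutoModel.from_pretrained(model_name)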
Training
Note that the optimal beta and gama values (the --beta and --gama flags below) differ across datasets and models; slight adjustment is recommended when applying MINER to other scenarios.
Here we give the parameters for the bert-base-uncased model; performance is further boosted when you use bert-large. You can train and test in one step with the following commands.
# CoNLL2003
python -u main_conll.py --epoch 30 --do_train --batch_size 64 --gama 0.001 --beta 0.01 --gpu_id 0 --lr 0.00001 --switch_ratio 0.5 --data_dir data_dir_path --output_dir out_path
# TwitterNER
python -u main.py --epoch 50 --batch_size 64 --do_train --do_eval --do_predict --gama 0.0001 --beta 0.0001 --gpu_id 0 --lr 0.00001 --switch_ratio 0.5 --data_dir data_dir_path --output_dir out_path
# BioNER
python -u main.py --epoch 50 --batch_size 64 --do_train --do_eval --do_predict --gama 0.0001 --beta 0.001 --gpu_id 0 --lr 0.00001 --switch_ratio 0.5 --data_dir data_dir_path --output_dir out_path
# WNUT2017
python -u main.py --epoch 50 --batch_size 64 --do_train --do_eval --do_predict --gama 0.0001 --beta 0.0001 --gpu_id 0 --lr 0.00001 --switch_ratio 0.5 --data_dir data_dir_path --output_dir out_path
Results

Citation
If you use MINER in your work, please cite our paper:
@inproceedings{wang2022miner,
  title={MINER: Improving Out-of-Vocabulary Named Entity Recognition from an Information Theoretic Perspective},
  author={Wang, Xiao and Dou, Shihan and Xiong, Limao and Zou, Yicheng and Zhang, Qi and others},
  booktitle={Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  pages={5590--5600},
  year={2022}
}