EDA-NLP-Chinese

Published 20 hours ago


Easy Data Augmentation for NLP on Chinese


EDA-Chinese

Easy Data Augmentation for NLP on Chinese, based on EDA_NLP

1. Notes:

The input file contains the sentences, one sentence per line.
Each sentence must be word-segmented before it is processed by EDA-Chinese, with spaces separating the segments.
Each output sentence is likewise segmented, with spaces between the segments.
Every line of the input file must contain a sentence, or the script will crash.
Processing may be slow at the beginning, but it gets faster as the cache fills.
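The input rules above can be checked before running the script. The following is a minimal sketch, not part of EDA-Chinese: the helper names check_input_lines and join_segments are hypothetical, and the segmentation itself is assumed to have been done already by a tool of your choice.

```python
def check_input_lines(lines):
    """Return the 1-based numbers of lines that contain no sentence."""
    return [i for i, line in enumerate(lines, start=1) if not line.strip()]


def join_segments(segments):
    """Join already-segmented words with single spaces, dropping empty pieces."""
    return " ".join(s.strip() for s in segments if s.strip())


# Example data: the second line is empty and would need to be removed.
sample = ["今天 天气 很 好", "", "我 喜欢 自然 语言 处理"]
blank = check_input_lines(sample)

# Segments produced by some segmenter, with stray whitespace cleaned up.
line = join_segments(["今天", "天气", " 很", "好", ""])
```

If check_input_lines returns a non-empty list, remove or fill those lines before passing the file to augmentation.py.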

2. Requirements:

Synonyms:
pip install -U synonyms
Downloading this package from pip may be very slow; alternatively, download the source code from GitHub and install it manually:
python setup.py install
Chinese stop words: Baidu stop words
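The notes mention that a cache makes processing faster over time. One way such a cache could work is to memoize synonym lookups so that each word is queried only once. The sketch below assumes this design; slow_lookup is a hypothetical stand-in with a hard-coded table, not the Synonyms API, and EDA-Chinese may cache differently.

```python
from functools import lru_cache

# Counts how often the expensive lookup is actually executed.
calls = {"count": 0}


def slow_lookup(word):
    """Hypothetical stand-in for an expensive synonym query."""
    calls["count"] += 1
    table = {"好": ("棒", "不错")}
    return table.get(word, ())


@lru_cache(maxsize=None)
def cached_synonyms(word):
    """Return synonyms for a word, querying the backend once per distinct word."""
    return tuple(slow_lookup(word))


first = cached_synonyms("好")   # runs slow_lookup
second = cached_synonyms("好")  # served from the cache
```

Repeated words across sentences then cost a dictionary lookup rather than a fresh query, which matches the speed-up described in the notes.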

3. How to use:

Create the data folder to hold the input sentences and the augmented sentences:
mkdir data
Put your data into the data folder. Chinese sentences need to be segmented, with the segments separated by spaces.
Data augmentation:
num_aug: number of augmented sentences per original sentence
alpha: fraction of words in each sentence to be changed (0.1 means 10%); details can be found here

python augmentation.py --input ./data/<input_file> --output ./data/<output_file> --num_aug 5 --alpha 0.1
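To get a feel for what alpha means in practice, the sketch below estimates how many words one operation would touch in a segmented sentence. It assumes EDA-Chinese follows the convention of the original EDA_NLP code, where the count is max(1, int(alpha * number_of_words)); this has not been verified against augmentation.py, and the function name words_to_change is hypothetical.

```python
def words_to_change(sentence, alpha):
    """Estimate how many words are changed, assuming the EDA_NLP convention."""
    words = sentence.split()  # the sentence is already segmented with spaces
    return max(1, int(alpha * len(words)))


short_count = words_to_change("今天 天气 很 好 啊", 0.1)      # 5 words
long_count = words_to_change(" ".join(["词"] * 20), 0.1)      # 20 words
```

Under this assumption, at least one word is always changed, so short sentences are altered proportionally more than alpha alone suggests.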

4. Refer:

If you have any questions, please open an issue or send me an email; my address can be found on my GitHub homepage.

About

Easy Data Augmentation for NLP on Chinese

Stars: 17
Forks: 1
