EDA-NLP-Chinese
EDA-Chinese
Easy Data Augmentation for NLP on Chinese, based on EDA_NLP
1. Notes:
- The input file contains the sentences, one sentence per line.
- Each sentence must be segmented before it is processed by EDA-Chinese; use spaces to separate the segments.
- Each output sentence is also segmented, with spaces between the segments.
- Every line of the input file must contain a sentence, otherwise the script will crash.
- Processing may be slow at the beginning, but it gets faster as the cache fills up.
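The input rules above can be checked before running the script. The following is a minimal sketch, not part of EDA-Chinese: `prepare_lines` and `write_input_file` are hypothetical helper names, and the `segment` argument stands for whatever word segmenter you already use. It joins the segments with single spaces and drops blank sentences so that no empty line reaches the input file.

```python
from typing import Callable, Iterable, List


def prepare_lines(sentences: Iterable[str],
                  segment: Callable[[str], List[str]]) -> List[str]:
    """Return space-separated, non-empty lines ready for the input file.

    `segment` is any function that splits one sentence into a list of words.
    """
    lines = []
    for sentence in sentences:
        # Skip blank sentences: an empty input line would crash the script.
        sentence = sentence.strip()
        if not sentence:
            continue
        # Keep only non-blank segments and join them with a single space.
        tokens = [tok.strip() for tok in segment(sentence) if tok.strip()]
        if tokens:
            lines.append(" ".join(tokens))
    return lines


def write_input_file(path: str, lines: Iterable[str]) -> None:
    """Write one segmented sentence per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for line in lines:
            f.write(line + "\n")


if __name__ == "__main__":
    # Toy segmenter for illustration only: splits a sentence into characters.
    print(prepare_lines(["我喜欢猫", "   ", ""], list))
```

In practice you would pass a real segmenter instead of `list`, for example `jieba.lcut` if you happen to use jieba; that choice is an assumption of this sketch, not a requirement of EDA-Chinese.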
2. Requirements:
- Synonyms:
  pip install -U synonyms
  Downloading this package with pip may be very slow. Alternatively, download the source code from GitHub and install it manually:
  python setup.py install
- Chinese stop words: Baidu stop words
3. How to use:
- Create the data folder to hold the input sentences and the augmented sentences:
  mkdir data
- Put your data into the data folder. Chinese sentences need to be segmented, with the segments separated by spaces.
- Data augmentation:
  num_aug: number of augmented sentences per original sentence
  alpha: fraction of words in each sentence to be changed (for example, 0.1 means 10%); details can be found here
python augmentation.py --input ./data/<input_file> --output ./data/<output_file> --num_aug 5 --alpha 0.1
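To get a feel for what a given `alpha` means, the sketch below estimates how many words would be changed in one segmented sentence. It assumes this port follows the original EDA_NLP convention of changing `max(1, int(alpha * number_of_words))` words per operation; that is an assumption about `augmentation.py`, not something confirmed here, and `num_words_to_change` is a hypothetical helper name.

```python
def num_words_to_change(segmented_sentence: str, alpha: float) -> int:
    """Estimate how many words one augmentation operation would change.

    Assumes the original EDA_NLP rule max(1, int(alpha * n)), so at least
    one word is always changed, even in short sentences.
    """
    words = segmented_sentence.split()
    return max(1, int(alpha * len(words)))


if __name__ == "__main__":
    short = "我 喜欢 自然 语言 处理"          # 5 words
    longer = "今天 天气 很 好 我们 去 公园 散步"  # 8 words
    print(num_words_to_change(short, 0.1))
    print(num_words_to_change(longer, 0.25))
```

Under this assumption, small `alpha` values have no extra effect on short sentences, because the count is rounded down and then raised to one.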
4. Contact:
If you have any questions, please open an issue or send me an email; my address can be found on my GitHub homepage.