EDA-NLP-Chinese

Easy Data Augmentation for NLP on Chinese

EDA-Chinese

Easy Data Augmentation for NLP on Chinese, based on EDA_NLP

1. Notes:

  1. The input file contains one sentence per line.
  2. Each sentence must be word-segmented before it is processed by EDA-Chinese, with the segments separated by spaces.
  3. Each output sentence is likewise written as space-separated segments.
  4. Every line of the input file must contain a sentence; otherwise the script will crash.
  5. Processing may be slow at the beginning, but it speeds up as the cache fills.
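The input format described in the notes can be produced with a small helper. The following is a minimal sketch, not part of this repository: `prepare_lines` and `char_segment` are hypothetical names, and `char_segment` is only a placeholder that splits a sentence into single characters. In practice you would pass a real Chinese word segmenter as the `segment` argument.

```python
from typing import Callable, Iterable, List


def prepare_lines(sentences: Iterable[str],
                  segment: Callable[[str], List[str]]) -> List[str]:
    """Return space-separated, non-empty lines ready for the input file."""
    prepared = []
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            # Skip blank entries: every line of the input file must hold a sentence.
            continue
        # Drop whitespace-only tokens so the output has single spaces between segments.
        tokens = [tok for tok in segment(sentence) if tok.strip()]
        if tokens:
            prepared.append(" ".join(tokens))
    return prepared


def char_segment(sentence: str) -> List[str]:
    """Placeholder segmenter (assumption): one token per non-space character."""
    return [ch for ch in sentence if not ch.isspace()]


if __name__ == "__main__":
    raw = ["今天天气很好", "", "  我喜欢自然语言处理  "]
    for line in prepare_lines(raw, char_segment):
        print(line)
```

Writing the returned lines to a file, one per line, gives an input that satisfies notes 1, 2, and 4 above.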

2. Requirements:

  1. Synonyms:
    pip install -U synonyms
    Downloading this package with pip may be very slow. Alternatively, download the source code from GitHub and install it manually:
    python setup.py install

  2. Chinese stop words: Baidu stop words
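In the original EDA method, stop words are excluded when choosing words for synonym replacement, and the stop word list is presumably used the same way here. The sketch below shows one way to load such a list and filter a segmented sentence. It assumes a UTF-8 text file with one stop word per line; the function names are hypothetical and are not taken from this repository.

```python
from typing import List, Set


def load_stop_words(path: str) -> Set[str]:
    """Read a stop word file (assumed format: one word per line, UTF-8)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def replaceable_words(tokens: List[str], stop_words: Set[str]) -> List[str]:
    """Return the unique tokens that are not stop words, in first-seen order."""
    seen = set()
    result = []
    for tok in tokens:
        if tok not in stop_words and tok not in seen:
            seen.add(tok)
            result.append(tok)
    return result
```

For example, with the stop words 的 and 了 loaded, the segmented sentence `我 的 猫 睡 了` would leave 我, 猫, and 睡 as candidates.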

3. How to use:

  1. Create the data folder to hold the input sentences and the augmented sentences:
    mkdir data

  2. Put your data into the data folder. Chinese sentences need to be segmented, with the segments separated by spaces.

  3. Run the data augmentation:
    num_aug: number of augmented sentences to generate per original sentence
    alpha: fraction of words in each sentence to be changed (e.g. 0.1 for 10%); details can be found here

python augmentation.py --input ./data/<input_file> --output ./data/<output_file> --num_aug 5 --alpha 0.1
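To illustrate how `num_aug` and `alpha` interact, here is a minimal sketch of two of the four operations described by the original EDA method (random swap and random deletion) applied to a space-segmented sentence. It is not the code in augmentation.py: the function names and the `seed` parameter are assumptions, and synonym replacement and random insertion are omitted because they depend on the Synonyms package. The rule `max(1, int(alpha * len(words)))` follows the original EDA implementation.

```python
import random
from typing import List


def num_changes(words: List[str], alpha: float) -> int:
    """Number of words to change: alpha * sentence length, but at least 1."""
    return max(1, int(alpha * len(words)))


def random_swap(words: List[str], n: int, rng: random.Random) -> List[str]:
    """Swap two randomly chosen positions, n times."""
    out = list(words)
    if len(out) < 2:
        return out
    for _ in range(n):
        i, j = rng.sample(range(len(out)), 2)
        out[i], out[j] = out[j], out[i]
    return out


def random_deletion(words: List[str], p: float, rng: random.Random) -> List[str]:
    """Delete each word with probability p, never returning an empty sentence."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() >= p]
    return kept if kept else [rng.choice(words)]


def augment(sentence: str, num_aug: int, alpha: float, seed: int = 0) -> List[str]:
    """Produce num_aug variants, alternating between swap and deletion."""
    rng = random.Random(seed)
    words = sentence.split()
    n = num_changes(words, alpha)
    results = []
    for k in range(num_aug):
        if k % 2 == 0:
            new_words = random_swap(words, n, rng)
        else:
            new_words = random_deletion(words, alpha, rng)
        results.append(" ".join(new_words))
    return results


if __name__ == "__main__":
    for line in augment("我 喜欢 自然 语言 处理", num_aug=5, alpha=0.1):
        print(line)
```

With a five-word sentence and `alpha` set to 0.1, `int(0.5)` is 0, so the minimum of one change per operation applies. Larger `alpha` values only take effect on longer sentences.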

4. References:

  1. Synonyms
  2. EDA-blog
  3. EDA-Code

If you have any questions, please open an issue or email me; my address can be found on my GitHub homepage.