wrangl icon indicating copy to clipboard operation
wrangl copied to clipboard

Parallel data preprocessing for NLP and ML.

Wrangl

Tests

Parallel data preprocessing and fast experiments for NLP and ML. See docs here.

Why?

I built this library to prototype ideas quickly. In essence it combines Hydra, Pytorch Lightning, moolib, and Ray for some fast data processing and (supervised/reinforcement) learning. The following are supported with command line or config tweaks (e.g. no additional boilerplate code):

  • checkpointing
  • early stopping
  • auto git diffs
  • logging to S3 (along with auto-generated seaborn plot), wandb
  • Slurm launcher

Installation

pip install -e .  # add [dev] if you want to run tests and build docs.

# for latest
pip install git+https://github.com/vzhong/wrangl

# pypi release
pip install wrangl

If moolib install fails because you do not have CUDA you can try installing it yourself with env USE_CUDA=0 pip install moolib.

Usage

See the documentation for how to use Wrangl. Examples of projects using Wrangl are found in wrangl.examples. In particular wrangl.examples.learn.xor_clf shows an example of using Wrangl to quickly set up a supervised classification task. wrangl.examples.learn.atari_rl shows an example of reinforcement learning using IMPALA VTrace. For parallel data preprocessing wrangl.examples.preprocess.using_stanza shows an example of using Stanford NLP Stanza to parse text in parallel across CPU cores.

If you find this work helpful, please consider citing

@misc{zhong2021wrangl,
  author = {Zhong, Victor},
  title = {Wrangl: Parallel data preprocessing for NLP and ML},
  year = {2021},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/vzhong/wrangl}}
}