
ML-You-Can-Use


Practical Machine Learning and Natural Language Processing with examples.

Featuring

  • Interesting applications of ML, NLP, and Computer Vision
  • Practical demonstration notebooks
  • Reproducible experiments
  • Illustrated best practices:
    • Code extracted from notebooks for:
      • Automatic formatting with Black
      • Type checking via MyPy annotations
      • Linting via Pylint
      • Doctests wherever possible

Setup

Download this repo and its submodules using git, e.g.:

git pull --recurse-submodules

Submodules pull in some data and external data-processing utilities that we'll use for preprocessing some of the data.

Install Python 3

Create Virtual Environment

mkdir p3
`which python3` -m venv ./p3
source setPythonHashSeed.sh
source p3/bin/activate

Install Requirements

pip install -r requirements.txt

To run all notebook examples:

pip install -r requirements-dev.txt

Note: some examples have a conda environment.yaml file that you will want to use instead.

Installing Test Corpora

Many notebooks use data that must be installed first; do so by running the install script:

install_corpora.sh

  • Installs Python SSL certificates
  • Installs CLTK data for Latin and Greek
  • Installs NLTK data

Testing

./runUnitTests.sh

Interactivity

jupyter notebook

Notebooks

Getting data

  • Extracting Occupation and Employer data from Wikidata

Labeling Data

  • Labeling occupation data with Wikipedia and GoogleNews
  • Correcting GoogleNews labels with Cleanlab
  • Training to label with BERT and Cleanlab

Modeling Language

  • Assessing Corpus Quality
  • Making a Frequency Distribution
  • Making a Word Trie Probability Model
  • Word and Sentence Probability using BERT
  • Comparing Collocation Extraction Methodologies
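To give a flavor of the frequency-distribution notebook above, a toy distribution can be built with the standard library (an illustrative sketch, not the notebook's code):

```python
from collections import Counter

# Toy corpus; the notebooks build distributions over full corpora.
tokens = "in principio erat verbum et verbum erat apud deum".split()
freq = Counter(tokens)

# The two most frequent word types and their counts.
print(freq.most_common(2))
```

Counting word types like this is the starting point for the corpus-quality and probability-model notebooks as well.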

Detecting Duplicate Documents

  • Merge corpora by detecting and filtering duplicate documents
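The notebook's approach is more involved, but the core idea of filtering exact duplicates can be sketched with stdlib hashing (a minimal illustration; function names here are our own, not the repo's):

```python
import hashlib
from typing import Iterable, List


def fingerprint(text: str) -> str:
    # Normalize whitespace and case so trivially different copies match.
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def filter_duplicates(docs: Iterable[str]) -> List[str]:
    seen = set()
    unique = []
    for doc in docs:
        fp = fingerprint(doc)
        if fp not in seen:
            seen.add(fp)
            unique.append(doc)
    return unique


docs = ["Arma virumque cano", "arma  virumque  cano", "Troiae qui primus ab oris"]
print(len(filter_duplicates(docs)))  # the two near-identical copies collapse to one
```

Real near-duplicate detection needs fuzzier fingerprints (e.g. shingling or MinHash), which is what the notebook explores.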

Classifying Texts

  • Benchmarking our classifier
  • Bootstrapping Document Classification

Detecting Loanwords

  • Making a Frequency Distribution of Transliterated Greek
  • Boosting Training Data
  • The Problem of Loanwords, and a Solution
  • Feature Engineering with the Loanwords matrix
  • Detecting Loanwords with Keras

Wikipedia Corpus Processing

  • English Wikipedia Corpus Cleaning
  • English Wikipedia Corpus Processing
  • Latin Corpus Processing
  • Downsample or not

Quality Embeddings

  • Generating an English Wikipedia word vector
  • Generating a Latin word vector
  • The Case for Using an Embedding Encoder
  • Sentence Embeddings - A simple but effective baseline - using Seneca

Computer Vision - Object Detection

  • Object detection as a multivariable regression using a custom Convnet
  • Assessing the Noisy Circle detector

Summarizing Texts

  • Assessing Headline Generation

Searching and Search Relevance

  • Search Results Relevance using BERT

References and Acknowledgements