
ML-You-Can-Use


Practical Machine Learning and Natural Language Processing with examples.

Featuring

  • Interesting applications of ML, NLP, and Computer Vision
  • Practical demonstration notebooks
  • Reproducible experiments
  • Illustrated best practices:
    • Code extracted from notebooks for:
      • Automatic formatting with Black
      • Type checking via MyPy annotations
      • Linting via Pylint
      • Doctests wherever possible

Setup

Download this repo and its submodules using git, e.g.:

git pull --recurse-submodules

Submodules pull in some data and external data-processing utilities that we'll use for preprocessing some of the data.

Install Python 3

Create Virtual Environment

mkdir p3
`which python3` -m venv ./p3
source setPythonHashSeed.sh
source p3/bin/activate

Install Requirements

pip install -r requirements.txt

To run all notebook examples:

pip install -r requirements-dev.txt

Note: some examples have a conda environment.yaml file that you will want to use instead.

Installing Test Corpora

Many notebooks use data that must be installed first; do so by running the install script:

install_corpora.sh

  • Installs Python SSL certificates
  • Installs CLTK data for Latin and Greek
  • Installs NLTK data

Testing

./runUnitTests.sh

Interactivity

jupyter notebook

Notebooks

Getting data

  • Extracting Occupation and Employer data from Wikidata

Labeling Data

  • Labeling occupation data with Wikipedia and GoogleNews
  • Correcting GoogleNews labels with Cleanlab
  • Training to label with BERT and Cleanlab

Modeling Language

  • Assessing Corpus Quality
  • Making a Frequency Distribution
  • Making a Word Trie Probability Model
  • Word and Sentence Probability using BERT
  • Comparing Collocation Extraction Methodologies
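To give a flavor of the frequency-distribution notebook above, a toy distribution can be built with the standard library (an illustrative sketch, not the notebook's code):

```python
from collections import Counter

# Toy corpus; the notebooks build distributions over full corpora.
tokens = "in principio erat verbum et verbum erat apud deum".split()
freq = Counter(tokens)

# The two most frequent word types and their counts.
print(freq.most_common(2))
```

Counting word types like this is the starting point for the corpus-quality and probability-model notebooks as well.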

Detecting Duplicate Documents

  • Merge corpora by detecting and filtering duplicate documents
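The notebook's approach is more involved, but the core idea of filtering exact duplicates can be sketched with stdlib hashing (a minimal illustration; function names here are our own, not the repo's):

```python
import hashlib
from typing import Iterable, List


def fingerprint(text: str) -> str:
    # Normalize whitespace and case so trivially different copies match.
    normalized = " ".join(text.lower().split())
    return hashlib.sha1(normalized.encode("utf-8")).hexdigest()


def filter_duplicates(docs: Iterable[str]) -> List[str]:
    seen = set()
    unique = []
    for doc in docs:
        fp = fingerprint(doc)
        if fp not in seen:
            seen.add(fp)
            unique.append(doc)
    return unique


docs = ["Arma virumque cano", "arma  virumque  cano", "Troiae qui primus ab oris"]
print(len(filter_duplicates(docs)))  # the two near-identical copies collapse to one
```

Real near-duplicate detection needs fuzzier fingerprints (e.g. shingling or MinHash), which is what the notebook explores.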

Classifying Texts

  • Benchmarking our classifier
  • Bootstrapping Document Classification

Detecting Loanwords

  • Making a Frequency Distribution of Transliterated Greek
  • Boosting Training Data
  • The Problem of Loanwords, and a Solution
  • Feature Engineering with the Loanwords matrix
  • Detecting Loanwords with Keras

Wikipedia Corpus Processing

  • English Wikipedia Corpus Cleaning
  • English Wikipedia Corpus Processing
  • Latin Corpus Processing
  • Downsample or not

Quality Embeddings

  • Generating an English Wikipedia word vector
  • Generating a Latin word vector
  • The Case for Using an Embedding Encoder
  • Sentence Embeddings - A simple but effective baseline - using Seneca

Computer Vision - Object Detection

  • Object detection as a multivariable regression using a custom Convnet
  • Assessing the Noisy Circle detector

Summarizing Texts

  • Assessing Headline Generation

Searching and Search Relevance

  • Search Results Relevance using BERT

References and Acknowledgements