The Data Science Codex

A collection of code and resources to serve as a starting point for data science projects. For more explanation and material on R, visit my blog.

Notes

  • Resources - Websites and references that I find helpful for data science projects.
  • Developing With R - Notes on R package development.
  • How to Git - Version control with Git.
  • How to Anaconda - Managing environments with Anaconda.

Data Visualization

  • Visualization Cookbook (R) - A wide variety of data visualizations demonstrated.
  • Geospatial Data Analysis (R) - Making maps with R.

Statistical Modeling and Machine Learning

  • Modeling Fundamentals (R) - A primer on logistic and linear regression modeling with the classic Titanic dataset.
  • Survival Analysis (R) - Survival analysis methods such as Cox proportional hazards models and Kaplan-Meier curves.
  • Modeling Workflows (R) - Streamlined Tidyverse modeling workflows with the gapminder dataset.
  • Multilevel Models (R) - Multilevel (a.k.a. mixed-effects) models.
  • Time Series Modeling (R) - Experimenting with time series modeling (tsibble, forecast libraries, prophet, etc.).
  • Ordinal Regression (R) - Experimenting with ordinal (ranked categorical outcome) regression.
  • Presenting Regression Models (R) - Code for cleaning the outputs of regression models for presentations.
  • Sklearn Modeling Workflows (Python) - Modeling workflows with sklearn (cross-validation, randomized search for optimizing hyperparameters, lift curves).
  • Sklearn - Skopt Workflow (Python) - Modeling workflow with sklearn and scikit-optimize (Bayesian hyperparameter optimization).
  • Machine Learning with Caret (R) - Using the Caret library for machine learning.
  • Parsnip (R) - Fitting models with the parsnip package (from tidymodels).

Bayesian Models

  • Bayesian Basics (R) - Exploring a simple Bayesian multilevel model.
  • Bayesian Modeling (R) - Experimenting with Bayesian models using rstanarm.
  • Comparing Bayesian Packages (R) - Comparing rstanarm, brms, and rstan.

Clustering

  • k-means clustering (R) - Using the k-means algorithm to cluster data.
  • Clustering (Python) - Agglomerative (hierarchical) clustering, k-means clustering, and Gaussian mixture models.
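
As a rough illustration of the k-means algorithm mentioned in the entries above, here is a minimal standard-library sketch of Lloyd's algorithm. It is not taken from the linked notebooks, which may rely on a library instead. The function name `kmeans`, the toy `points` data, and the evenly spaced initialization are all assumptions made for this example; real implementations typically use random or k-means++ initialization.

```python
def kmeans(points, k, n_iter=50):
    """Lloyd's algorithm on a list of equal-length numeric tuples (illustrative sketch)."""
    # Assumption: naive, deterministic initialization from evenly spaced points.
    centroids = [points[i * len(points) // k] for i in range(k)]
    labels = [0] * len(points)
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid
        # (squared Euclidean distance).
        labels = [
            min(range(k),
                key=lambda j: sum((a - b) ** 2 for a, b in zip(p, centroids[j])))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        new_centroids = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centroids.append(
                    tuple(sum(coord) / len(members) for coord in zip(*members))
                )
            else:
                # An empty cluster keeps its previous centroid.
                new_centroids.append(centroids[j])
        if new_centroids == centroids:
            break  # converged: centroids stopped moving
        centroids = new_centroids
    return centroids, labels


# Hypothetical toy data: two well-separated groups of 2-D points.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.3)]
centroids, labels = kmeans(points, k=2)
```

On this toy data the first three points and the last three points end up in separate clusters. Results on real data depend heavily on the initialization and on the choice of `k`.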

Stats Analysis

  • Power Analysis (R) - Statistical power analysis.
  • Distribution Sampling and Hypothesis Testing (R)
  • Hypothesis Testing (R)
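
To give a flavor of what a power analysis computes, the sketch below estimates the per-group sample size for a two-sided, two-sample comparison of means using the normal approximation n = 2 * ((z_(1-alpha/2) + z_power) / d)^2, where d is the standardized effect size. It is written in Python with the standard library only and is not taken from the linked R notebook, which may use a different method or an exact t-based calculation. The function name `n_per_group` and its defaults are assumptions for this example.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided, two-sample z-test.

    effect_size is the standardized mean difference (Cohen's d).
    This is a normal approximation; a t-based calculation gives
    slightly larger values.
    """
    z = NormalDist()  # standard normal distribution
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for the two-sided test
    z_power = z.inv_cdf(power)          # quantile matching the target power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


n_medium = n_per_group(0.5)  # medium effect
n_small = n_per_group(0.2)   # small effect needs many more observations
```

Smaller effects require sharply larger samples because the required n scales with 1 / d^2.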

NLP

  • Document Embeddings (Python) - Using word embeddings to compare the similarity of State of the Union addresses.
  • State of the Union Analysis (Python) - An exploration of State of the Union addresses with topic modeling and sentiment analysis.
  • Sentiment Analysis (R) - Exploring sentiment analysis in R.
  • LSTM Demo (Python) - An LSTM network for predicting whether a company review from Glassdoor is positive.

Miscellaneous

  • R-Quickstart (R) - Minimal data analysis and visualization workflows. See the blog post "Data Science Essentials" for more details and explanation.
  • Creating Formatted Spreadsheets (R) - How to create a custom formatted spreadsheet report with the openxlsx R package.
  • Using Python and R Together - How to use Python and R code together in the same Jupyter notebook with the rpy2 Python package.
  • R Quotation (R) - If you want to do certain things such as pass variable names as arguments to a function in R, you have to use quotation methods like quo() and enquo(). This notebook demonstrates how to do this. See my blog post on Tidy Evaluation for more details and explanation.
  • SQL Databases (Python) - Code for creating and manipulating a SQL database.
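
For the SQL Databases entry, the following is a minimal sketch of creating and manipulating a database from Python. It uses the standard-library `sqlite3` module with an in-memory database, which is an assumption: the linked notebook may use a different database engine or driver. The `people` table and its rows are hypothetical.

```python
import sqlite3

# In-memory database so the example leaves no file behind.
conn = sqlite3.connect(":memory:")

# Create a table (hypothetical schema for illustration).
conn.execute(
    "CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT NOT NULL, age REAL)"
)

# Insert rows with parameter placeholders rather than string formatting.
conn.executemany(
    "INSERT INTO people (name, age) VALUES (?, ?)",
    [("Ann", 29.0), ("Bob", 41.5), ("Cy", 35.0)],
)

# Update and delete rows.
conn.execute("UPDATE people SET age = age + 1 WHERE name = ?", ("Bob",))
conn.execute("DELETE FROM people WHERE name = ?", ("Cy",))
conn.commit()

# Query the remaining rows.
rows = conn.execute("SELECT name, age FROM people ORDER BY id").fetchall()
conn.close()
```

Using `?` placeholders lets the driver handle quoting, which avoids SQL injection when values come from user input.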