awesome-datasets
Curated datasets for machine learning tasks, organized by use case and adapted from a now-defunct article on Kaggle. Also check out this repo of winning solutions.
For each type of analysis think about:
- What problem does it solve and for whom?
- How is it being solved today?
- What are the data inputs and where do they come from?
- What are the outputs and how are they consumed? Online models, static or dynamic reports?
- Is it a revenue leakage (“saves us money”) or a revenue growth (“makes us money”) problem?
Use Cases By Functions and Verticals
Marketing
Demand Forecasting
Forecast volumes of sales, inventory needed, etc.
- Rossmann - Drugstore sales forecasting
- Online Product Sales - self-help product sales forecasting
Predicting Lifetime Value / Recency-Frequency Matrix
Identify the most lucrative and loyal segments of your customers
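As a minimal sketch of the recency-frequency idea, the snippet below computes, for each customer, the number of days since the last purchase and the number of purchases. The `(customer_id, purchase_date)` record layout, the function name `recency_frequency`, and the toy data are assumptions made for illustration; they do not come from any dataset listed here.

```python
from collections import defaultdict
from datetime import date


def recency_frequency(transactions, as_of):
    """Return {customer_id: (recency_days, frequency)}.

    transactions: iterable of (customer_id, purchase_date) pairs (assumed layout).
    as_of: reference date used to measure recency.
    """
    last_seen = {}
    counts = defaultdict(int)
    for customer_id, purchase_date in transactions:
        counts[customer_id] += 1
        # Keep the most recent purchase date per customer.
        if customer_id not in last_seen or purchase_date > last_seen[customer_id]:
            last_seen[customer_id] = purchase_date
    return {c: ((as_of - last_seen[c]).days, counts[c]) for c in counts}


# Hypothetical toy data, for illustration only.
transactions = [
    ("alice", date(2024, 1, 5)),
    ("alice", date(2024, 3, 1)),
    ("bob", date(2024, 1, 20)),
]
rf = recency_frequency(transactions, as_of=date(2024, 3, 31))
print(rf)
```

Customers with low recency and high frequency are the usual candidates for the "loyal" segment; where to draw the bins is a business decision this sketch leaves open.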
Churn / Up-sell
Identify characteristics and timing of customer churn and upgrades in order to prevent or encourage them
Customer Segmentation
Identify main customer clusters and their characteristics
- Instacart Market Basket Analysis
- Online Retail Dataset
- Loyal Customer Prediction - new customers from 11/11 event on Tmall
Product Grouping / Category Tree
Group products together in the most reasonable category trees
Cross-selling / Recommendation / Market Basket Analysis
Identify which products a customer is going to buy based on past purchases
Explicit Ratings
- MovieLens - Movie recommendation dataset
- Jester - Joke recommendation dataset
- Book-Crossing - Book recommendation dataset
- HetRec - Music recommendation dataset
Implicit Ratings
- Instacart Market Basket Analysis
- WikiLens - Wiki edits dataset
- OpenStreetMap - OpenStreetMap edits dataset
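A very simple starting point for market basket analysis is to count how often pairs of items appear together. The sketch below computes pair support (the share of baskets containing both items). The function name `pair_support` and the toy baskets are hypothetical, and this is only a baseline, not a full recommendation model.

```python
from collections import Counter
from itertools import combinations


def pair_support(baskets):
    """Return {(item_a, item_b): support} for unordered item pairs.

    Support is the fraction of baskets that contain both items.
    """
    pair_counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so each pair has one canonical ordering.
        for pair in combinations(sorted(set(basket)), 2):
            pair_counts[pair] += 1
    n = len(baskets)
    return {pair: count / n for pair, count in pair_counts.items()}


# Hypothetical toy baskets, for illustration only.
baskets = [
    ["bread", "milk"],
    ["bread", "butter", "milk"],
    ["butter", "eggs"],
    ["bread", "milk", "eggs"],
]
support = pair_support(baskets)
print(sorted(support.items(), key=lambda kv: -kv[1])[:3])
```

Pairs with high support are natural cross-sell candidates; on real data such as Instacart, a minimum-support cutoff is usually needed to keep the pair table manageable.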
Channel Attribution and Optimization
Allocate credit fairly across all ad channels and build a balanced portfolio for your ad spending
- AnalyzeCore - Synthetic data and attribution models
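One of the simplest attribution rules is linear attribution, where every touchpoint on a converting path receives an equal share of the credit. The sketch below assumes each path is a `(channels, conversions)` pair; that layout, the function name `linear_attribution`, and the sample paths are illustrative assumptions, not the models used by the resource above.

```python
from collections import defaultdict


def linear_attribution(paths):
    """Split each path's conversions equally across its touchpoints.

    paths: iterable of (list_of_channels, conversion_count) pairs (assumed layout).
    Returns {channel: attributed_conversions}.
    """
    credit = defaultdict(float)
    for channels, conversions in paths:
        if not channels:
            continue  # Nothing to attribute to.
        share = conversions / len(channels)
        for channel in channels:
            # A channel that appears twice in a path earns two shares.
            credit[channel] += share
    return dict(credit)


# Hypothetical conversion paths, for illustration only.
paths = [
    (["search", "email"], 2),
    (["social", "search", "email", "search"], 4),
    (["email"], 1),
]
credit = linear_attribution(paths)
print(credit)
```

The attributed totals always sum to the total number of conversions, which makes this rule a convenient baseline before trying first-touch, last-touch, or model-based schemes.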
Ad Optimization
Predict and price impressions, clicks, conversions or any performance metrics for ads
- Avazu Click-Through Rate Prediction - Mobile ads click-through-rate prediction
- Avito Demand Prediction Challenge - Predict demand for an online classified ad
Ad Fraud
Detect ad click/install fraud
- TalkingData AdTracking Fraud Detection Challenge - Can you detect fraudulent click traffic for mobile app ads?
Dynamic Pricing
Optimal price for growth, profit, customer retention, etc.
Store Layout Optimization
Optimal store/website layout for growth, profit, customer retention, etc.
Customer Feedback
Text classification to determine customer feedback and sentiment about your products
- IMDb - Movie reviews
- Amazon Reviews
- Yelp Open Dataset - Yelp reviews
- Wongnai Challenge - Restaurant reviews
- OpinRank Review Dataset - TripAdvisor and Edmunds Reviews
Customer Support
Question Answering
Generate natural language answers based on given context and questions
- SQuAD - Stanford Question Answering Dataset
Wait Time Prediction
Predict wait time based on customer history, time of day, call volumes, products owned, churn risk, LTV, etc.
Human Resources
Resume screening
Score candidates based on resumes and internal records
Employee Churn
Predict which employees are most likely to leave
- SAS Employee Turnover - Synthetic employee churn dataset
- IBM HR Employee Attrition and Performance - Synthetic employee churn dataset
- Employee Attrition - Synthetic employee churn dataset
Healthcare
Medical Image Classification
Classify medical images according to conditions
- Grand Challenges - Collection of Biomedical Image Competitions
- MURA - Large Dataset for Abnormality Detection in Musculoskeletal Radiographs
- ISIC - International Skin Imaging Collaboration
- DermNet - Skin Disease Atlas
- TCIA - Cancer Imaging Archive
- OASIS - Longitudinal Neuroimaging Dataset
- DDSM - Digital Database for Screening Mammography
- Breast Histopathology Images
- NIH Chest X-rays
- HERLEV - Pap-smear Database
- Stanford Tissue Microarray Database
- CheXpert
- MIMIC-CXR
Readmission risk
Predict risk of readmission based on patient attributes, medical history, diagnosis, and treatment
Patient Report Summary
Generate natural language reports based on tabular data
Automated Triage
Classify patients according to their initial complaints
Hospital Operations Management
Optimize/predict operating theatre & bed occupancy based on initial patient visits
- Healthcare in Washington
- Mini Heritage Health Prize - Processed version of Heritage Health Prize dataset
Real-time Patient Monitoring
Activity monitoring of patients
- OPPORTUNITY - Dataset for Human Activity Recognition from Wearable, Object, and Ambient Sensors
- PAMAP2 - Physical Activity Monitoring Data Set
Survival Analysis
Predict survival rates of patients
- Haberman's Survival Data Set - Survival of patients who had undergone surgery for breast cancer
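A common first step in survival analysis is the Kaplan-Meier estimator, which handles patients whose follow-up ended before an event was observed (censoring). The sketch below is a plain-Python version under the assumption that each patient has a follow-up duration and a flag saying whether the event was observed; the function name and the toy data are hypothetical.

```python
def kaplan_meier(durations, events):
    """Return [(time, survival_probability)] at each time with an observed event.

    durations: follow-up time per patient.
    events: True if the event was observed, False if the patient was censored.
    """
    at_risk = len(durations)
    survival = 1.0
    curve = []
    for t in sorted(set(durations)):
        deaths = sum(1 for d, e in zip(durations, events) if d == t and e)
        leaving = sum(1 for d in durations if d == t)
        if deaths:
            # Multiply by the conditional probability of surviving past t.
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        # Both events and censored patients leave the risk set after t.
        at_risk -= leaving
    return curve


# Hypothetical toy data, for illustration only.
durations = [2, 3, 3, 5, 8]
events = [True, True, False, True, False]
curve = kaplan_meier(durations, events)
print(curve)
```

For real work, a dedicated survival analysis library is a better choice than this sketch, since it also provides confidence intervals and group comparisons.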
Dosage Effectiveness
Analyse the effects of administering different types and dosages of medication for a disease
Media
News Summary
Generate short summaries of news articles
Insurance
Claim Prediction
Predict timing and size of claims
Claim Fraud
Outlier detection for insurance claim fraud
Policy Prediction
Predict type of insurance
Finance
Credit Scoring / Loan Approval / Debt Recovery
Predict which customers are going to default
- Statlog (German Credit Data) Data Set
- Statlog (Australian Credit Approval) Data Set
- Home Credit Default Risk
- A Fin tech fraud transaction classification - default prediction with anonymized features
Portfolio Optimization
Optimize portfolio of assets according to risks and returns
- quantmod - library for financial modeling in R; APIs for downloading fundamental and technical data
- Stanford EE103 - Popular ETFs from 2006 to 2016
Automated Trading
Trade financial assets using automated models
- quantmod - library for financial modeling in R; APIs for downloading fundamental and technical data
- Get Rich or Die Modelin' - Bitcoin trading signals
Fraud Detection
Identify fraudulent transactions and parties with outlier detection and network analysis
- Credit Card Fraud Detection - Anonymized features
- PaySim Synthetic Financial Datasets For Fraud Detection
- Bitcoin Transactions
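As a minimal illustration of the outlier-detection side of fraud detection, the sketch below flags transaction amounts using a modified z-score based on the median absolute deviation (MAD). The 0.6745 scaling constant and the 3.5 cutoff follow a common convention for this score and are assumptions here, as are the function name and the sample amounts. Real fraud systems combine many features and usually network signals as well.

```python
from statistics import median


def mad_outliers(values, threshold=3.5):
    """Return indices of values whose modified z-score exceeds the threshold.

    Modified z-score: 0.6745 * |x - median| / MAD, which is more robust to
    extreme values than a mean/standard-deviation z-score.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        return []  # No spread around the median; the score is undefined.
    return [
        i for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


# Hypothetical transaction amounts, for illustration only.
amounts = [12.0, 15.5, 11.0, 14.0, 13.5, 950.0, 12.5]
flagged = mad_outliers(amounts)
print(flagged)
```

The median-based score is used here because a single huge transaction would inflate the mean and standard deviation and could hide itself from a classic z-score test.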
Manufacturing
Quality Control
Detect malfunctioning pieces with computer vision
Process Optimization
Find bottlenecks in manufacturing processes
Warranty Analytics
Predict the rate and timing of your products' failures
Design
Design new products
- Fashion MNIST - Labeled fashion images
Agriculture, Geography and Environment
Yield Forecasting
Forecast agricultural yields
Satellite Image Classification and Extraction
- Planet: Understanding the Amazon from Space
- SpaceNet - Annotated satellite images of buildings and roads
- Dstl Satellite Imagery Feature Detection
Air Quality
Wildlife Classification
Classify wild animals
- North American Camera Trap Images (NACTI) - camera-trap images of wild animals
Real Estate
Pricing
Predict real estate values based on their characteristics
Education
Automated Essay Scoring
Score essays based on previously graded essays
Utilities
Distribution Network Optimization
Optimize distribution networks of electricity, water, etc.