data_cleaning
A repository of SQL data cleaning projects.
Introduction
This repo collects small projects for practicing data cleaning with SQL, Excel or any other method. It was inspired by a LinkedIn post by Sushanta Khara.
Project List:
Problem Statement
In data analysis, the analyst must ensure that the data is 'clean' before analyzing it. 'Dirty' data can lead to unreliable, inaccurate or misleading results. Garbage in = garbage out.
These are some of the steps that can be taken to properly prepare your dataset for analysis:
- Check for duplicate entries and remove them.
- Remove extra spaces and/or other invalid characters.
- Separate or combine values as needed.
- Ensure that certain values (age, dates...) fall within a valid range.
- Check for outliers.
- Correct misspellings and incorrectly entered data.
- Add new, relevant rows or columns to the new dataset.
- Check for null or empty values.
Using the criteria above, create a new SQL table with the properly formatted data.
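The steps above can be sketched roughly as follows. This is only an illustration, not the repository's own solution: it uses Python's built-in sqlite3 module to run the SQL against an in-memory database, and the table names (raw_members, clean_members), the columns, the sample rows and the 0 to 120 age range are all hypothetical.

```python
import sqlite3

# In-memory database so the sketch leaves nothing behind on disk.
conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Hypothetical 'dirty' source table (names and rows invented for illustration).
CREATE TABLE raw_members (full_name TEXT, age INTEGER, email TEXT);
INSERT INTO raw_members VALUES
    ('  Ann Lee ', 34,   'ann@example.com'),  -- extra spaces
    ('Ann Lee',    34,   'ann@example.com'),  -- duplicate once trimmed
    ('Bob Stone',  230,  'bob@example.com'),  -- age out of range
    ('Cy Young',   41,   ''),                 -- empty email
    ('Dee Park',   NULL, 'dee@example.com');  -- missing age

-- New table holding the cleaned data.
CREATE TABLE clean_members AS
SELECT
    -- Separate the full name into two columns at the first space.
    substr(full_name, 1, instr(full_name, ' ') - 1) AS first_name,
    substr(full_name, instr(full_name, ' ') + 1)    AS last_name,
    age,
    email
FROM (
    -- Trim stray spaces first so that DISTINCT can remove duplicates,
    -- and turn empty strings into NULL.
    SELECT DISTINCT
        TRIM(full_name)         AS full_name,
        age,
        NULLIF(TRIM(email), '') AS email
    FROM raw_members
) AS trimmed
-- Range check; rows with a NULL age also fail this test and are dropped.
WHERE age BETWEEN 0 AND 120;
""")

clean_rows = conn.execute(
    "SELECT first_name, last_name, age, email "
    "FROM clean_members ORDER BY first_name"
).fetchall()

# Check for null values that remain in the cleaned table.
missing_emails = conn.execute(
    "SELECT COUNT(*) FROM clean_members WHERE email IS NULL"
).fetchone()[0]

print(clean_rows)
print(missing_emails)
```

Whether to drop out-of-range or missing values, as this sketch does, or to flag and correct them instead depends on the dataset. The function names used here (TRIM, NULLIF, substr, instr) are the SQLite spellings, and other SQL dialects may name them differently.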
Datasets used
This repository contains different projects/datasets to give the user many opportunities to practice:
- Basic select statements (select, where, group by, having).
- Aggregate functions (count, sum, min, max, avg).
- Joins (inner, outer, left, right).
- CTEs, temp tables and views.
- String and date manipulation functions.
- Window functions (rank, lead, lag, row_number, ntile...).
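As a small taste of a few of these topics, here is a hedged sketch that combines group by/having, aggregate functions, a CTE and a window function. It again runs SQL through Python's built-in sqlite3 module; the orders table and its rows are hypothetical and are not one of the repository's datasets. Window functions assume the bundled SQLite library is version 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

conn.executescript("""
-- Hypothetical practice table (invented for illustration).
CREATE TABLE orders (customer TEXT, order_date TEXT, amount REAL);
INSERT INTO orders VALUES
    ('ann', '2023-01-05', 20.0),
    ('ann', '2023-02-10', 35.0),
    ('bob', '2023-01-20', 15.0),
    ('cy',  '2023-03-01', 50.0),
    ('cy',  '2023-03-15', 10.0),
    ('cy',  '2023-04-02', 25.0);
""")

# group by + having with aggregates: customers with more than one order.
repeat_customers = conn.execute("""
    SELECT customer, COUNT(*) AS n_orders, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING COUNT(*) > 1
    ORDER BY customer
""").fetchall()

# CTE + window function: number each customer's orders from newest to
# oldest, then keep only the newest one.
latest_orders = conn.execute("""
    WITH ranked AS (
        SELECT customer, order_date, amount,
               ROW_NUMBER() OVER (
                   PARTITION BY customer
                   ORDER BY order_date DESC
               ) AS rn
        FROM orders
    )
    SELECT customer, order_date, amount
    FROM ranked
    WHERE rn = 1
    ORDER BY customer
""").fetchall()

print(repeat_customers)
print(latest_orders)
```

The same ROW_NUMBER pattern is a common way to remove duplicates during cleaning: partition by the columns that define a duplicate and keep only the rows where rn = 1.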