organization-geospatial-DEPRECATED
submitted by email for instructor training
Part of lesson to be improved for Carpentries Instructor checkout
Spatial Intro 07: Cleaning Data -- Missing and Bad Data Values
Link below:
#http://www.datacarpentry.org/r-spatial-data-management-intro/R/missing-bad-data
My suggestion for improving this lesson is to use the dplyr package when
filtering NA and bad values. This makes the code easier to read and introduces
dplyr into the R workflow for both new and advanced R users. In addition, a great
and simple dataset riddled with NA and bad values is the HURDAT2 database of
hurricane wind records (1851-2016), which the R package "HURDAT" makes easy
to download for most R users. It can be used to introduce the "%>%" operator
by filtering the data for a single hurricane or condition. It also shows
when to replace NA values with 0 and when to leave them as NA.
Objective to add to the lesson: Understand how to identify, subset, and
alter NA values in a dataframe of spatial data
################ START OF LESSON ################
## <= Double hash indicates text outside of code describing the lesson
# <= Single hash indicates comment in code
Libraries
library(dplyr)
library(HURDAT)
Improving the "Check for NA values" section below
Introduce dplyr with %>% to filter data for Hurricane Andrew
Load data and assign variable for all storms in the Atlantic basin
AL <- as.data.frame(get_hurdat(basin = "AL"))
Use %>% from the dplyr package to pass the AL dataframe into the filter
function and extract just the data for Hurricane Andrew by its key (a unique
identifier for every tropical cyclone)
andrew <- AL %>% filter(Key == 'AL041992')
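As a minimal sketch of what the pipe does, the example below uses a small made-up data frame (the name `storms_toy` and its values are illustrative only, not HURDAT2 records) to show that the piped call and the ordinary nested call give the same result.

```r
library(dplyr)

# Hypothetical toy data: two storm keys with made-up wind values
storms_toy <- data.frame(
  Key  = c("AL011990", "AL011990", "AL021990"),
  Wind = c(30, 45, 70),
  stringsAsFactors = FALSE
)

# Piped form: the left-hand side becomes the first argument of filter()
piped <- storms_toy %>% filter(Key == "AL011990")

# Equivalent nested form without the pipe
nested <- filter(storms_toy, Key == "AL011990")

# Both forms return the same rows
identical(piped, nested)
```

The piped form reads left to right, which tends to be easier to follow once several steps are chained together.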
Preview the Andrew dataframe, where several columns contain noticeable sets of NA values
head(andrew)
Count the NA values in the "Record" column, which marks landfalls with "L",
by taking the sum() of is.na()
sum(is.na(andrew$Record))
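To see why sum() of is.na() counts missing values, here is a small sketch on a made-up vector (the `record` values are hypothetical, not taken from HURDAT2). is.na() returns TRUE or FALSE for each element, and sum() treats each TRUE as 1.

```r
# Hypothetical toy column: NA means "no record identifier for this row"
record <- c(NA, "L", NA, NA, "L", NA)

# Logical vector, TRUE where the value is missing
missing_flag <- is.na(record)

# TRUE counts as 1, so the sum is the number of NA values
n_missing <- sum(missing_flag)

# Number of rows that do carry an identifier
n_present <- sum(!missing_flag)

# Counts per value, with NA shown as its own category
table(record, useNA = "ifany")
```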
This prints 47 NA values; the rows that are not NA carry a record identifier,
indicating six landfalls. We can confirm this by locating the rows labeled "L"
with filter() and select()
andrew %>% filter(Record == "L") %>% select(DateTime, Record)
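When picking out the rows labeled "L", it helps to know how filter() treats NA. The sketch below uses a made-up data frame (`toy` and its values are hypothetical) to show that a comparison against NA evaluates to NA, and that filter() keeps only rows where the condition is TRUE.

```r
library(dplyr)

# Hypothetical toy data with a mix of NA and "L" records
toy <- data.frame(
  DateTime = c("t1", "t2", "t3", "t4"),
  Record   = c(NA, "L", NA, "L"),
  stringsAsFactors = FALSE
)

# NA == "L" evaluates to NA, so rows with a missing Record are dropped
landfalls <- toy %>% filter(Record == "L")

# Rows with no record identifier must be requested explicitly with is.na()
no_record <- toy %>% filter(is.na(Record))
```

Because of this behaviour, no separate step is needed to exclude the NA rows before filtering on a value.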
Improving the "Deal with NoData Values" section
Now let's look at identifying when to replace NA values with a number or just
leave them as NAs in the dataframe.
For our previous "Record" column, it would not make sense to replace the
missing landfall records with a number, so we leave it unchanged. However,
there are cases beyond the scope of this course, such as plotting, where we
might need to "gapfill" the data.
Let's look back at the NA values present in the wind data columns, NE34 to NW64.
Description of NA data: The NE34 to NW64 columns hold the wind radii
for 34, 50, and 64 knot winds in each quadrant (NE, SE, SW, NW) of the hurricane.
For Andrew, all of these values are NA because the National Hurricane Center (NHC)
did not include them in its post-season analysis until 2004.
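One way to check a claim like "all of these values are NA" is to count the missing values per column. The sketch below does this on a made-up data frame (`radii_toy` is hypothetical and only borrows the column names); across() assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data: wind speeds present, radii columns entirely missing
radii_toy <- data.frame(
  Wind = c(30, 70, 90),
  NE34 = c(NA_real_, NA_real_, NA_real_),
  NE64 = c(NA_real_, NA_real_, NA_real_)
)

# Base R: number of NA values in each column
na_per_column <- colSums(is.na(radii_toy))

# dplyr equivalent: one summary row with an NA count per column
na_summary <- radii_toy %>%
  summarise(across(everything(), ~ sum(is.na(.x))))

# TRUE if every value in the radii columns is missing
all_missing <- all(is.na(radii_toy[, c("NE34", "NE64")]))
```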
While labelling missing records as NA is good for keeping track of them, there are
instances where we can be certain that a subset of the values is 0.
For instance, if the maximum wind speed in the HURDAT2 data is less than 64 knots,
no part of the storm has 64 knot winds, so we can conclude that the "NE64" to
"NW64" columns should be zero.
Let's select the rows where the maximum wind speed ("Wind" column) is < 64 knots
and replace the NA values of the "NE64" to "NW64" columns with 0.
We take the filtered dataframe and pipe it into the replace function, where the
period (.) stands for the dataframe coming in from the pipe
andrew_No_NA <- andrew %>%
  filter(Wind < 64) %>%
  select(Name, DateTime, NE64, SE64, SW64, NW64) %>% # select the relevant variables
  replace(., is.na(.), 0)
Check our new dataframe with no NA values
head(andrew_No_NA)
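The filter() step drops the rows with winds of 64 knots or more. If all rows should be kept, one possible alternative is to replace NA conditionally inside mutate(). This is only a sketch on a made-up data frame (`toy` and its values are hypothetical), and it assumes dplyr 1.0.0 or later for across().

```r
library(dplyr)

# Hypothetical toy data: two rows below 64 knots, two rows at or above
toy <- data.frame(
  Wind = c(30, 55, 70, 100),
  NE64 = c(NA_real_, NA_real_, NA_real_, NA_real_),
  SE64 = c(NA_real_, NA_real_, NA_real_, 25)
)

filled <- toy %>%
  mutate(across(
    c(NE64, SE64),
    # Set NA to 0 only where the maximum wind is below 64 knots;
    # every other value, including NA at higher winds, is left as is
    ~ if_else(is.na(.x) & Wind < 64, 0, .x)
  ))
```

Here the NA values in rows with stronger winds stay NA, since a radius of 0 cannot be assumed for them.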