submitted by email for instructor training

Open ErinBecker opened this issue 8 years ago • 0 comments

Part of lesson to be improved for Carpentries Instructor checkout

Spatial Intro 07: Cleaning Data -- Missing and Bad Data Values

Link below:

http://www.datacarpentry.org/r-spatial-data-management-intro/R/missing-bad-data

My suggestion for improving this lesson is to use the dplyr package to filter
the NA and bad values. This makes the code easier to read and introduces dplyr
as part of an R workflow for both new and advanced R users. In addition, a
simple dataset riddled with NA and bad values is the HURDAT2 database of all
hurricane wind records (1851-2016), which is available through the R package
"HURDAT" for easy download. It can be used to introduce the "%>%" operator by
filtering the data for a hurricane or condition of interest. It also shows an
instance of when to replace, and when not to replace, NA values with 0.

Objective to add to the lesson: Understand how to identify, subset, and

alter NA values in a dataframe of spatial data

################ START OF LESSON ################

## <= Double hash indicates text outside of code describing the lesson

# <= Single hash indicates a comment in code

Libraries

library(dplyr)
library(HURDAT)

Improving the "Check for NA values" section below

Introduce dplyr with %>% to filter data for Hurricane Andrew

Load data and assign variable for all storms in the Atlantic basin

AL <- as.data.frame(get_hurdat(basin = "AL"))

Use %>% from the dplyr package to pass the AL dataframe into the filter()
function and extract just the data for Hurricane Andrew using its key (a unique
identifier for every tropical cyclone)

andrew <- AL %>% filter(Key == 'AL041992')
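For learners new to the pipe, a minimal sketch of what %>% does may help. The `storms` dataframe below is a made-up stand-in for the HURDAT2 table (its values are illustrative, not real records); it only shows that the piped call and the ordinary nested call give the same result.

```r
library(dplyr)

# Hypothetical toy data standing in for the HURDAT2 table
storms <- data.frame(
  Key  = c("AL041992", "AL041992", "AL051992"),
  Wind = c(30, 150, 45),
  stringsAsFactors = FALSE
)

# Piped form: storms is passed as the first argument of filter()
piped <- storms %>% filter(Key == "AL041992")

# Equivalent form without the pipe
nested <- filter(storms, Key == "AL041992")

# Both forms keep the same two rows
identical(piped, nested)
nrow(piped)
```

Reading the pipe as "then" (take storms, then filter) is one common way to introduce it.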

Preview the Andrew dataframe, where we see noticeable sets of NA values

head(andrew)

Count the NA values in the "Record" column, which marks landfalls with "L",
by taking the sum() of is.na()

sum(is.na(andrew$Record))
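The same idea extends to every column at once. As a sketch on a made-up dataframe (`toy` and its values are assumptions for illustration, not HURDAT2 records), colSums() over is.na() gives the NA count per column:

```r
# Hypothetical toy data with NA values in several columns
toy <- data.frame(
  Record = c(NA, "L", NA, NA),
  Wind   = c(30, 150, 120, NA),
  NE64   = c(NA_real_, NA, NA, NA),
  stringsAsFactors = FALSE
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
na_per_column <- colSums(is.na(toy))
na_per_column
```

Running the same call on the andrew dataframe should show at a glance which columns are entirely NA.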

This prints 47, the number of NA values, indicating six landfalls. We can
confirm this by viewing the "Record" column with dplyr's select() and locating
the rows labeled "L"

andrew %>% select(Record)
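An alternative to scanning the printed column is to filter for the "L" rows directly. The sketch below uses a made-up `track` dataframe (an assumption for illustration); note that filter() drops rows where the condition evaluates to NA, so the NA records fall out automatically. Since the Record column can hold identifiers other than "L", table() is a quick way to see every code present.

```r
library(dplyr)

# Hypothetical toy track with two landfall records
track <- data.frame(
  DateTime = c("t1", "t2", "t3", "t4"),
  Record   = c(NA, "L", NA, "L"),
  stringsAsFactors = FALSE
)

# Keep only the landfall rows; rows with NA in Record are dropped by filter()
landfalls <- track %>% filter(Record == "L")
n_landfalls <- nrow(landfalls)
n_landfalls

# Tabulate every code in the column, including the NA count
table(track$Record, useNA = "ifany")
```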

Improving the "Deal with NoData Values" section

Now let's look at identifying when to replace NA values with a number or just

leave them as NAs in the dataframe.

For our previous "Record" column, it would not make sense to label the
landfall record with a number, so we can leave it unchanged. However, there
are cases beyond the scope of this course, such as plotting, where we might
need to "gapfill" the data.

Let's look back at the NA values present in the wind data columns, NE34 to NW64.

Description of NA data: The NE34 to NW64 columns represent the wind radii
for 34-, 50-, and 64-knot winds in each quadrant (NE, SE, SW, NW) of the
hurricane. However, all of these values are NA here because the National
Hurricane Center (NHC) did not do a post-analysis of them after the hurricane
season until 2004.

While labelling NA values is good for keeping track of missing records, there are
instances where we can be certain that the values in a subset of the data are 0.

For instance, if the maximum wind speed in the HURDAT2 data is less than 64 knots,

we can conclude that the "NE64" to "NW64" columns should be zero.

Let's select the rows where the maximum wind speed ("Wind" column) is < 64 knots

and replace the NA values of the "NE64" to "NW64" columns with 0.

We take the filtered dataframe and pass it into the replace() function, where
the period stands for the dataframe being piped in

andrew_No_NA <- andrew %>%
  filter(Wind < 64) %>%
  select(Name, DateTime, NE64, SE64, SW64, NW64) %>% # select the relevant variables
  replace(., is.na(.), 0)

Check our new dataframe with no NA values

head(andrew_No_NA)

We have successfully replaced the NA values with 0 for a subset of Hurricane

Andrew's wind radii.
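The approach above returns only the rows with Wind < 64. If the full track should be kept, one possible variation (a sketch, not part of the original lesson) is to fill the 64-knot radii conditionally with mutate() and across(), leaving NA in place wherever the wind is 64 knots or more. The `toy` dataframe and its values are assumptions for illustration.

```r
library(dplyr)

# Hypothetical toy track: two rows below 64 knots, one above
toy <- data.frame(
  Wind = c(30, 60, 100),
  NE64 = c(NA_real_, NA, NA),
  SE64 = c(NA_real_, NA, NA),
  SW64 = c(NA_real_, NA, NA),
  NW64 = c(NA_real_, NA, NA)
)

radii_cols <- c("NE64", "SE64", "SW64", "NW64")

# Replace NA with 0 only where Wind < 64; all other values stay as they are
filled <- toy %>%
  mutate(across(all_of(radii_cols),
                ~ ifelse(is.na(.x) & Wind < 64, 0, .x)))

filled
```

This keeps all rows, so the NA values that remain mark records where the radii are truly unknown rather than known to be zero.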

Oct 25 '17 16:10 ErinBecker