r-raster-vector-geospatial
change levels to unique in vector attributes lesson
In Explore and Plot by Vector Layer Attributes, the lesson is about seeing unique values and uses levels(lines_HARV$TYPE), which produces NULL because the column is not defined as a factor. I would suggest unique(lines_HARV$TYPE) instead.
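A minimal sketch of the behaviour described above, using a small character vector as a stand-in for lines_HARV$TYPE (the vector `type` is hypothetical, not the lesson data):

```r
# Hypothetical stand-in for lines_HARV$TYPE when it is a character vector
type <- c("woods road", "footpath", "woods road", "stone wall", "boardwalk")

is.factor(type)   # FALSE
levels(type)      # NULL, because a character vector has no levels attribute
unique(type)      # distinct values, in order of first appearance
```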
I wonder if this is due to stringsAsFactors being FALSE by default in R >= 4.0? #328
I think more than that one command would need to be changed because the surrounding text is all about factors and now lines_HARV$TYPE is no longer a factor :(
Yes, I believe that @jsta is correct that this behavior is due to the change in the default value of stringsAsFactors in R version 4.0. I was going to submit a quick pull request, but then I realized that there are some pedagogical choices that need to be made.
Just changing levels() to unique() will fix the NULL output issue, but the larger problem is that there are several places in Episode 7 where lines_HARV$TYPE is referred to as a factor, which leads to a brief discussion of factors. This problem also comes up in Episodes 8 and 10. It seems to me that there are at least two ways to fix this:
1. Change levels() to unique() in Episodes 7, 8, and 10, and update the exposition in Episodes 7 and 10 to remove any discussion of factors.
2. Convert the strings to factors, and leave the exposition (mostly) the same.
I'd be happy to take care of this, but I need some advice about which of these options to choose. My inclination would be to go with Option (1), as it will simplify the lesson a little, and there doesn't seem to be any reason to convert the strings to factors for the purposes of visualizing the data. However, if there was a specific pedagogical reason to include a review of factors in this lesson, then Option (2) would be preferable.
I like option 1 as well. I don't think we have any ggplot code that relies on factors; that would be my only hesitation.
There is code that relies on the ordering of the factors. It still works if lines_HARV$TYPE is a character variable, because (I believe that) ggplot converts character variables to factors when they are used in aes(). So changing levels() to unique() might be slightly confusing in places like the following:
First we will check how many unique values the TYPE field has:
unique(lines_HARV$TYPE)
[1] "woods road" "footpath" "stone wall" "boardwalk"
Then we can create a palette of four colors, one for each feature in our vector object.
road_colors <- c("blue", "green", "navy", "purple")
We can tell ggplot to use these colors when we plot the data.
ggplot() +
  geom_sf(data = lines_HARV, aes(color = TYPE)) +
  scale_color_manual(values = road_colors) +
  labs(color = 'Road Type') +
  ggtitle("NEON Harvard Forest Field Site", subtitle = "Roads & Trails") +
  coord_sf()
The alert reader will notice that woods road is not colored blue, as might be expected, because the road_colors get assigned to the path types in factor (i.e., alphabetical) order, not in the order given by unique(). The same problem happens later when customizing line widths.
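The mismatch can be sketched in base R without the lesson data. The vector `type` below is a hypothetical stand-in for lines_HARV$TYPE, and the pairing step only mimics how an unnamed vector of colors is matched position by position to the sorted category values; it is not ggplot2's actual code.

```r
# Hypothetical stand-in for the distinct values of lines_HARV$TYPE
type <- c("woods road", "footpath", "stone wall", "boardwalk")
road_colors <- c("blue", "green", "navy", "purple")

# Order a learner sees from unique(): order of first appearance
unique(type)

# Order of the categories once the values are treated as a factor: sorted
plot_order <- levels(factor(type))

# Colors paired with types position by position
assigned <- setNames(road_colors, plot_order)
assigned[["woods road"]]  # "purple", not "blue"
```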
So now I'm starting to lean toward Option 2. It is natural to want to customize the order of things in plots, and you can't do that without grappling with factors.
We can recover the pre-version 4.0 behavior by adding stringsAsFactors = TRUE to all of the st_read commands. This is probably the simplest fix, as it doesn't involve changing as much of the exposition, and it will eliminate the confusion of some learners using pre-4.0 versions.
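As a rough illustration of what the argument changes, the sketch below uses read.csv() on inline text rather than st_read() on the lesson's shapefile, so it runs without the data files. The small table in `csv_text` is made up for the example.

```r
# Made-up table standing in for an attribute table with a TYPE column
csv_text <- "id,TYPE
1,woods road
2,footpath
3,stone wall
4,boardwalk"

# Default in R >= 4.0: strings stay character, so levels() is NULL
roads_chr <- read.csv(text = csv_text)
class(roads_chr$TYPE)
levels(roads_chr$TYPE)

# Pre-4.0 behaviour requested for this call only
roads_fct <- read.csv(text = csv_text, stringsAsFactors = TRUE)
class(roads_fct$TYPE)    # "factor"
levels(roads_fct$TYPE)   # the four types, in sorted order
```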
Thanks all for picking up this issue. It seems like unique() would be a quick and dirty fix, but would lead to issues later on. It would also be a good thing for learners to know about using stringsAsFactors = TRUE, since factors accidentally being treated as characters comes up in my own personal code all the time. I like @djhunter's explanation and solution.
After consideration, PR #353 seems like the "nuclear option" to me. It requires so much more typing on the learners' part. What about using unique to list line types and aes(color = factor(column_name, levels = road_colors)) in the plotting commands?
Then we could still discuss factors, but move that discussion to a better spot, somewhere just before the factor-based plotting.
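One reading of this suggestion, sketched in base R: convert the column to a factor at the point of use, with the levels listed in the order the colors should apply. The levels argument of factor() takes the category values, so the sketch passes the type names there rather than road_colors. The names `type`, `type_order`, and `type_f` are hypothetical.

```r
# Hypothetical stand-in for lines_HARV$TYPE
type <- c("woods road", "footpath", "woods road", "stone wall", "boardwalk")
road_colors <- c("blue", "green", "navy", "purple")

# Desired order; here simply the order of first appearance from unique()
type_order <- unique(type)

# Explicit levels override the default sorted order
type_f <- factor(type, levels = type_order)
levels(type_f)  # "woods road" "footpath" "stone wall" "boardwalk"

# Position-by-position pairing now follows the chosen order
setNames(road_colors, levels(type_f))[["woods road"]]  # "blue"
```

In a plotting call this would correspond to something like aes(color = factor(TYPE, levels = type_order)), which is an adaptation of the comment above rather than code from the lesson.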
What if we use options(stringsAsFactors = TRUE) to replicate the pre-4.0 default? This would allow users running R 4.0 to experience the lesson the same way as users running pre-4.0 versions. Then we would not need to add stringsAsFactors = TRUE to each individual read command, which would reduce the amount of typing for the learner.
According to this post, the stringsAsFactors global option will eventually be phased out, so setting it via the options command could lead to errors later when the phaseout happens.
There are only three read commands in which learners would have to type stringsAsFactors = TRUE: when reading HARV_roads.shp, HARV_PlotLocations.csv and hf001-06-daily-m.csv. All of the other changes in pull request #353 are just repetitions of these, which presumably learners won't have to repeat if they maintain their environments between episodes.
We taught this lesson last month. stringsAsFactors = TRUE was not a big deal.
What was a bigger deal was running out of memory in our RStudio hub environment.