
Survey Zones naming over time

Open bdbmax opened this issue 3 years ago • 2 comments

Hello!

When pulling data through get_cmhc for different years, the name of what seems to be the same survey zone can differ over time. Here's an example:

plateau <- lapply(2015:2016, \(yr) {
  out <- cmhc::get_cmhc(survey = "Rms",
                        series = "Vacancy Rate",
                        dimension = "Rent Ranges",
                        breakdown = "Survey Zones",
                        geo_uid = 24462,
                        year = yr)
  out$`Survey Zones`[grepl("^Plateau", out$`Survey Zones`)]
})

print(unique(do.call(c, plateau)))

Output: [1] "Plateau Mont-Royal" "Plateau-Mont-Royal"

The name for Le Plateau in Montreal changes over time. Up to and including 2015 there was no hyphen after "Plateau", and after 2015 the hyphen appears. I believe this is the same zone, but is there any way to be sure? The description of the get_cmhc_geography function states that the geographic data corresponds to an extract from 2017 and won't necessarily match regions from other years. Could a year argument be added to get_cmhc_geography, letting us match names to spatial polygons for each individual year? Then, year over year, we could match the actual zones rather than names that might differ by a single character (assuming this is indeed the same survey zone).

Here is another example of names differing in the data, and a zone disappearing in some years:

st_lin <- lapply(2016:2021, \(yr) {
  out <- cmhc::get_cmhc(survey = "Rms",
                        series = "Vacancy Rate",
                        dimension = "Rent Ranges",
                        breakdown = "Survey Zones",
                        geo_uid = 24462,
                        year = yr)
  out$`Survey Zones`[grepl("^Saint-Lin", out$`Survey Zones`)]
})

print(st_lin)

Output: 
[[1]]
character(0)

[[2]]
[1] "Saint-Lin\u0096Laurentides V" "Saint-Lin\u0096Laurentides V"
[3] "Saint-Lin\u0096Laurentides V" "Saint-Lin\u0096Laurentides V"
[5] "Saint-Lin\u0096Laurentides V" "Saint-Lin\u0096Laurentides V"
[7] "Saint-Lin\u0096Laurentides V"

[[3]]
character(0)

[[4]]
character(0)

[[5]]
character(0)

[[6]]
[1] "Saint-Lin-Laurentides V" "Saint-Lin-Laurentides V" "Saint-Lin-Laurentides V"
[4] "Saint-Lin-Laurentides V" "Saint-Lin-Laurentides V" "Saint-Lin-Laurentides V"
[7] "Saint-Lin-Laurentides V"

Maybe the zone just has a different naming in some years?
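
For cases like the two above, where only the separator differs, one possible stopgap is to compare names on a normalized key instead of the raw string. Below is a minimal sketch, assuming the variants differ only in spaces, hyphens, and dash-like characters. `normalize_zone` is a hypothetical helper of mine, not part of the cmhc package, and it does nothing to confirm that two names really refer to the same zone.

```r
# Hypothetical helper (not part of cmhc): build a comparison key from a
# survey zone name by lowercasing it and collapsing every run of whitespace,
# hyphens, and dash-like characters into a single hyphen.
# "\u0096" is the control character seen in the output above; it is probably
# a mis-decoded en dash (byte 0x96 is the en dash in Windows-1252).
normalize_zone <- function(x) {
  x <- trimws(x)
  x <- gsub("[[:space:]\u0096\u2013\u2014-]+", "-", x)
  tolower(x)
}

# The name variants from the two examples above
variants <- c("Plateau Mont-Royal", "Plateau-Mont-Royal",
              "Saint-Lin\u0096Laurentides V", "Saint-Lin-Laurentides V")
keys <- normalize_zone(variants)
print(unique(keys))
```

The key is only meant for joining years of data together; the original names should be kept alongside it for display.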

I think getting the survey zone geography for every year, if at all possible, would be the best way to fix these non-matching names. These zones also have a METZONE_UID in the output of get_cmhc_geography, which would help identify which spatial zone a row of data belongs to, if that code were also in the output of get_cmhc. But having seen the content of the httr::POST call, I understand that the table has only a name to identify the zone, and, as stated, this name isn't constant over the years.

I understand CMHC data isn't super easy to work with! From your experience working with it, do you see a way to solve this problem? The only options I can think of are either getting spatial polygons of the zones for every year (which would be very reliable) or merging years of data by name using the closest string match (less reliable).
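
To make the second option concrete, here is a rough sketch of closest-string matching with base R's `utils::adist` (generalized Levenshtein distance). `match_closest` and its `max_dist` threshold are hypothetical names and values I picked for illustration; nothing here comes from the cmhc package, and the zone list in the example is made up from the names seen above.

```r
# Hypothetical helper (not part of cmhc): for each name in `from`, return the
# closest name in `to` by edit distance, or NA if the best candidate is more
# than `max_dist` edits away. Assumes `to` is not empty.
match_closest <- function(from, to, max_dist = 2) {
  d <- utils::adist(from, to, ignore.case = TRUE)  # length(from) x length(to) matrix
  best <- apply(d, 1, which.min)                   # column index of the nearest name
  best_dist <- d[cbind(seq_along(from), best)]     # distance to that nearest name
  ifelse(best_dist <= max_dist, to[best], NA_character_)
}

# Names as they appear in one year, matched against names from another year
old_names <- c("Plateau Mont-Royal", "Saint-Lin\u0096Laurentides V")
new_names <- c("Plateau-Mont-Royal", "Saint-Lin-Laurentides V", "Rosemont")
matched <- match_closest(old_names, new_names)
print(data.frame(old = old_names, new = matched))
```

This is why I call it less reliable: two genuinely different zones whose names differ by only a character or two would be merged silently, so the threshold would need checking by hand.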

Thanks!

bdbmax, Oct 13 '22 15:10