ActivityStartDateTime, time zones, and offset cols
Is your feature request related to a problem? Please describe:
The time zone associated with ActivityStartDateTime is not clear because it is not included in the df returned from dataRetrieval:::create_dateTime. That function runs inside the dataRetrieval functions called by TADA_DataRetrieval, and separately inside TADA_AutoClean when ActivityStartDateTime is missing from the input df.
Describe the solution you'd like:
I think it might be more user friendly to include a column titled ActivityStartDateTime.TimeZoneCode (UTC in this case) instead of ActivityStartTime.TimeZoneCode_offset (which holds the offset as a number of hours). As is, the target time zone for ActivityStartDateTime (a function input here) is not documented anywhere in the returned df (see review_TADAProfile1 below).
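A minimal sketch of the proposed column, using base R only. The data frame is a made-up stand-in for the output of create_dateTime, and filling the new column from the "tzone" attribute is just one possible approach, not an existing TADA or dataRetrieval feature.

```r
# Hypothetical stand-in for a df that already has ActivityStartDateTime
df <- data.frame(
  ActivityStartDateTime = as.POSIXct("2023-05-11 11:45:00", tz = "UTC")
)

# Record the target time zone in its own column so it is visible in the df
# and survives export to CSV (the tzone attribute alone would be lost)
df$ActivityStartDateTime.TimeZoneCode <- attr(df$ActivityStartDateTime, "tzone")

df$ActivityStartDateTime.TimeZoneCode
```

ActivityStartDateTime itself is left untouched, so existing code that reads it should keep working.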
Describe alternatives you've considered:
Alternatively, UTC could potentially be included in ActivityStartDateTime itself (e.g. "2023-05-11 11:45:00 UTC"), but that might break people's workflows.
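To illustrate why this alternative could break workflows, here is a small base R sketch (the timestamp is the example value above; the variable names are made up). Appending the zone label means formatting the value, which turns a date-time column into a character column.

```r
# Example timestamp stored as a proper date-time in UTC
x <- as.POSIXct("2023-05-11 11:45:00", tz = "UTC")

# Appending the zone label requires formatting, which returns text
labelled <- format(x, "%Y-%m-%d %H:%M:%S %Z")

class(x)         # date-time class, supports arithmetic and comparisons
class(labelled)  # character, so date arithmetic no longer works directly
```

Any downstream code that does date arithmetic or filtering on ActivityStartDateTime would need to parse the string back first.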
Additional context:
Regarding dataRetrieval: The internal code is here: https://github.com/DOI-USGS/dataRetrieval/blob/main/R/importWQP.R#L223 (you can call it, you'd just need to use a triple colon: dataRetrieval:::create_dateTime)
offsetLibrary is a dataframe saved in sysdata.rda. You can see where and how it gets called here: https://github.com/DOI-USGS/dataRetrieval/blob/main/R/importWQP.R#L160
Review_TADAProfile1 below. See discussion https://github.com/USEPA/EPATADA/pull/557
# Find web service URLs for each Profile using WQP User Interface (https://www.waterqualitydata.us/)
# Example WQP URL: https://www.waterqualitydata.us/#statecode=US%3A09&characteristicType=Nutrient&startDateLo=04-01-2023&startDateHi=11-01-2023&mimeType=csv&providers=NWIS&providers=STEWARDS&providers=STORET
# Use TADA_ReadWQPWebServices to load the Station, Project, and Phys-Chem Result profiles
stationProfile <- TADA_ReadWQPWebServices("https://www.waterqualitydata.us/data/Station/search?statecode=US%3A09&characteristicType=Nutrient&startDateLo=04-01-2023&startDateHi=11-01-2023&mimeType=csv&zip=yes&providers=NWIS&providers=STEWARDS&providers=STORET")
physchemProfile <- TADA_ReadWQPWebServices("https://www.waterqualitydata.us/data/Result/search?statecode=US%3A09&characteristicType=Nutrient&startDateLo=04-01-2023&startDateHi=11-01-2023&mimeType=csv&zip=yes&dataProfile=resultPhysChem&providers=NWIS&providers=STEWARDS&providers=STORET")
projectProfile <- TADA_ReadWQPWebServices("https://www.waterqualitydata.us/data/Project/search?statecode=US%3A09&characteristicType=Nutrient&startDateLo=04-01-2023&startDateHi=11-01-2023&mimeType=csv&zip=yes&providers=NWIS&providers=STEWARDS&providers=STORET")
# Join all three profiles using TADA_JoinWQPProfiles
TADAProfile <- TADA_JoinWQPProfiles(FullPhysChem = physchemProfile, Sites = stationProfile, Projects = projectProfile)
# Run TADA_CheckRequiredFields, returns error message, 'The dataframe does not contain the required fields: ActivityStartDateTime'
TADA_CheckRequiredFields(TADAProfile)
# Add missing col
TADAProfile1 <- dataRetrieval:::create_dateTime(
  df = TADAProfile,
  date_col = "ActivityStartDate",
  time_col = "ActivityStartTime.Time",
  tz_col = "ActivityStartTime.TimeZoneCode",
  tz = "UTC"
)
# Select the date/time columns to review the result
review_TADAProfile1 <- TADAProfile1 %>% dplyr::select(c(
  "ActivityStartDate",
  "ActivityStartTime.Time",
  "ActivityStartTime.TimeZoneCode",
  "ActivityStartDateTime",
  "ActivityStartTime.TimeZoneCode_offset"
))
# re-run TADA_CheckRequiredFields, returns TRUE
TADA_CheckRequiredFields(TADAProfile1)
I like the idea of a separate time zone code column. As you note, this seems like it would be much less likely to impact existing workflows.
From Laura D:
Just making sure we're both on the same page: In dataRetrieval, the default timezone is UTC, set here: https://github.com/DOI-USGS/dataRetrieval/blob/main/R/readWQPdata.R#L200
You can read about changing timezones here: https://doi-usgs.github.io/dataRetrieval/reference/readWQPdata.html#arg-tz
This sets the time zone attribute of the POSIX object.
Like this:
library(dataRetrieval)
nameToUse <- "pH"
pHData <- readWQPdata(siteid = "USGS-04024315", characteristicName = nameToUse, service = "ResultWQX")
attr(pHData$Activity_StartDateTime, "tzone")
[1] "UTC"
pHData$Activity_StartDateTime[1]
[1] "1975-09-27 15:50:00 UTC"

pHData2 <- readWQPdata(siteid = "USGS-04024315", characteristicName = nameToUse, tz = "America/Chicago", service = "ResultWQX")
attr(pHData2$Activity_StartDateTime, "tzone")
[1] "America/Chicago"
pHData2$Activity_StartDateTime[1]
[1] "1975-09-27 10:50:00 CDT"
So what you are asking for is a column that converts the offset number of hours to the timezone it was converted to?
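For reference, the relationship between the offset hours and the target time zone can be sketched in base R. This only illustrates the arithmetic and is not dataRetrieval's actual implementation: the lookup vector, its values, and the variable names are assumptions made for the example.

```r
# Hypothetical lookup: hours to add to local wall-clock time to reach UTC
offsets <- c(EST = 5, EDT = 4, CST = 6, CDT = 5)

local_chr <- "2023-05-11 07:45:00"  # made-up local sample time
tz_code <- "EDT"                    # made-up reported time zone code

# Parse the wall-clock time as if it were UTC, then shift by the offset
utc <- as.POSIXct(local_chr, tz = "UTC") + offsets[[tz_code]] * 3600

format(utc, "%Y-%m-%d %H:%M:%S %Z")
attr(utc, "tzone")  # the target time zone that the offset column does not name
```

The offset says how far the value was shifted, while the "tzone" attribute says where it ended up. The request is to surface the latter as a column.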
Note there's also the link in the help to the OlsonNames() base R function, which talks about how R handles timezones. The issue is that different operating systems, and where in the world the computer thinks you are, will produce different abbreviations for timezones (that's why using the OlsonNames is what has been working best for dataRetrieval). https://rdrr.io/r/base/timezones.html
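A short base R sketch of working with Olson names directly, reusing the timestamp and zone from the example above. It assumes the system's time zone database is available to R, which is the usual case.

```r
tz_name <- "America/Chicago"

# Check that the name is one R recognises on this system
tz_name %in% OlsonNames()

# Display a UTC instant in that zone without relying on abbreviations
x <- as.POSIXct("1975-09-27 15:50:00", tz = "UTC")
local_chr <- format(x, "%Y-%m-%d %H:%M:%S", tz = tz_name)
local_chr
```

Leaving %Z out of the format string keeps the output free of the platform-dependent abbreviation.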
Ideas from working group call: include both original sample local time and UTC
Local time might allow for better comparisons across time zones (depending on what is being compared). For example, evening does not fall at the same UTC time across the country.
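The evening example can be sketched in base R. The timestamp and the two zones are made up for illustration; the point is that one UTC instant maps to different local clock times.

```r
# One instant, stored in UTC
utc <- as.POSIXct("2023-05-12 01:00:00", tz = "UTC")

# The same instant shown as local clock time on each coast
east <- format(utc, "%H:%M", tz = "America/New_York")
west <- format(utc, "%H:%M", tz = "America/Los_Angeles")

east  # late evening on the east coast
west  # early evening on the west coast
```

Keeping both a local-time column and a UTC column would support time-of-day comparisons as well as unambiguous ordering of samples.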