
Add functions to generate log files for pulled NWIS data

Open jds485 opened this issue 2 years ago • 2 comments

Data pulls from NWIS can return different information from one day to the next, even if the same start and end dates are used (e.g., as data move from provisional to approved, or as data are added to or removed from NWIS). For project development and reproducibility, it is helpful to generate log files that document which sites were pulled from NWIS, along with summary statistics for those sites. Tracking those log files in git makes it easy to detect changes in the NWIS data.

It would be helpful if dataRetrieval included functions to generate such log files for each of the datasets that can be downloaded.
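
For context, here is a minimal sketch of the workflow these logs would support (the site number, parameter code, and file paths are hypothetical): pull data with dataRetrieval, save the raw result, write a summary log with a function like those below, and commit the log so a plain git diff flags changes between pulls.

library(dataRetrieval)
library(readr)

# Hypothetical example: pull daily discharge (parameter 00060) for one site
# and save the raw result so later pulls can be compared against it
dv <- readNWISdv("01411500", parameterCd = "00060",
                 startDate = "1990-01-01", endDate = "2020-12-31")
write_csv(dv, "1_fetch/out/dv_01411500.csv")
# After a log function (see below) writes the summary, commit it:
#   git add 1_fetch/log/daily_flow_log.csv
#   git commit -m "update NWIS pull log"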

Examples for summarizing site data for daily and peak flows are in development within the regional-hydrologic-forcings-ml repo:

library(purrr)   # map()
library(readr)   # read_csv(), write_csv(), column specs
library(dplyr)   # bind_rows(), group_by(), summarize()

get_daily_flow_log <- function(files_in, file_out) {
  message('generating log for dataRetrieval daily flow request')
  # Read each previously saved daily-flow CSV with an explicit column spec
  daily_flow_list <- map(files_in, read_csv, 
              col_types = cols(agency_cd = col_skip(), 
                               site_no = col_character(), 
                               Date = col_date(format = "%Y-%m-%d"), 
                               discharge = col_double(), 
                               discharge_cd = col_character()))
  daily_flow_df <- bind_rows(daily_flow_list)
  # One row per site: record count, period of record, and flow summary stats
  daily_flow_log <- daily_flow_df %>%
    group_by(site_no) %>%
    summarize(num_days = n(), 
              start_date = min(Date), 
              end_date = max(Date), 
              mean_flow = mean(discharge, na.rm = TRUE), 
              sd_flow = sd(discharge, na.rm = TRUE))
  write_csv(daily_flow_log, file_out)
  return(file_out)
}
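
A hedged usage sketch (the input CSVs and output path are hypothetical; the column spec above assumes the raw pulls were saved with renamed columns discharge and discharge_cd):

daily_files <- c("1_fetch/out/dv_01411400.csv",   # hypothetical saved pulls
                 "1_fetch/out/dv_01411500.csv")
get_daily_flow_log(files_in = daily_files,
                   file_out = "1_fetch/log/daily_flow_log.csv")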

get_peak_flow_log <- function(files_in, file_out) {
  message('generating log for dataRetrieval peak flow request')
  # Read each previously saved peak-flow CSV, keeping only the columns
  # needed for the summary; suppress readr warnings about skipped columns
  peak_flow_list <- suppressWarnings(
    map(files_in, read_csv, 
        col_types = cols(agency_cd = col_skip(), 
                         site_no = col_character(), 
                         peak_dt = col_date(format = "%Y-%m-%d"), 
                         peak_tm = col_skip(), 
                         peak_va = col_double(), 
                         peak_cd = col_skip(), 
                         gage_ht = col_skip(), 
                         gage_ht_cd = col_skip(), 
                         year_last_pk = col_skip(), 
                         ag_dt = col_skip(), 
                         ag_tm = col_skip(), 
                         ag_gage_ht = col_skip(), 
                         ag_gage_ht_cd = col_skip(), 
                         peak_dateTime = col_skip(), 
                         ag_dateTime = col_skip())))
  peak_flow_df <- bind_rows(peak_flow_list)
  # One row per site: number of peaks, period of record, and largest peak
  peak_flow_log <- peak_flow_df %>%
    group_by(site_no) %>%
    summarize(num_peaks = n(), 
              first_date = min(peak_dt), 
              last_date = max(peak_dt), 
              max_peak = max(peak_va, na.rm = TRUE))
  write_csv(peak_flow_log, file_out)
  return(file_out)
}
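
And a similar sketch for the peak-flow log, assuming the raw pulls come from dataRetrieval::readNWISpeak() (site numbers and paths are again hypothetical):

library(dataRetrieval)

sites <- c("01411400", "01411500")
peak_files <- map_chr(sites, function(s) {
  f <- file.path("1_fetch/out", paste0("peak_", s, ".csv"))
  write_csv(readNWISpeak(s), f)  # save the raw pull, one file per site
  f
})
get_peak_flow_log(files_in = peak_files,
                  file_out = "1_fetch/log/peak_flow_log.csv")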

Examples for summarizing the total number of downloaded sites and site-days of data for daily and instantaneous water quality data are in the drb-inland-salinity-ml repo:

summarize_site_list <- function(site_list_path, nwis_daily_data, nwis_inst_data, fileout){
  #' 
  #' @description Function to summarize data availability info and save a log file
  #'
  #' @param site_list_path file path and name of the site list file, including the file extension
  #' @param nwis_daily_data data frame containing daily data for all NWIS daily sites
  #' @param nwis_inst_data data frame containing instantaneous data for all NWIS instantaneous sites
  #' @param fileout file path and name for the output file, including the file extension
  #'
  #' @return The path of the saved log file, a one-row data frame containing the total number of observation days (discrete observations + NWIS days), the number of unique lat/lon locations, and the number of sites broken down by data source
  #' @examples summarize_site_list(site_list_path = "2_process/out/site_list.csv", nwis_daily_data = daily_df, nwis_inst_data = inst_df, fileout = "2_process/log/summary_site_list.csv")
  #' 

  # Read in list of sites  
  site_list <- read_csv(site_list_path, show_col_types = FALSE)

  # Summarize data availability
  site_summary <- site_list %>%
    summarize(
      # tally number of unique site id's across data sources
      n_unique_siteids = length(unique(site_id)),
      # tally number of unique lat/lon locations across data sources
      n_unique_latlon = nrow(distinct(., lat, lon)),
      # tally total number of WQP sites that contribute discrete data
      n_all_WQP_sites = nrow(filter(., count_days_discrete > 0)),
      # tally total number of NWIS-daily sites that contribute continuous data
      n_all_nwis_daily_sites = length(unique(nwis_daily_data$site_no)),
      # tally total number of NWIS-inst sites that contribute continuous data
      n_all_nwis_inst_sites = length(unique(nwis_inst_data$site_no)),
      # tally number of unique NWIS sites that contribute continuous data (daily or inst)
      n_unique_nwis_sites = nrow(filter(., grepl("NWIS", data_src_combined, ignore.case = TRUE))),
      # tally number of site id's that have observations from both NWIS and WQP
      n_sites_intersect_WQPNWIS = nrow(filter(., count_days_discrete > 0 &
                                                data_src_combined %in% c("NWIS_daily/Harmonized_WQP_data",
                                                                         "NWIS_instantaneous/Harmonized_WQP_data"))),
      # tally total number of observation-days across data sources
      n_obsdays_total = sum(count_days_total, na.rm = TRUE),
      # tally number of observation-days from NWIS
      n_obsdays_nwis = sum(count_days_nwis, na.rm = TRUE),
      # tally number of discrete sample observation-days
      n_obsdays_discrete = sum(count_days_discrete, na.rm = TRUE))
  
  # Save data availability log file
  write_csv(site_summary, file = fileout)
  
  return(fileout)
}
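
A hedged sketch of how the inputs might be assembled (the site numbers, the use of parameter code 00095 for specific conductance, and the file paths are hypothetical):

library(dataRetrieval)

sites <- c("01411400", "01411500")
# Daily-value and instantaneous pulls for the parameter of interest
daily_df <- readNWISdv(sites, parameterCd = "00095")
inst_df <- readNWISuv(sites, parameterCd = "00095",
                      startDate = "2021-01-01", endDate = "2021-12-31")

summarize_site_list(site_list_path = "2_process/out/site_list.csv",
                    nwis_daily_data = daily_df,
                    nwis_inst_data = inst_df,
                    fileout = "2_process/log/summary_site_list.csv")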

The output log files look like this for the first example:

site_no    n_obs   mean       sd
1411400    2       141        0
1411500    3158    77.06189   8.128901

And like this for the second:

n_unique_siteids             3440
n_unique_latlon              3367
n_all_WQP_sites              3414
n_all_nwis_daily_sites        111
n_all_nwis_inst_sites          95
n_unique_nwis_sites           126
n_sites_intersect_WQPNWIS     100
n_obsdays_total            246088
n_obsdays_nwis             179167
n_obsdays_discrete          66921

jds485 · Jan 31 '22 16:01

(It's really weird, I replied to this earlier and now I don't see that comment. So...if suddenly there are 2 very similar responses from me, that's why I'm repeating myself).

I see the value of something like this. I think it would fit very well in a companion package that we are currently in the very early stages of planning. The package vision is data harmonization (i.e., data from the Water Quality Portal), data summaries (I think I have that exact group_by/summarize in there), and basic checks for QA/QC-type issues.

The reason I'd prefer putting it in a companion package is that we've traditionally been careful to keep dataRetrieval a strict tool for taking web service data and putting it in an R object. If we started offering specific summary calculations, I'd be a little worried about project/scope creep ("hey, since you're offering that calculation, can you add this...and this...and...."), especially when you consider how user-friendly R (the tidyverse, in your example above) is for creating these types of summaries. Another option I would fully support is writing a vignette/blog post about how to create these types of summaries and why you need them.

ldecicco-USGS · Jan 31 '22 19:01

Sounds good! Feel free to migrate this to the companion package.

jds485 · Jan 31 '22 19:01