EPATADA Occasional test failure - TADA_FindPotentialDuplicatesMultipleOrgs does not grow dataset

We have noticed occasional failures of this test, although TADA_FindPotentialDuplicatesMultipleOrgs has not been edited recently.

The solution for this issue will require finding example data sets which cause this failure and modifying TADA_FindPotentialDuplicatesMultipleOrgs to address those scenarios.

Aug 22 '24 12:08 hillarymarler

Example data set that will fail this test:

df <- TADA_DataRetrieval(startDate = "2006-07-17",
                         endDate = "2006-07-18",
                         statecode = "DE")

Aug 22 '24 14:08 hillarymarler

Additional data sets that cause test failures for testing:

df2 <- TADA_DataRetrieval(startDate = "2023-02-14",
                          endDate = "2023-02-15",
                          statecode = "CO")

df3 <- TADA_DataRetrieval(startDate = "2010-11-30",
                          endDate = "2010-12-01",
                          statecode = "AL")

Aug 22 '24 17:08 hillarymarler

@wokenny13 - I think the extra rows are added when records from the same organization are identified as duplicates in TADA_FindPotentialDuplicatesMultipleOrgs, and that this is a result of updates made to TADA_FindNearbySites.

Aug 22 '24 17:08 hillarymarler

I am also looking into this.

I ran TADA_FindPotentialDuplicatesMultipleOrgs and TADA_FindPotentialDuplicatesSingleOrg with the first example data set (df).

The number of rows increased only for TADA_FindPotentialDuplicatesMultipleOrgs, which identified 25 results as potential duplicates. That matches the number of rows added.

TADA_FindPotentialDuplicatesSingleOrg identified 44 results as potential duplicates but did not add any rows for the same data set.
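
For anyone repeating this comparison, a small helper makes the row-count check explicit. This is only a sketch: rows_added() is a hypothetical helper, and keeps_rows() and grows_rows() are toy stand-ins for the two TADA functions, not their real behavior.

```r
# Hypothetical helper: number of rows that applying `f` adds to `df`.
rows_added <- function(df, f) {
  nrow(f(df)) - nrow(df)
}

# Toy stand-ins (not EPATADA functions) so the example runs offline.
toy <- data.frame(id = 1:3)
keeps_rows <- function(d) d                               # returns input unchanged
grows_rows <- function(d) rbind(d, d[1, , drop = FALSE])  # appends one extra row

rows_added(toy, keeps_rows)
rows_added(toy, grows_rows)
```

With the real data this would presumably be called as rows_added(df, TADA_FindPotentialDuplicatesMultipleOrgs), assuming the function takes the data frame as its first argument; any nonzero result reproduces the failure.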

Aug 22 '24 18:08 wokenny13

I think the issue may be here:

# get rid of results with no site group added - not duplicated spatially
  dupsites <- subset(dupsites, !dupsites$TADA.MonitoringLocationIdentifier %in% c("No nearby sites")) %>%
    tidyr::separate_rows(TADA.MonitoringLocationIdentifier, sep = ",")

This may be a result of the changes to TADA_FindNearbySites.
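
To make the behavior of that step concrete, here is a toy sketch of what the filter and tidyr::separate_rows() do to a column holding a comma-separated ID list, a "No nearby sites" marker, and a logical-style NA. The data frame and its values are invented for illustration and are not taken from the failing data sets; this shows how the step treats NA, not a confirmed diagnosis.

```r
library(tidyr)

# Toy data (assumed values): one grouped row, one marker row, one NA row.
dupsites <- data.frame(
  ResultIdentifier = c("r1", "r2", "r3"),
  TADA.MonitoringLocationIdentifier = c("SITE-A,SITE-B", "No nearby sites", NA)
)

# Same filter as in the snippet above: NA %in% "No nearby sites" is FALSE,
# so the NA row is kept rather than dropped.
filtered <- subset(
  dupsites,
  !dupsites$TADA.MonitoringLocationIdentifier %in% c("No nearby sites")
)

# separate_rows() splits "SITE-A,SITE-B" into two rows and keeps the NA row.
expanded <- tidyr::separate_rows(filtered, TADA.MonitoringLocationIdentifier, sep = ",")

nrow(filtered)
nrow(expanded)
```

In this toy case the filter only removes the marker row, so any record whose identifier is NA passes through to the later steps.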

Aug 22 '24 18:08 hillarymarler

TADA.MonitoringLocationIdentifier contains logical NA values in .data, whereas in dupsdat the same column contains the character string "NA".

typeof(dupsdat$TADA.MonitoringLocationIdentifier)
[1] "character"
df_nearby_sites_test <- TADA_FindNearbySites(df_ex)
[1] "No nearby sites detected using input buffer distance."
typeof(df_nearby_sites_test$TADA.MonitoringLocationIdentifier)
[1] "logical"

Inserting the following at line 1278, under the comment "# connect back to original dataset", may be a solution:

dplyr::mutate( TADA.MonitoringLocationIdentifier = ifelse(TADA.MonitoringLocationIdentifier %in% NA, "NA", TADA.MonitoringLocationIdentifier)) %>%

Alternatively, if there is a preferred type for TADA.MonitoringLocationIdentifier, the column could be converted to that type within TADA_FindNearbySites() itself.
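
A minimal, offline sketch of the type mismatch and the proposed mutate is below. The data frame is a toy stand-in for the output of TADA_FindNearbySites() when no nearby sites are found; its contents are assumed, not copied from real output.

```r
library(dplyr)

# Toy stand-in: an all-NA column is stored as logical by default.
nearby <- data.frame(
  ResultIdentifier = c("r1", "r2"),
  TADA.MonitoringLocationIdentifier = c(NA, NA)
)
typeof(nearby$TADA.MonitoringLocationIdentifier)

# Proposed fix: replace NA with the character string "NA",
# which also coerces the column to character.
fixed <- nearby %>%
  dplyr::mutate(
    TADA.MonitoringLocationIdentifier = ifelse(
      TADA.MonitoringLocationIdentifier %in% NA,
      "NA",
      TADA.MonitoringLocationIdentifier
    )
  )
typeof(fixed$TADA.MonitoringLocationIdentifier)
```

If a true missing value is preferred over the string "NA", as.character() on the column would give a character column that still holds NA; which of the two the downstream code expects is an open question here.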

Aug 23 '24 18:08 wokenny13