whitebox-tools icon indicating copy to clipboard operation
whitebox-tools copied to clipboard

wbt_breach_depressions_least_cost: different outputs from identical calls

Open vlahm opened this issue 4 years ago • 11 comments
trafficstars

Here's a demonstration of different cell values resulting from two identical calls to wbt_breach_depressions_least_cost. This issue doesn't seem to occur for wbt_breach_depressions. Setting R's RNG seed doesn't solve this, so maybe it has to do with Python's RNG. I'm on Ubuntu 20.04, running R 4.0.4 and whitebox 1.4.

library(tidyverse)
library(raster)
#> Loading required package: sp
#> 
#> Attaching package: 'raster'
#> The following object is masked from 'package:dplyr':
#> 
#>     select
#> The following object is masked from 'package:tidyr':
#> 
#>     extract
library(whitebox)
library(sf)
#> Linking to GEOS 3.8.0, GDAL 3.0.4, PROJ 6.3.1
library(elevatr)

lat <- 42.56257
long <- -71.10933
proj4 <- '+proj=cea +lon_0=-71.1093305 +lat_ts=0 +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs'

tmp <- tempdir()
dem_f <- paste0(tmp, '/dem.tif')
point_f <- paste0(tmp, '/point.shp')

site <- tibble(x = lat, y = long) %>%
    sf::st_as_sf(coords = c("y", "x"), crs = 4326) %>%
    sf::st_transform(proj4)

site_buf <- sf::st_buffer(x = site, dist = 1000)
dem <- elevatr::get_elev_raster(locations = site_buf, z = 10, verbose = FALSE)
#> Mosaicing & Projecting

test <- function(){

    raster::writeRaster(x = dem, filename = dem_f, overwrite = TRUE)
    sf::st_write(obj = site, dsn = point_f, delete_layer = TRUE, quiet = TRUE)

    whitebox::wbt_fill_single_cell_pits(dem = dem_f, output = dem_f)

    whitebox::wbt_breach_depressions_least_cost(
        dem = dem_f,
        output = dem_f,
        dist = 10000,
        fill = TRUE)

    return(raster::values(raster::raster(dem_f)))
}

vals_1 <- test()
vals_2 <- test()

any(na.omit(vals_1 != vals_2))
#> [1] TRUE

Created on 2021-03-25 by the reprex package (v1.0.0)

vlahm avatar Mar 25 '21 19:03 vlahm

I'm sorry, I don't quite understand what you are saying is wrong. Are you suggesting that multiple runs of the the least cost breaching tool produces different results? There is absolutely no stochastic component of this tool. There is no random number generator involved, certainly not R's or Pythons (the tool is written in Rust). Can you clarify what the issue is?

jblindsay avatar Mar 25 '21 20:03 jblindsay

Yes, multiple runs produce different results. The line at the end there is indicating that there is at least one cell value that differs between the two DEMs written. I just mentioned RNGs because I couldn't conceive of any other source for this.

vlahm avatar Mar 25 '21 20:03 vlahm

@vlahm Can you create the difference image and show it to see where these differed pixels are located?

giswqs avatar Mar 25 '21 20:03 giswqs

@giswqs The difference is not reproducible (it changes every time), but changing the last line in my reprex to which(! is.na(vals_1) & ! is.na(vals_2) & vals_1 != vals_2) will reveal the positions that are different.

vlahm avatar Mar 25 '21 21:03 vlahm

The only thing that I can think that this could be is small imprecisions associated with floating point equivalency. The outputs of this tool are 64-bit floats and there are very small values (flat increment values) that are created in the DEM. But I can confirm, there is no stochastic element to the tool, other than the imprecision of floating-point values (which is the case with anything working with floats). I suspect that if you do a DEM of difference, as @giswqs is suggesting, you will find the the elevation differences are incredibly small numbers.

jblindsay avatar Mar 25 '21 22:03 jblindsay

Here is a distribution of the differences across the 73188 cells that differed (out of 1324960 cells total). The maximum absolute difference is 19.9999 (meters).

diffs

vlahm avatar Mar 26 '21 00:03 vlahm

I looked into this because I was getting different watershed delineations from repeated calls to the same code. Here is the addendum to the above code that produced the above plot:

diffs = vals_1[different_cell_indices] - vals_2[different_cell_indices]
plot(density(diffs))

vlahm avatar Mar 26 '21 00:03 vlahm

I have performed the analysis and can confirm that consecutive runs of this tool will result in slightly different DEMs. The average difference in height is of the order of floating-point precision (the largest that I saw in my testing was about 0.0000227), but there are sometimes a small handful of cells in the DEM that will have appreciable differences in height, that can be meters in magnitude. This is because the breach paths are somewhat modified between runs resulting in breach paths that may cut through one neighbouring cell rather than another.

This in itself does not bother me. After all, unlike depression filling, for which there is really one possible solution, there are an infinite number of possible breach solutions. (There are many ways to fill a depression but ultimately there is only one form that filled depression can take at the end--other than the flats fixing component to the depression filling option.) What bothers me about this is that I cannot, at the moment, understand where this non-deterministic behaviour is coming from in the algorithm implementation. At first I thought that it was possibly due to the use of a Hashmap. Rust does not guarantee the order of items retrieved from Hashmaps. But I do not use a Hashmap in this tool. My instinct still tells me that it is the result of floating-point precision, but I would like to confirm this.

Regardless of all of this, I would argue that this behaviour of the tool should not stop anyone from using this tool. I simply want to understand, from a personal perspective, where the stochastic nature of the algorithm implementation is coming from. The differences between DEMs in consecutive runs are extremely small and there is no physical basis for judging one solution to be 'more correct' than another.

jblindsay avatar Mar 26 '21 11:03 jblindsay

Thank you for looking into this @jblindsay . I agree, this non-deterministic behavior does not diminish the accuracy of the tool. There is a use case for which it is prohibitive, however: storing and sharing specifications for large-scale hydrologic analysis. For example, say a team of developers is working on a global spatial analysis. Rather than sharing terabytes of raster data, they share specifications for deriving the same dataset. wbt_breach_depressions_least_cost cannot be part of such a project, because it may result in a different dataset in the hands of each developer. Concretely, watershed delineations can be substantially different depending on how a DEM is breached. For anyone who may need to perform the same delineation (or another task that requires flow conditioning) more than once, you may need to use wbt_breach_depressions, which does appear to produce identical output for identical calls.

vlahm avatar Mar 26 '21 13:03 vlahm

This is what is so confusing about this behaviour for me. I have not really coded this tool any differently than the wbt.breach_depressions tool with respect to the environment. Both use the same data types and general methods. Why one would be non-deterministic and the other not, is unexpected. I will continue to look into this behaviour to see if I can find the cause and eliminate it if possible.

jblindsay avatar Mar 26 '21 14:03 jblindsay

It is very bizarre! If Rust has a strict IEEE754-compliant mode, that might make 'wbt_breach_depressions_least_cost' cross-system deterministic, though I don't know what else it might introduce. This post has some interesting things to say.

vlahm avatar Mar 26 '21 16:03 vlahm