[Bug] r.external -m can trigger reading of the whole dataset
Describe the bug
The -m flag of r.external is described as "read data range from metadata". However, running r.external -m on a COG downloads roughly the same amount of data as r.in.gdal or r.external without -m, so in practice the whole dataset is downloaded either way.
Given the use of the underlying GDAL function (discussed below), there is likely a discrepancy between the flag description and the actual behavior.
To Reproduce
$ grass8 -c EPSG:32620 ~/grassdata/epsg_32620
> r.external source=/vsicurl/http://oin-hotosm.s3.amazonaws.com/59c66c5223c8440011d7b1e4/0/7ad397c0-bba2-4f98-a08a-931ec3a6e943.tif output=test_m -m band=1
Use a network traffic monitoring tool to get the download size, or use run time as a proxy, since most of the time is spent downloading anyway. I used sudo nethogs -v 3.
Expected behavior
I would expect the module either to read the data range from metadata (as the flag description says) or to fail without processing (and, here, downloading) the whole dataset.
Given the different options GDAL provides, it might be reasonable to have two or three flags that expose these options to the user. In that scenario, -m would need a more precise description, and another flag would implement the behavior promised by the current -m description.
Screenshots
r.in.gdal, r.external, r.external -m, r.external -r runs ordered by (real and user) time:
> time r.in.gdal input=/vsicurl/http://oin-hotosm.s3.amazonaws.com/59c66c5223c8440011d7b1e4/0/7ad397c0-bba2-4f98-a08a-931ec3a6e943.tif output=test_gdal band=1 --o
Importing raster map <test_gdal>...
100%
real 0m49.290s
user 0m31.064s
sys 0m1.745s
> time r.external source=/vsicurl/http://oin-hotosm.s3.amazonaws.com/59c66c5223c8440011d7b1e4/0/7ad397c0-bba2-4f98-a08a-931ec3a6e943.tif output=test_defaults band=1 --o
Reading band 1 of 3...
Link to raster map <test_defaults> created.
real 0m27.309s
user 0m10.392s
sys 0m1.787s
> time r.external source=/vsicurl/http://oin-hotosm.s3.amazonaws.com/59c66c5223c8440011d7b1e4/0/7ad397c0-bba2-4f98-a08a-931ec3a6e943.tif output=test_m -m band=1 --o
Reading band 1 of 3...
WARNING: Statistics in metadata are sometimes approximations: min and max
can be wrong!
Link to raster map <test_m> created.
real 0m23.442s
user 0m6.578s
sys 0m1.703s
> time r.external source=/vsicurl/http://oin-hotosm.s3.amazonaws.com/59c66c5223c8440011d7b1e4/0/7ad397c0-bba2-4f98-a08a-931ec3a6e943.tif output=test_r -r band=1 --o
Reading band 1 of 3...
Link to raster map <test_r> created.
real 0m0.503s
user 0m0.297s
sys 0m0.650s
sudo nethogs -v 3 output for multiple runs of each command, ordered by received (modified for display here):
2533743 r.external 0.732 105.834 MB <-- with or without -m
2533931 r.external 0.736 105.824 MB <-- with or without -m
2534347 r.external 0.723 105.820 MB <-- with or without -m
2534005 r.external 0.724 105.812 MB <-- with or without -m
2534310 r.external 0.713 105.811 MB <-- with or without -m
2533965 r.external 0.706 105.797 MB <-- with or without -m
2533819 r.external 0.215 105.199 MB <-- with or without -m
2534766 r.in.gdal 0.705 104.839 MB <-- r.in.gdal
2535108 r.in.gdal 0.698 104.833 MB <-- r.in.gdal
2534386 r.external 0.003 0.052 MB <-- with -r
2533865 r.external 0.003 0.052 MB <-- with -r
There is no distinction between r.external with and without the -m flag, so I didn't even keep track of which run was which; hence they are mixed together and marked as "with or without -m".
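For anyone repeating the measurement, the nethogs rows above can be averaged per group with a few lines of Python. This is only a sketch: it assumes the columns are PID, program, sent, received and unit, followed by my hand-written label after the arrow (the output was edited for display, so the real nethogs layout differs), and the function name summarize is made up for this example.

```python
from collections import defaultdict
from statistics import mean

# Rows copied from the nethogs output in this report.
# Assumed columns: PID, program, sent, received, unit, then a free-text label.
NETHOGS_ROWS = """\
2533743 r.external 0.732 105.834 MB <-- with or without -m
2533931 r.external 0.736 105.824 MB <-- with or without -m
2534347 r.external 0.723 105.820 MB <-- with or without -m
2534005 r.external 0.724 105.812 MB <-- with or without -m
2534310 r.external 0.713 105.811 MB <-- with or without -m
2533965 r.external 0.706 105.797 MB <-- with or without -m
2533819 r.external 0.215 105.199 MB <-- with or without -m
2534766 r.in.gdal 0.705 104.839 MB <-- r.in.gdal
2535108 r.in.gdal 0.698 104.833 MB <-- r.in.gdal
2534386 r.external 0.003 0.052 MB <-- with -r
2533865 r.external 0.003 0.052 MB <-- with -r
"""


def summarize(rows):
    """Return the mean received size in MB for each label after '<--'."""
    received = defaultdict(list)
    for line in rows.splitlines():
        if not line.strip():
            continue  # skip blank lines
        fields, label = line.split("<--")
        _pid, _program, _sent, recv, _unit = fields.split()
        received[label.strip()].append(float(recv))
    return {label: mean(values) for label, values in received.items()}


if __name__ == "__main__":
    for label, avg in summarize(NETHOGS_ROWS).items():
        print(f"{label}: {avg:.3f} MB received on average")
```

The grouping key is the label rather than the program name, because the -r runs and the other r.external runs share the same program name.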
System description (please complete the following information):
- Operating System: Linux
- GRASS GIS version: main
Additional context
The relevant GDAL call in source code is (pseudo code):
GDALGetRasterStatistics(..., bApproxOK=false, bForce=true, ...)
GDALGetRasterStatistics does not have its own detailed description, but it links to GDALRasterBand::GetStatistics, which says:
- bApproxOK – If TRUE statistics may be computed based on overviews or a subset of all tiles.
- bForce – If FALSE statistics will only be returned if it can be done without rescanning the image.
So the call for r.external -m may re-scan the image and if it does, it will re-scan the full image (not just overviews or a subset).
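To make the flag combinations easier to discuss, here is a toy model in Python of how I read the two parameters described above. It is not GDAL code and does not call GDAL; the names StatsSource and statistics_source are invented for this sketch, and it deliberately ignores details such as stored statistics that are themselves approximate.

```python
from enum import Enum


class StatsSource(Enum):
    """Where the statistics would come from in this simplified model."""
    METADATA = "stored statistics, no pixel data read"
    APPROXIMATE = "overviews or a subset of tiles"
    FULL_SCAN = "every pixel of the band"
    NONE = "no statistics returned"


def statistics_source(has_stored_stats, approx_ok, force):
    """Model the documented meaning of bApproxOK and bForce.

    Assumption: stored statistics are always used when present.
    """
    if has_stored_stats:
        return StatsSource.METADATA
    if not force:
        # bForce=FALSE: only return statistics if no rescan is needed.
        return StatsSource.NONE
    # bForce=TRUE: statistics are computed; bApproxOK decides how much is read.
    return StatsSource.APPROXIMATE if approx_ok else StatsSource.FULL_SCAN


if __name__ == "__main__":
    # The combination currently used for r.external -m, on a file without stored statistics.
    print(statistics_source(has_stored_stats=False, approx_ok=False, force=True).value)
```

In this model, the behavior promised by the current -m description corresponds to force=False, and a cheaper approximate option would correspond to approx_ok=True with force=True.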
Yes, indeed. At least a warning message would help the user to understand what is going on...