
Enable datalake storage

fBedecarrats opened this issue 1 year ago · 52 comments

Background The package now implements parallelization using a backend that works well with big data environments. With large numbers of assets, the main bottleneck is now the way data is stored and accessed. Enabling storage and access through datalake APIs (S3, Azure Blob Storage) could further enhance the package's performance.

Definition of done The package allows referencing datalake storage (Azure or S3) in the same way that it allows referencing a location on the local file system.

Complexity To be assessed. My first impression is that it would be low to medium.

fBedecarrats avatar Mar 31 '23 09:03 fBedecarrats

Back to this issue. Some benefits would be:

  • to avoid re-downloading resources for each project;
  • a quicker read-write when processing in the cloud;
  • to enhance the performance of parallel processing, as data transfer between cores seems to be the current bottleneck for this approach.

Possible first steps could be:

  • Identify occurrences of this question in existing issues and discussions on this repo to gather what has already been exchanged on the topic (cf. #164, #92)
  • clarify the spectrum of cloud solutions we need to support (S3 protocols for MinIO and AWS?, equivalents in Azure and GCP?)
  • prioritize some use cases we could start with (e.g. annual forest cover loss? daily precipitation?)

I suggest we organize a specific webex focusing on this issue with the interested users/developers.

fBedecarrats avatar Jun 12 '23 15:06 fBedecarrats

An important distinction to realize here is that what is proposed in this issue does not change the overall paradigm of the package ("download first, compute later") but simply aims at allowing users to select (different?) cloud storage backends instead of the local file directory. This is not the same as the evolving discussion about the usage of cloud-native geospatial formats. That discussion could actually shift the paradigm of the package towards something like "query while computing".

goergen95 avatar Jun 12 '23 16:06 goergen95

For storage services implementing the S3 API (AWS or MinIO), two R packages are available: {aws.s3}, which is the one I use, and {paws}. {aws.s3} has two functions that would come in particularly handy in our case: s3read_using() and s3write_using() (see the documentation).
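
For instance, reading the tile index of a resource directly from a bucket could look roughly like this (the bucket and object names here are just for illustration):

# Hypothetical example: read a GeoPackage stored on S3 into an sf object.
# s3read_using() downloads the object to a temporary file and applies FUN to it.
tindex <- aws.s3::s3read_using(
  FUN    = sf::read_sf,
  object = "mapme_biodiversity/chirps/tileindex_chirps.gpkg",
  bucket = "my-bucket"
)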

What I would imagine is the following: add an optional parameter such as storage_type = to init_portfolio(), which could take values such as c("local_filesystem", "aws_s3", "azure_blob", "gcp_whatever"), with local_filesystem being the default, and modify calc_indicator.R and get_resources.R to use one of these functions.

In calc_indicator.R, the following lines:

if (resource_type == "raster") {
      tindex <- read_sf(available_resources[resource_name], quiet = TRUE)
      out <- .read_raster_source(shp, tindex, rundir)
    } else if (resource_type == "vector") {
      out <- lapply(available_resources[[resource_name]], function(source) {
        tmp <- read_sf(source, wkt_filter = st_as_text(st_as_sfc(st_bbox(shp))))
        st_make_valid(tmp)
      })
      names(out) <- basename(available_resources[[resource_name]])
    } else {
      stop(sprintf("Resource type '%s' currently not supported", resource_type))
    }

would be modified with something like:

    if (resource_type == "raster") {
      if (storage_type == "aws_s3") {
        tindex <- aws.s3::s3read_using(FUN = read_sf,
                                       object = available_resources[resource_name])
        out <- aws.s3::s3read_using(FUN = .read_raster_source,
                                    object = available_resources[resource_name])
      } else {
        tindex <- read_sf(available_resources[resource_name], quiet = TRUE)
        out <- .read_raster_source(shp, tindex, rundir)
      }
    }

What do you think @goergen95? Just to clarify, this only takes advantage of the performance gains of reading cloud storage from a cloud computing environment. Further enhancement could come from improving the spatial filtering when reading, so that every read only focuses on the area of interest.

fBedecarrats avatar Jun 14 '23 07:06 fBedecarrats

Here are some thoughts:

  1. Supporting each of the cloud infrastructures increases the dependencies of the package. From my point of view it is a very particular use case, so I would opt to make this optional for users who need it (thus moving the additional dependencies to Suggests and making sure that the required namespaces are available; see the short sketch after this list)
  2. Concerning data I/O, you could investigate how far the GDAL Virtual File System drivers could be used to omit additional dependencies
  3. I am against an additional argument in init_portfolio(). Internal code should handle whether to write to the local file system or a supported cloud storage based on the string supplied to outdir.
  4. The code to support this should not directly be implemented in get_resources() or similar. In order to allow efficient maintenance and testing we would need to see working back-end code for writing and reading of raster and vector data for both the local file system and cloud storage types. These methods then should be called in get_resources() and elsewhere.
  5. Why do you expect improvements in the read performance in the cloud? I see the benefit that you can store the data in a shared bucket (or whatever name the providers give these things now) but I expect it to be slower compared to storing the data on the machine where your R instance runs.
  6. We already apply spatial filters when reading in the resources for a specific polygon.
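
A minimal sketch of the optional-dependency pattern from point 1 (the helper name is made up):

# Hypothetical helper: only require {aws.s3} when an S3 backend is requested,
# so the package itself does not need it in Imports.
.check_s3_available <- function() {
  if (!requireNamespace("aws.s3", quietly = TRUE)) {
    stop("Support for S3 storage requires the 'aws.s3' package. ",
         "Please install it, e.g. with install.packages('aws.s3').",
         call. = FALSE)
  }
  invisible(TRUE)
}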

goergen95 avatar Jun 14 '23 08:06 goergen95

The GDAL Virtual File System indeed seems like a great lead, although I'm stuck where I would expect the package to identify the resources. Below is a detailed example (although not that reproducible if you don't have access to my platform; maybe that could be worked out). I have a MinIO S3 bucket named "fbedecarrats". On it, there is a folder "mapme_biodiversity" with a subfolder "chirps" that contains all the global resources used by the mapme.biodiversity package for CHIRPS.

library(tidyverse)
library(aws.s3) # the package used to access the S3 API 

get_bucket_df("fbedecarrats", prefix = "mapme_biodiversity", region = "") %>%
  head(5) %>%
  pluck("Key")

# [1] "mapme_biodiversity/chirps/chirps-v2.0.1981.01.cog" "mapme_biodiversity/chirps/chirps-v2.0.1981.02.cog"
# [3] "mapme_biodiversity/chirps/chirps-v2.0.1981.03.cog" "mapme_biodiversity/chirps/chirps-v2.0.1981.04.cog"
# [5] "mapme_biodiversity/chirps/chirps-v2.0.1981.05.cog"

Using the GDAL Virtual File System driver for S3, access to files stored in S3 is straightforward: one just needs to specify the location on the S3 bucket as if it were on the local filesystem and prepend "/vsis3/". Nota bene: the credentials to access the S3 storage must be set (this is automatic on my cloud environment, but otherwise they need to be specified manually).
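
For reference, setting them manually could look like this (placeholder values):

# Placeholder values; the /vsis3/ driver reads these standard variables.
Sys.setenv(
  "AWS_ACCESS_KEY_ID"     = "<my-access-key>",
  "AWS_SECRET_ACCESS_KEY" = "<my-secret-key>",
  "AWS_S3_ENDPOINT"       = "<s3-endpoint>"  # only needed for non-AWS endpoints such as MinIO
)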

library(terra)
chirps1 <- rast("/vsis3/fbedecarrats/mapme_biodiversity/chirps/chirps-v2.0.1981.01.cog")
print(chirps1)
# class       : SpatRaster 
# dimensions  : 2000, 7200, 1  (nrow, ncol, nlyr)
# resolution  : 0.05, 0.05  (x, y)
# extent      : -180, 180, -50, 50  (xmin, xmax, ymin, ymax)
# coord. ref. : lon/lat WGS 84 (EPSG:4326) 
# source      : chirps-v2.0.1981.01.cog 
# name        : chirps-v2.0.1981.01 

The init_portfolio function seems to work at first sight.

library(sf)
library(mapme.biodiversity)
neiba <- system.file("extdata", "sierra_de_neiba_478140_2.gpkg", 
                     package = "mapme.biodiversity") %>%
  sf::read_sf()

pf <- init_portfolio(neiba, years = 2000:2020, 
                     outdir = "/vsis3/fbedecarrats/mapme_biodiversity")
str(pf)
# sf [1 × 6] (S3: sf/tbl_df/tbl/data.frame)
#  $ WDPAID   : num 478140
#  $ NAME     : chr "Sierra de Neiba"
#  $ DESIG_ENG: chr "National Park"
#  $ ISO3     : chr "DOM"
#  $ geom     :sfc_POLYGON of length 1; first list element: List of 4
#   ..$ : num [1:1607, 1:2] -71.8 -71.8 -71.8 -71.8 -71.8 ...
#   ..$ : num [1:5, 1:2] -71.4 -71.4 -71.4 -71.4 -71.4 ...
#   ..$ : num [1:4, 1:2] -71.5 -71.5 -71.5 -71.5 18.6 ...
#   ..$ : num [1:5, 1:2] -71.5 -71.5 -71.5 -71.5 -71.5 ...
#   ..- attr(*, "class")= chr [1:3] "XY" "POLYGON" "sfg"
#  $ assetid  : int 1
#  - attr(*, "sf_column")= chr "geom"
#  - attr(*, "agr")= Factor w/ 3 levels "constant","aggregate",..: NA NA NA NA NA
#   ..- attr(*, "names")= chr [1:5] "WDPAID" "NAME" "DESIG_ENG" "ISO3" ...
#  - attr(*, "nitems")= int 1
#  - attr(*, "bbox")= 'bbox' Named num [1:4] -71.8 18.6 -71.3 18.7
#   ..- attr(*, "names")= chr [1:4] "xmin" "ymin" "xmax" "ymax"
#   ..- attr(*, "crs")=List of 2
#   .. ..$ input: chr "WGS 84"
#   .. ..$ wkt  : chr "GEOGCRS[\"WGS 84\",\n    DATUM[\"World Geodetic System 1984\",\n        ELLIPSOID[\"WGS 84\",6378137,298.257223"| __truncated__
#   .. ..- attr(*, "class")= chr "crs"
#  - attr(*, "resources")= list()
#  - attr(*, "years")= int [1:21] 2000 2001 2002 2003 2004 2005 2006 2007 2008 2009 ...
#  - attr(*, "outdir")= chr "/vsis3/fbedecarrats/mapme_biodiversity"
#  - attr(*, "tmpdir")= chr "/tmp/RtmpXASngm"
#  - attr(*, "verbose")= logi TRUE
#  - attr(*, "testing")= logi FALSE

However, although all the COG files are present in the chirps subfolder, the resources were not recognized and the package attempts to download them again (which is not possible, as it cannot write to S3 with this protocol).

pf <- pf %>%
  get_resources("chirps")
# Starting process to download resource 'chirps'........
#   |                                                  | 0 % ~calculating  
# <simpleWarning in download.file(missing_urls[i], missing_filenames[i], quiet = TRUE,     mode = ifelse(Sys.info()["sysname"] == "Windows", "wb", "w")): URL https://data.chc.ucsb.edu/products/CHIRPS-2.0/global_monthly/cogs/chirps-v2.0.1981.01.cog: cannot open destfile '/vsis3/fbedecarrats/mapme_biodiversity/chirps/chirps-v2.0.1981.01.cog', reason 'No such file or directory'>
# Warning message:
# Download for resource chirps failed. Returning unmodified portfolio object.

pf <- pf %>%
   calc_indicators("precipitation_chirps",
                    engine = "exactextract",
                    scales_spi = 3,
                    spi_prev_years = 8)
# Error in .check_existing_resources(existing_resources, required_resources,  : 
#   The following required resource is not available: chirps.

The resources don't get recognized because they are indexed with the local path, e.g. "/home/onyxia/work/perturbations_androy/chirps/chirps-v2.0.1981.01.cog". I'll try to modify and replace it.

# Read existing
tindex <- st_read("/vsis3/fbedecarrats/mapme_biodiversity/chirps/tileindex_chirps.gpkg")
# Correct path
tindex2 <- tindex %>%
  mutate(location = str_replace(location, 
                                "/home/onyxia/work/perturbations_androy/",
                                "/vsis3/fbedecarrats/mapme_biodiversity/"))
# write locally
st_write(tindex2, "tileindex_chirps.gpkg")
# replace object in S3
put_object(file = "tileindex_chirps.gpkg",
    object = "mapme_biodiversity/chirps/tileindex_chirps.gpkg",
    bucket = "fbedecarrats",
    region = "",
    multipart = TRUE)

After correcting the paths in the tile index, the presence of the resources is still not detected.

pf <- init_portfolio(neiba, years = 2000:2020, 
                     outdir = "/vsis3/fbedecarrats/mapme_biodiversity")
# Starting process to download resource 'chirps'........
#   |                                                  | 0 % ~calculating  
# <simpleWarning in download.file(missing_urls[i], missing_filenames[i], quiet = TRUE,     mode = ifelse(Sys.info()["sysname"] == "Windows", "wb", "w")): URL https://data.chc.ucsb.edu/products/CHIRPS-2.0/global_monthly/cogs/chirps-v2.0.1981.01.cog: cannot open destfile '/vsis3/fbedecarrats/mapme_biodiversity/chirps/chirps-v2.0.1981.01.cog', reason 'No such file or directory'>
# Warning message:
# Download for resource chirps failed. Returning unmodified portfolio object. 
pf <- pf %>%
  get_resources("chirps")
# Error in .check_existing_resources(existing_resources, required_resources,  : 
#   The following required resource is not available: chirps.

I don't understand why the package does not identify that the resource is already present, as it would do on the local filesystem.

fBedecarrats avatar Jun 17 '23 00:06 fBedecarrats

We are indeed in need of a reproducible example. Please set one up using the minio docker image, covering first principles. We need to figure out how we can authenticate against the S3 server, and write and read vector and raster data using the driver, before we can expect the package to auto-magically handle S3.

goergen95 avatar Jun 17 '23 04:06 goergen95

You are right, we need a reproducible example. Instead of creating an ad hoc S3 server that will need credentials anyway, I wonder if it would not be simpler if I generate tokens that enable another user to access my S3 bucket on the existing MinIO server from anywhere, or generate tokens that enable another user to access a running pod (RStudio server) where all the environment parameters are pre-set to connect to the S3 bucket. I would need to communicate the tokens through a private channel though. What do you think @goergen95?

fBedecarrats avatar Jun 17 '23 07:06 fBedecarrats

I do not favor that option because it is not reproducible by anyone else. It is also not about the environment parameters, because this is something the code will have to take care of eventually, and I would actually like to see what is needed in terms of parameters in a reproducible example. Also, we will have to think about how to include tests for the new functionality in the package eventually. You might take some inspiration on how to set things up from this repository here.

goergen95 avatar Jun 19 '23 06:06 goergen95

A question on the side: how great are the performance gains @fBedecarrats? Is it possible to make a benchmark? My intuition tells me that local file storage is always superior, if "local" means in this context "locally in the cloud", wherever your R environment is installed. I would not expect reads and writes to be much faster with cloud-optimized storage, nor I/O to be the bottleneck for mass processing... but of course I might be wrong, so a benchmark would be really great.

If the performance gains for individual users are not much higher, then the value added of this feature would be to enable more collaboration across users and projects for a specific IT setup... We should discuss how far we want to support this, because there might be multiple solutions to that problem, and I would see it more on the side of IT architects to enable collaboration within a specific IT infrastructure given a tool/technology that exists... instead of the other way around (making your tool fit a multitude of environments/IT setups)...

Note: In this specific case a shared network drive within your environment might already solve the problem and there would be no need for AWS or whatsoever...

Jo-Schie avatar Jun 21 '23 07:06 Jo-Schie

A question on the side: how great are the performance gains @fBedecarrats? Is it possible to make a benchmark? My intuition tells me that local file storage is always superior, if "local" means in this context "locally in the cloud", wherever your R environment is installed. I would not expect reads and writes to be much faster with cloud-optimized storage, nor I/O to be the bottleneck for mass processing... but of course I might be wrong, so a benchmark would be really great.

If the performance gains for individual users are not much higher, then the value added of this feature would be to enable more collaboration across users and projects for a specific IT setup... We should discuss how far we want to support this, because there might be multiple solutions to that problem, and I would see it more on the side of IT architects to enable collaboration within a specific IT infrastructure given a tool/technology that exists... instead of the other way around (making your tool fit a multitude of environments/IT setups)...

Note: In this specific case a shared network drive within your environment might already solve the problem and there would be no need for AWS or whatsoever...

Yes, your comment echoes @goergen95's comment above:

  1. Why do you expect improvements in the read performance in the cloud? I see the benefit that you can store the data in a shared bucket (or whatever name the providers give these things now) but I expect it to be slower compared to storing the data on the machine where your R instance runs.

I am really not sure about this, but I thought it might help in the following sense: when I set a parallel computing strategy, for example with the {future} option plan(cluster), the current R process becomes one worker and {future} creates additional ones. Apparently, transferring data between workers becomes a bottleneck when I reach 10-12 of them. My hypothesis is that the initial process must transfer its data to the other workers and that this is slow. My idea then is that if all workers read the data from a third-party source with very good performance (i.e. S3 in my case, or Azure Blob Storage in yours), even with concurrent access, then the bottleneck is removed and we would see significant performance improvements above 10-12 parallel workers, which is not the case currently. But I don't clearly understand the parallelization process and it is guesswork at this stage. I think it is worthwhile to implement S3 reading for the sake of data mutualization among several analyses, and a performance improvement would be the cherry on the cake if it really works. Does that make sense?
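
To make the idea concrete, here is a rough, unbenchmarked sketch of what "every worker reads directly from S3" could look like (file paths follow the CHIRPS example above, the worker count is arbitrary):

# Each worker opens its own /vsis3/ connection instead of receiving the data
# from the main session; the S3 credentials must be available to the workers too.
library(future)
library(future.apply)
plan(multisession, workers = 12)

cogs <- sprintf("/vsis3/fbedecarrats/mapme_biodiversity/chirps/chirps-v2.0.%d.01.cog",
                1981:1992)

means <- future_lapply(cogs, function(path) {
  r <- terra::rast(path)                 # read directly from the bucket
  terra::global(r, "mean", na.rm = TRUE) # toy computation per worker
})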

fBedecarrats avatar Jun 21 '23 16:06 fBedecarrats

I agree that we need to see some specific benchmark scripts to further discuss this issue. Also, consider that we make some promises in the README:

It supports computational efficient routines and heavy parallel computing in cloud-infrastructures such as AWS or AZURE using in the statistical programming language R.

So I think it is worthwhile to investigate how we can deliver on that promise by supporting different types of cloud storage "natively" in the package. I don't expect performance improvements right away but I also do not think it is a priority at this stage.

goergen95 avatar Jun 21 '23 16:06 goergen95

It is definitely an interesting hypothesis by @fBedecarrats. Seen from this perspective, scalable reads and writes may solve a bottleneck where extra CPUs on top just do not yield any significant gains anymore. You may also read the Wikipedia article on Amdahl's law on this for starters.

In computer architecture, Amdahl's law (or Amdahl's argument) is a formula which gives the theoretical speedup in latency of the execution of a task at fixed workload that can be expected of a system whose resources are improved. It states that "the overall performance improvement gained by optimizing a single part of a system is limited by the fraction of time that the improved part is actually used".

And more below

Amdahl's law does represent the law of diminishing returns if one is considering what sort of return one gets by adding more processors to a machine, if one is running a fixed-size computation that will use all available processors to their capacity. Each new processor added to the system will add less usable power than the previous one. Each time one doubles the number of processors the speedup ratio will diminish, as the total throughput heads toward the limit of 1/(1 − p). This analysis neglects other potential bottlenecks such as memory bandwidth and I/O bandwidth. If these resources do not scale with the number of processors, then merely adding processors provides even lower returns.
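
To put numbers on that limit (p chosen arbitrarily):

# With p = 0.9 of the work parallelizable, the speedup is capped at 1 / (1 - p) = 10.
amdahl <- function(p, n) 1 / ((1 - p) + p / n)
round(amdahl(0.9, c(2, 4, 8, 16, 32, 64)), 1)
# 1.8 3.1 4.7 6.4 7.8 8.8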

This is pretty much what we observed in @Ohm-Np's master thesis. Maybe @Ohm-Np can link an online copy of his thesis here?

Jo-Schie avatar Jun 21 '23 18:06 Jo-Schie

Ok, can we agree that this issue here is about enabling (some) cloud storage types? I would suggest that we discuss parallelization strategies and improvements elsewhere and further down the line.

goergen95 avatar Jun 22 '23 06:06 goergen95

Ok, can we agree that this issue here is about enabling (some) cloud storage types? I would suggest that we discuss parallelization strategies and improvements elsewhere and further down the line.

Yes!

fBedecarrats avatar Jun 22 '23 09:06 fBedecarrats

We are indeed in need of a reproducible example. Please set one up using the minio docker image and covering principles-first. We need to figure out how we can authenticate against the S3 server, and write and read vector and raster data using the driver before we can expect the package to auto-magically handle S3.

OK. After several attempts, it seems that I cannot set up Docker with the Linux pods I am using on Kubernetes. I cannot do it with my work Windows PC either. I need to find a machine on which I can launch Docker. I don't know when I will be able to achieve that.

fBedecarrats avatar Jun 22 '23 09:06 fBedecarrats

Yes!

Great! Then I would suggest focusing on S3 and Azure Blob as a starting point, and maybe Google Cloud Storage later. For S3 it should be possible to use minio to set up a testing environment. I am not sure about Azure. Reading and writing geospatial data through GDAL should be easy. The main problem I see is that we cannot list already existing files on these systems without further dependencies. We thus need reprexes for both storage types that show how to read/write raster and vector data and list existing files.

goergen95 avatar Jun 22 '23 09:06 goergen95

I just found out this simple way to launch minio (without docker) on a linux environment:

wget https://dl.min.io/server/minio/release/linux-amd64/archive/minio_20230619195250.0.0_amd64.deb -O minio.deb
sudo dpkg -i minio.deb
mkdir ~/minio
minio server ~/minio --console-address :9090

After this, MinIO is accessible locally on the IP:port and with the credentials provided in the terminal. Similar setup procedures are available for macOS and Windows (it does not work with Windows 11, however, so I cannot test it locally).

fBedecarrats avatar Jun 22 '23 10:06 fBedecarrats

This is a complete procedure to run MinIO and access it from R.

In one terminal, run:

# Install MinIO
wget https://dl.min.io/server/minio/release/linux-amd64/archive/minio_20230619195250.0.0_amd64.deb -O minio.deb
sudo dpkg -i minio.deb
mkdir ~/minio
minio server ~/minio --console-address :9090

The terminal will remain busy as long as MinIO is running. Open another terminal and run:

# Install MinioClient
wget https://dl.min.io/client/mc/release/linux-amd64/mc
chmod +x mc
sudo mv mc /usr/local/bin/mc

# Creates an alias
mc alias set local http://127.0.0.1:9000 minioadmin minioadmin
mc admin info local

# Creates a bucket
mc mb local/mapme

# Create a test file on the local filesystem
printf "blabla\nblibli\nbloblo" >> test.txt

# Send the test file to MinIO
mc cp test.txt local/mapme/test.txt

Now, we will use R to connect to the minio server that runs on the local machine:

library(aws.s3)
library(tidyverse)


# Set environment variables that aws.s3 uses to connect to MinIO
Sys.setenv("AWS_ACCESS_KEY_ID" = "minioadmin",
           "AWS_SECRET_ACCESS_KEY" = "minioadmin",
           "AWS_DEFAULT_REGION" = "",
           "AWS_SESSION_TOKEN" = "",
           "AWS_S3_ENDPOINT"= "localhost:9000")

get_bucket("mapme", region = "", use_https = FALSE)
# Bucket: mapme 
# 
# $Contents
# Key:            test.txt 
# LastModified:   2023-06-23T11:21:09.747Z 
# ETag:           "cb7a754ec0d230b2a3e28ccb55957e6d" 
# Size (B):       20 
# Owner:          minio 
# Storage class:  STANDARD 

s3read_using(FUN = readLines,
             object = "test.txt",
             bucket = "mapme",
             opts = list("region" = "", "use_https" = "FALSE"))
# [1] "blabla" "blibli" "bloblo"
# Warning message:
#   In FUN(tmp, ...) :
#   incomplete final line found on '/tmp/RtmpCJttXA/file2504cdaa3d9.txt'

Can you please test that it works on your side @goergen95 ?

fBedecarrats avatar Jun 23 '23 11:06 fBedecarrats

I tried with some geographic data using the GDAL S3 driver, but for now it doesn't work with the local MinIO (although it works with the remote MinIO from SSP Cloud, see the example above). I think this is due to the fact that the local resource doesn't use HTTPS.

library(aws.s3)
library(tidyverse)


# Set environment variables that aws.s3 uses to connect to MinIO
Sys.setenv("AWS_ACCESS_KEY_ID" = "minioadmin",
           "AWS_SECRET_ACCESS_KEY" = "minioadmin",
           "AWS_DEFAULT_REGION" = "",
           "AWS_SESSION_TOKEN" = "",
           "AWS_S3_ENDPOINT"= "localhost:9000")

get_bucket("mapme", region = "", use_https = FALSE)
# Bucket: mapme 
# 
# $Contents
# Key:            test.txt 
# LastModified:   2023-06-23T11:21:09.747Z 
# ETag:           "cb7a754ec0d230b2a3e28ccb55957e6d" 
# Size (B):       20 
# Owner:          minio 
# Storage class:  STANDARD 

s3read_using(FUN = readLines,
             object = "test.txt",
             bucket = "mapme",
             opts = list("region" = "", "use_https" = "FALSE"))
# [1] "blabla" "blibli" "bloblo"
# Warning message:
#   In FUN(tmp, ...) :
#   incomplete final line found on '/tmp/RtmpCJttXA/file2504cdaa3d9.txt'



library(mapme.biodiversity)
library(sf)
library(terra)

# create an AOI like in package documentation
aoi <- system.file("extdata", "sierra_de_neiba_478140.gpkg", 
                        package = "mapme.biodiversity") %>%
  read_sf() %>%
  st_cast("POLYGON")

aoi_gridded <- st_make_grid(
  x = st_bbox(aoi),
  n = c(10, 10),
  square = FALSE
) %>%
  st_intersection(aoi) %>%
  st_as_sf() %>%
  mutate(geom_type = st_geometry_type(x)) %>%
  filter(geom_type == "POLYGON") %>%
  select(-geom_type, geom = x) %>%
  st_as_sf()

# get some GFC resource
sample_portfolio <- init_portfolio(aoi_gridded, years = 2010,
  outdir = ".") %>%
  get_resources("gfw_treecover")

# Copy the GFC resource to minio
put_object(file = "gfw_treecover/Hansen_GFC-2021-v1.9_treecover2000_20N_080W.tif",
           object = "gfw_treecover/Hansen_GFC-2021-v1.9_treecover2000_20N_080W.tif",
           bucket = "mapme",
           region = "", 
           use_https = FALSE)

get_bucket("mapme", region = "", use_https = FALSE)
# Bucket: mapme 
# 
# $Contents
# Key:            gfw_treecover/Hansen_GFC-2021-v1.9_treecover2000_20N_080W.tif 
# LastModified:   2023-06-23T11:39:38.021Z 
# ETag:           "ff12537644a35a34f88483b88d51e1fe" 
# Size (B):       119150611 
# Owner:          minio 
# Storage class:  STANDARD 
# 
# $Contents
# Key:            test.txt 
# LastModified:   2023-06-23T11:21:09.747Z 
# ETag:           "cb7a754ec0d230b2a3e28ccb55957e6d" 
# Size (B):       20 
# Owner:          minio 
# Storage class:  STANDARD 


my_rast <- rast("/vsis3/mapme/gfw_treecover/Hansen_GFC-2021-v1.9_treecover2000_20N_080W.tif")
# Error: [rast] file does not exist: /vsis3/mapme/gfw_treecover/Hansen_GFC-2021-v1.9_treecover2000_20N_080W.tif
# In addition: Warning message:
#   In new_CppObject_xp(fields$.module, fields$.pointer, ...) :
#   GDAL Error 11: CURL error: Could not resolve host: mapme.localhost
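
My guess is that GDAL's /vsis3/ driver defaults to HTTPS and virtual-hosted-style URLs (hence the "mapme.localhost" lookup), so something like the following configuration options would be needed for a plain-HTTP local MinIO (not tested here):

# Point GDAL at the local MinIO over plain HTTP with path-style URLs.
Sys.setenv(
  "AWS_S3_ENDPOINT"     = "localhost:9000",
  "AWS_HTTPS"           = "NO",
  "AWS_VIRTUAL_HOSTING" = "FALSE"
)
my_rast <- rast("/vsis3/mapme/gfw_treecover/Hansen_GFC-2021-v1.9_treecover2000_20N_080W.tif")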

fBedecarrats avatar Jun 23 '23 11:06 fBedecarrats

Hi! I set up a Gist using Docker consisting of an RStudio and a minio server. You might still be able to use the R script, with some adaptations, with your setup?

The results are the following: if we assume that the environment variables are set up correctly, we can use the {aws.s3} package to write data to a bucket. The GDAL driver does not have write capabilities for either raster or vector data. We can use both {aws.s3} and GDAL to read data from the minio bucket. Using GDAL, two environment variables have to be set for it to resolve the location of the bucket correctly.

Conclusion: even though we could use GDAL for reading data, it will fail if some env vars are not set up correctly. I thus opt for using {aws.s3} to set up the read/write methods required to support S3 storage types in the package.

goergen95 avatar Jun 26 '23 08:06 goergen95

{aws.s3} currently does not rely on the AWS_DEFAULT_REGION and AWS_REGION env vars by default (as per cloudyr/aws.s3#371). We would thus either need custom code to better support both AWS and minio, or we should look into alternatives (e.g. arrow).

goergen95 avatar Jun 26 '23 09:06 goergen95

{aws.s3} currently does not rely on the AWS_DEFAULT_REGION and AWS_REGION env vars by default (as per cloudyr/aws.s3#371). We would thus either need custom code to better support both AWS and minio, or we should look into alternatives (e.g. arrow).

Yep. This region thing is the usual suspect for any problem. My understanding is that {aws.s3} sets us-east-1 as the default if you don't specify it, so you need to pass region = "" in many situations (see the examples for s3read_using() or get_bucket()).

fBedecarrats avatar Jun 26 '23 09:06 fBedecarrats

Yep. This region thing is the usual suspect for any problem. My understanding is that {aws.s3} sets us-east-1 as the default if you don't specify it, so you need to pass region = "" in many situations (see the examples for s3read_using() or get_bucket()).

It's here: https://github.com/cloudyr/aws.s3/blob/master/R/get_location.R

fBedecarrats avatar Jun 26 '23 09:06 fBedecarrats

Now that we have a working reproducible workflow, and keeping in mind this question about the region, the next questions could be:

  1. how do we work on this? Shall we create a testing branch dedicated to this feature in the original repo or shall each of us work on separate forks?
  2. what do you think should be the most efficient approach to enable s3 read-write while minimizing the implications for the existing functions?

Some ideas for 2:

  • if we do not want to add arguments to init_portfolio(), should it test whether outdir is a cloud storage service and store that as an attribute of the output portfolio object (e.g. attr(x, "cloud_storage") <- cloud_storage)?
  • if so, the functions get_resources() and calc_indicators() would need a variant when it comes to writing or reading, and use s3read_using() and s3write_using() if attr(x, "cloud_storage") == TRUE. I'm mentioning these ideas, but I don't find them very satisfying, as they would overload the existing functions. Ideally, we should make this modular and have specific independent functions that handle the cloud storage specificities, while minimizing the modifications to existing functions. What do you think?

fBedecarrats avatar Jun 26 '23 09:06 fBedecarrats

Regarding 1: creating a dedicated branch in this repo is the way to go in my view. Regarding 2:

  • I would like users to specify something like outdir = "s3://<bucket-name>". We then assume that the environment variables are set up correctly and the rest should be auto-magically handled
  • for this to work, we definitely need to modularize the read/write code out of get_resources() and calc_indicators() (see the rough sketch after this list)
  • I think the best way for get_resources() to work is to download data to a temporary directory and then either push the data to outdir on the local file system or to the respective cloud storage type
  • not sure about the best way to handle reading data within calc_indicators()
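
A rough sketch (all helper names hypothetical, bucket/prefix parsing simplified) of what such a dispatcher could look like:

# Dispatch on the outdir string rather than on a new init_portfolio() argument.
.is_s3_outdir <- function(outdir) grepl("^s3://", outdir)

.write_resource <- function(local_file, outdir) {
  if (.is_s3_outdir(outdir)) {
    bucket <- sub("^s3://", "", outdir)
    aws.s3::put_object(file = local_file,
                       object = basename(local_file),
                       bucket = bucket)
  } else {
    file.copy(local_file, file.path(outdir, basename(local_file)))
  }
}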

goergen95 avatar Jun 26 '23 09:06 goergen95

Thanks. I won't be able to work on this today and tomorrow, but I think that I will be able to dedicate some time on Wednesday and Thursday.

fBedecarrats avatar Jun 26 '23 09:06 fBedecarrats

Just another note: {aws.s3} seems to be no longer actively maintained (last release on 2020-04-07). I think it is high quality, but I would not like to add an unmaintained dependency to the package. I think it would be worth investigating some alternatives.

goergen95 avatar Jun 26 '23 10:06 goergen95

{paws} is actively maintained, but maybe less mature... https://github.com/paws-r/paws

fBedecarrats avatar Jun 26 '23 10:06 fBedecarrats

The main drawback I see is that, although {paws} has many functions to interact with S3, it lacks equivalents to aws.s3::s3read_using() and aws.s3::s3write_using(). These functions are simple (see the code here), as they mostly rely on aws.s3::put_object() and aws.s3::save_object(), for which we have equivalents in {paws} (the put_object() and get_object() methods of the paws S3 client, respectively).
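
For illustration, a {paws.storage}-based equivalent could look roughly like this (the helper name and the minimal error handling are hypothetical):

# Download an object to a temporary file and apply FUN to it, mimicking
# aws.s3::s3read_using(); endpoint and credentials are taken from the environment.
s3_read_using <- function(FUN, ..., object, bucket) {
  svc  <- paws.storage::s3()
  resp <- svc$get_object(Bucket = bucket, Key = object)
  tmp  <- tempfile(fileext = paste0(".", tools::file_ext(object)))
  on.exit(unlink(tmp))
  writeBin(resp$Body, tmp)
  FUN(tmp, ...)
}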

fBedecarrats avatar Jun 26 '23 10:06 fBedecarrats

{paws} seems to be quite heavy on dependencies... maybe it's a valid alternative if we can just rely on {paws.storage}.

goergen95 avatar Jun 26 '23 10:06 goergen95