Add paginators

Open davidkretch opened this issue 7 years ago • 4 comments

davidkretch avatar Nov 17 '18 16:11 davidkretch

Hi colleagues,

It seems I have run into the paginator problem myself :)

I was trying to retrieve all of my historical training jobs:

sm_client <- paws::sagemaker(config = list(region = 'myregion'))
total_training_jobs <- list()
j <- 1
sequence_var <- seq.POSIXt(from = as.POSIXct("2020-04-01 00:00:00"), to=as.POSIXct("2020-11-20 00:00:00"), by="hour")
for(i in sequence_var){
total_training_jobs[[j]] <- sm_client$list_training_jobs(MaxResults=100, CreationTimeAfter = i)
j <- j+1
}  

And I got a nice 400 ThrottlingException.

Has anyone tried a workaround?

BR /E

edgBR avatar Nov 19 '20 16:11 edgBR

Hey, sorry about that. I'll look into this over the weekend. To my knowledge, the usual approach is to delay requests by some amount.
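
As a rough illustration of delaying requests, a minimal sketch might simply pause between successive calls. Here call_with_delay, its delay argument, and fake_api_call are hypothetical names, not part of paws, and a suitable delay depends on the service's rate limits.

```r
# Hypothetical helper (not part of paws): apply `fn` to each input,
# sleeping `delay` seconds between consecutive calls to reduce throttling.
call_with_delay <- function(inputs, fn, delay = 0.5) {
  out <- list()
  for (k in seq_along(inputs)) {
    if (k > 1) Sys.sleep(delay)
    out <- c(out, list(fn(inputs[[k]])))
  }
  out
}

# Stand-in for an API call so the sketch runs without AWS credentials.
fake_api_call <- function(x) list(input = x)
responses <- call_with_delay(1:3, fake_api_call, delay = 0.1)
```

A fixed delay is the simplest option; backing off only after an error keeps the common case fast.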

davidkretch avatar Nov 20 '20 15:11 davidkretch

I put together this attempt at a paginator. You supply it with your AWS API call as the argument to parameter f and it will take care of fetching each page of results and returning them as a list. Below this function is an example call. Let me know if this helps or not.

# Get all pages of a given API call, retrying with exponential backoff.
paginate <- function(f, max_retries = 5) {
  # Capture the unevaluated call and the caller's environment so that
  # later pages are evaluated where the original call was written.
  call <- substitute(f)
  env <- parent.frame()
  resp <- f
  result <- list(resp)
  while ("NextToken" %in% names(resp) && length(resp$NextToken) > 0 && resp$NextToken != "") {
    call$NextToken <- resp$NextToken
    # Retry with exponential backoff.
    # See https://docs.aws.amazon.com/general/latest/gr/api-retries.html.
    # See also https://github.com/paws-r/paws/blob/main/examples/error_handling.R.
    retries <- 0
    repeat {
      resp <- tryCatch(eval(call, env), error = function(e) e)
      if (!inherits(resp, "error")) break
      # Give up and re-raise the error once the retries are used up.
      if (retries >= max_retries) stop(resp)
      Sys.sleep(2^retries / 10)
      retries <- retries + 1
    }
    result <- c(result, list(resp))
  }
  return(result)
}

For an example, see below (using CloudWatch instead of SageMaker in my case). In your case, you'll need to modify the call to use a fixed creation time, e.g. sm_client$list_training_jobs(MaxResults=100, CreationTimeAfter = as.POSIXct("2020-04-01 00:00:00")). With a fixed creation time, the API will split the results into pages and the paginator will fetch each one (hopefully) up to the present.

results <- paginate(
  cw$get_metric_data(
    MetricDataQueries = metric_data_queries,
    StartTime = as.POSIXct("2020-01-01"),
    EndTime = as.POSIXct("2020-11-22")
  )
)
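
Since the paginator returns one list element per page, the pages usually need to be combined afterwards. Here is a minimal sketch using made-up pages shaped like list_training_jobs responses. The TrainingJobSummaries and TrainingJobName field names are taken from the SageMaker API as I understand it; the job names and tokens are invented for illustration.

```r
# Mock pages standing in for the list returned by the paginator.
pages <- list(
  list(
    TrainingJobSummaries = list(
      list(TrainingJobName = "job-a"),
      list(TrainingJobName = "job-b")
    ),
    NextToken = "token-1"
  ),
  list(
    TrainingJobSummaries = list(list(TrainingJobName = "job-c")),
    NextToken = character(0)
  )
)

# Concatenate the per-page summaries into one flat list of jobs.
all_jobs <- do.call(c, lapply(pages, function(p) p$TrainingJobSummaries))

# Pull out one field per job as a character vector.
job_names <- vapply(all_jobs, function(j) j$TrainingJobName, character(1))
```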

davidkretch avatar Nov 22 '20 22:11 davidkretch

Of course,

How bad of me to have overlooked the next token workaround.

The solution is working perfectly @davidkretch, thanks for that!

BR

edgBR avatar Nov 23 '20 10:11 edgBR

@davidkretch @adambanker

For paginators, I am toying with the idea of an apply method.

First, we have the standard paginator, which loops over every token.

library(paws.common)

s3 <- paws::s3()

out <- paginate(
  s3$list_objects_v2(
    Bucket = "my_bucket"
  )
)

Second, we have the apply "family" of paginators, which let users apply a function to each response from the operation.

Basic example:

out <- paginate_lapply(
  s3$list_objects_v2(
    Bucket = "my_bucket"
  ),
  \(resp) {
    resp$Contents
  }
)
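
To make the idea concrete without calling AWS, here is a rough sketch of what an apply-style paginator could do internally. paginate_lapply_sketch, fetch_page, and the mock pages are hypothetical stand-ins, not the paws.common implementation.

```r
# Hypothetical sketch: fetch pages one at a time and apply FUN to each.
# `fetch_page(token)` must return a list with an optional NextToken element.
paginate_lapply_sketch <- function(fetch_page, FUN) {
  out <- list()
  token <- NULL
  repeat {
    resp <- fetch_page(token)
    out <- c(out, list(FUN(resp)))
    token <- resp$NextToken
    # Stop when the response carries no usable token.
    if (is.null(token) || length(token) == 0 || !nzchar(token)) break
  }
  out
}

# Mock operation with two pages, keyed by the token that requests them.
mock_pages <- list(
  start = list(Contents = list("a.txt", "b.txt"), NextToken = "p2"),
  p2    = list(Contents = list("c.txt"))
)
fetch_mock <- function(token) {
  mock_pages[[if (is.null(token)) "start" else token]]
}

out <- paginate_lapply_sketch(fetch_mock, function(resp) resp$Contents)
```

Applying the function page by page means only the extracted part of each response has to be kept in memory.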

What are your thoughts on this? I would like your feedback before I go too far down the rabbit hole 😆

DyfanJones avatar Jul 25 '23 19:07 DyfanJones

paws v0.4.0 has now been released to CRAN. I will close this ticket for now.

DyfanJones avatar Sep 15 '23 16:09 DyfanJones