
Resumable uploads failing

Open MarkEdmondson1234 opened this issue 4 years ago • 18 comments

As reported in #120

MarkEdmondson1234 avatar May 22 '20 07:05 MarkEdmondson1234

@BillPetti wrote:

I'm facing what I think is a similar issue, but in my case the upload is actually failing. I am not asking for it to find a resumable upload, but when I try to upload an updated file it appears to find one and hangs after reading about half of the file. Then I get this message:

<- HTTP/2 408 
<- content-type: text/html; charset=UTF-8
<- referrer-policy: no-referrer
<- content-length: 1557
<- date: Sat, 16 May 2020 16:10:06 GMT
<- alt-svc: h3-27=":443"; ma=2592000,h3-25=":443"; ma=2592000,h3-T050=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q049=":443"; ma=2592000,h3-Q048=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
<- 
2020-05-16 12:10:06 -- File upload failed, trying to resume...
2020-05-16 12:10:06 -- Retry 3 of 3
Error in gcs_retry_upload(upload_url = upload_url, file = temp, type = type) : 
  Must supply either retry_object or all of upload_url, file and type
Calls: gcs_upload ... do_upload -> do_resumable_upload -> gcs_retry_upload
In addition: Warning messages:
1: No JSON content detected 
2: In doHttrRequest(req_url, shiny_access_token = shiny_access_token,  :
  API checks failed, returning request without JSON parsing
Execution halted

And here's my original call:

gcs_upload(file = r_object,
           object_function = f,
           upload_type = 'simple',
           name = 'directory/file_name')

MarkEdmondson1234 avatar May 22 '20 09:05 MarkEdmondson1234

@BillPetti could you rerun the failing upload with options(googleAuthR.verbose = 1) so we can get more logging info?

Also, what type of file are you uploading - is it a big file, and/or an R list or data.frame?
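
For reference, a minimal sketch of that debugging run, simply reusing the call from above with the verbose option set first:

options(googleAuthR.verbose = 1)

gcs_upload(file = r_object,
           object_function = f,
           upload_type = 'simple',
           name = 'directory/file_name')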

MarkEdmondson1234 avatar May 22 '20 09:05 MarkEdmondson1234

I have a similar issue with a large RDS file (9 GB) - whenever I try to upload it, I get

gcs_upload("full_IAT_data_file.RDS", name = "full_IAT_data_file.RDS", bucket = "iat_data")
2020-12-16 21:26:30 -- File size detected as 9.8 Gb
2020-12-16 21:26:30 -- Found resumeable upload URL: https://www.googleapis.com/upload/storage/v1/b/iat_data/o/?uploadType=resumable&name=full_IAT_data_file.RDS&predefinedAcl=private&upload_id=ABg5-UyYJCKTjF10-whqQa3ohDt8ELcAFPXjzxLgutIt4xjqKMPnmq99595PIRLLCf_3ZnFubw2I2NqzaJwK0oQb8oZrL5og3w
2020-12-16 21:27:55 -- File upload failed, trying to resume...
2020-12-16 21:27:55 -- Retry 3 of 3
Error: Must supply either retry_object or all of upload_url, file and type

Rerunning it with options(googleAuthR.verbose = 1) ends with

<- HTTP/2 400 
<- x-guploader-uploadid: ABg5-Uw53IDqndX7BtNbfvHuWpplSzb37rmJkv-Isl7pVy5by8rUJRuFP60ATBwSSWTVowvkwZ73Usp4GumfQ11h0XA
<- content-type: application/json; charset=UTF-8
<- date: Wed, 16 Dec 2020 20:53:19 GMT
<- vary: Origin
<- vary: X-Origin
<- cache-control: no-cache, no-store, max-age=0, must-revalidate
<- expires: Mon, 01 Jan 1990 00:00:00 GMT
<- pragma: no-cache
<- content-length: 498
<- server: UploadServer
<- 
2020-12-16 20:53:19 -- File upload failed, trying to resume...
2020-12-16 20:53:19 -- Retry 3 of 3
Error: Must supply either retry_object or all of upload_url, file and type

Given that I am trying to save from Google Cloud Engine within the same region, I thought I would give simple upload a go - however, that fails because the option needs to be specified as an integer. Any other suggestions?
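
A minimal sketch of that integer requirement, assuming the option in question is gcs_upload_set_limit()'s upload_limit argument (used later in this thread). Note that R integers top out at .Machine$integer.max (about 2.1e9), so the simple/resumable boundary cannot be raised to cover a 9.8 GB file:

# the limit must be an R integer, e.g. written with an L suffix
gcs_upload_set_limit(upload_limit = 2000000000L)  # ~2 GB, near the integer maximum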

LukasWallrich avatar Dec 16 '20 21:12 LukasWallrich

Apparently, for me the issue was that I did not choose "Fine-grained: Object-level ACLs enabled" when creating the bucket. With that, the upload seems to work now. Not sure if that is a general limitation, or because of how I created the JSON ... but all seems well for now (even though it might be worth documenting this, in case it is a common mistake people make?) Many thanks for this helpful package (and I will be back if the issue reappears :)).

LukasWallrich avatar Dec 16 '20 21:12 LukasWallrich

Thanks @LukasWallrich - this is a tricky one to pin down, as I need to find a failing example to replicate. I think in your case you were missing the new predefinedAcl parameter defined in https://github.com/cloudyr/googleCloudStorageR/issues/111

gcs_upload(mtcars, bucket = "mark-bucketlevel-acl",
                   predefinedAcl = "bucketLevel")

Perhaps I can use this to test the above retry issue :)

MarkEdmondson1234 avatar Dec 17 '20 06:12 MarkEdmondson1234

@MarkEdmondson1234 I'm having the same or a similar issue when I upload a batch of pdf files. I have a list of 500 pdf files that I upload via a for loop. Each time I do this a different subset of files fails, so I don't think it is an issue with the files themselves. You'll see in my script that I log which files fail, then run the loop again on just those (many of them upload fine on round 2), then I do a round 3. I'll also include the logs so you can see the errors.

Upload script with three rounds of uploads

library(tidyverse)
library(fs)
library(googleCloudStorageR)


my_dir <- "<your dir here>"

write(x = as.character(Sys.time()), file = paste0(my_dir, "/log.txt"), append = TRUE)

# list files for upload
my_files <- dir_ls(
  path    = here::here("downloads"),
  glob    = "*.pdf",
  recurse = TRUE
) %>% unique()

total <- length(my_files)

# gcs_create_bucket(
#  "capitol-docs",
#  project_id,
#  location      = "US",
#  storageClass  = "STANDARD",
#  predefinedAcl = "publicRead",
#  predefinedDefaultObjectAcl = "bucketOwnerFullControl"
# )

# modify boundary between simple and resumable uploads
# By default the upload_type will be 'simple' if under 5MB, 'resumable' if over 5MB. Use gcs_upload_set_limit to modify this boundary - you may want it smaller on slow connections, higher on faster connections. 'Multipart' upload is used if you provide a object_metadata.
gcs_upload_set_limit(upload_limit = 2500000L)

#options(googleAuthR.verbose = 0)




 #---- ROUND 1: TRY TO UPLOAD ALL FILES ----

# upload
for (i in seq_along(my_files)) {
  
  skip_to_next <- FALSE
  closeAllConnections()
  Sys.sleep(.5)
  message("... ", i, " of ", total, " ... trying to uplod ",  path_file(my_files[i]))
 
  tryCatch(
    
   expr = 
   {
    gcs_upload(
     file = my_files[i],
     bucket = "capitol-docs",
     name = path_file(my_files[i]),
     predefinedAcl = "bucketLevel"
    )
   },
   error = function(e) {
    message("... Upload seems to have failed for ", i, ":\n")
    write(x = paste0(my_files[i], "\n", e), file = paste0(my_dir, "/log.txt"), append = TRUE)
    skip_to_next <<- TRUE
   }
   
  )

  if(skip_to_next) { next }
  
}

# check bucket contents
#bucket_contents <- gcs_list_objects("capitol-docs")
# delete contents
#map(bucket_contents$name, gcs_delete_object, bucket = "capitol-docs")

closeAllConnections()
gc()




#---- ROUND 2: TRY FAILED FILES AGAIN ----

my_failed_files <- readr::read_lines("log.txt") %>% 
  as_tibble() %>% 
  filter(str_detect(value, "pdf$")) %>% 
  drop_na() %>% 
  pull(value)

new_total <- length(my_failed_files)

# upload
for (i in seq_along(my_failed_files)) {
  
  skip_to_next <- FALSE
  closeAllConnections()
  Sys.sleep(.5)
  message("... ", i, " of ", new_total, " ... trying to uplod ",  path_file(my_failed_files[i]))
  
  tryCatch(
    
    expr = 
      {
        gcs_upload(
          file = my_failed_files[i],
          bucket = "capitol-docs",
          name = path_file(my_failed_files[i]),
          predefinedAcl = "bucketLevel"
        )
      },
    error = function(e) {
      message("... Upload seems to have failed for ", i, ":\n")
      write(x = paste0(my_failed_files[i], "\n", e), file = paste0(my_dir, "/log2.txt"), append = TRUE)
      skip_to_next <<- TRUE
    }
    
  )
  
  if(skip_to_next) { next }
  
}




#---- ROUND 3: TRY FAILED FILES FROM ROUND 2 AGAIN ----

my_failed_files2 <- readr::read_lines("log2.txt") %>% 
  as_tibble() %>% 
  filter(str_detect(value, "pdf$")) %>% 
  drop_na() %>% 
  pull(value)

new_total2 <- length(my_failed_files2)

# upload
for (i in seq_along(my_failed_files2)) {
  
  skip_to_next <- FALSE
  closeAllConnections()
  Sys.sleep(.5)
  message("... ", i, " of ", new_total2, " ... trying to uplod ",  path_file(my_failed_files2[i]))
  
  tryCatch(
    
    expr = 
      {
        gcs_upload(
          file = my_failed_files2[i],
          bucket = "capitol-docs",
          name = path_file(my_failed_files2[i]),
          predefinedAcl = "bucketLevel"
        )
      },
    error = function(e) {
      message("... Upload seems to have failed for ", i, ":\n")
      write(x = paste0(my_failed_files2[i], "\n", e), file = paste0(my_dir, "/log3.txt"), append = TRUE)
      skip_to_next <<- TRUE
    }
    
  )
  
  if(skip_to_next) { next }
  
}

Logs

Log 1

2021-03-25 16:44:00
/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/anderson_john_steven/anderson_indictment.pdf
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^


/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/caldwell_thomas_edward/caldwell_crowl_watkins_parker_parker_young_st.pdf
Error: Must supply either retry_object or all of upload_url, file and type

/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/caldwell_thomas_edward/caldwell_et_al_indictment.pdf
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^


/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/ciarpelli_albert_a/ciarpelli_statement_of_facts.pdf
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^


/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/crowl_donovan_ray/watkins_crowl_and_caldwell_indictment.pdf
Error: Must supply either retry_object or all of upload_url, file and type

/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/cudd_jenny_louise/cudd_rosa_statement_of_facts.pdf
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^


/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/evans_iii_treniss_jewell/evans_iii_statement_of_facts.pdf
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^


/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/fairlamb_scott_kevin/fairlamb_indictment.pdf
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^


/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/fairlamb_scott_kevin/fairlamb_scott_complaint_and_affidavit.pdf
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^


/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/griffin_couy/griffin_affidavit.pdf
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^


/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/griffin_couy/griffin_complaint.pdf
Error: Request failed before finding status code: HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)

/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/johnson_adam/johnson_statement_of_facts.pdf
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^


/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/montgomery_patrick/montgomery_complaint.pdf
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^


/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/montoni_corinne/montoni_affidavit.pdf
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^


/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/nalley_verden_andrew/calhoun_and_nalley_indictment.pdf
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^


/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/nichols_ryan/nichols_complaint.pdf
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^


/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/nordean_ethan_aka_ruffio_panman/nordean_complaint_and_affidavit.pdf
Error: Must supply either retry_object or all of upload_url, file and type

/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/norwood_iii_william_robert/norwood_complaint.pdf
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^


/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/norwood_iii_william_robert/norwood_indictment.pdf
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^


/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/packer_robert_keith/packer_statement_of_facts.pdf
Error: Must supply either retry_object or all of upload_url, file and type

Log 2

/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/caldwell_thomas_edward/caldwell_et_al_indictment.pdf
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^


/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/ciarpelli_albert_a/ciarpelli_statement_of_facts.pdf
Error: lexical error: invalid char in json text.
                                       <!DOCTYPE html> <html lang=en> 
                     (right here) ------^

Log 3

/Users/jeremyallen/Dropbox/Data/capitol-attack/downloads/caldwell_thomas_edward/caldwell_et_al_indictment.pdf
Error: Request failed before finding status code: HTTP/2 stream 0 was not closed cleanly: PROTOCOL_ERROR (err 1)

jeremy-allen avatar Mar 26 '21 19:03 jeremy-allen

This upload should work. I should at least add better logging, such as the status code (you can see this via options(googleAuthR.verbose = 2)).

Do you need the PDFs uploaded as separate files? Just to work around your particular issue, you could look at gcs_save_all(), which zips a folder and uploads that instead.
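
For example (a sketch; the directory and bucket names are taken from your script above, and this assumes gcs_save_all() zips the directory before uploading it as one object):

options(googleAuthR.verbose = 2)

# zip the whole downloads folder and upload the archive as a single object
gcs_save_all(directory = here::here("downloads"),
             bucket    = "capitol-docs")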

MarkEdmondson1234 avatar Mar 27 '21 07:03 MarkEdmondson1234

This upload should work. I should at least add better logging, such as the status code (you can see this via options(googleAuthR.verbose = 2)).

Do you need the PDFs uploaded as separate files? Just to work around your particular issue, you could look at gcs_save_all(), which zips a folder and uploads that instead.

I'll try the more verbose logging. I'll try a zip file, too.

jeremy-allen avatar Mar 28 '21 18:03 jeremy-allen

I've also been plagued by my uploads hanging. Finally found the solution today.

It seems that when you upload a file like gcs_upload(file = "foo.rds", bucket = "mybucket"), the file is automatically classified with "Restricted Access". You can see this under the Public Access column of the bucket list view in Google Cloud Storage.

Once this happens, you cannot overwrite the file (or at least I couldn't). For me, every attempt to overwrite the file using the same call to gcs_upload(file = "foo.rds", bucket = "mybucket") resulted in R hanging, waiting on a response.

The trick was to delete the file, then re-upload it using gcs_upload(file = "foo.rds", bucket = "mybucket", predefinedAcl = "bucketLevel"), in which case the Public Access is classified as "Not Public". At this point, I am able to overwrite foo.rds using the same call to gcs_upload(file = "foo.rds", bucket = "mybucket", predefinedAcl = "bucketLevel").
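
In code, the workaround was roughly this (a sketch with the same placeholder names; gcs_delete_object() removes the stuck object first):

# delete the object that was created with the restrictive object-level ACL
gcs_delete_object("foo.rds", bucket = "mybucket")

# re-upload with the bucket-level ACL; subsequent overwrites with this same
# call then succeed
gcs_upload(file = "foo.rds", bucket = "mybucket", predefinedAcl = "bucketLevel")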

[Screenshot: Google Cloud Storage bucket list view showing the Public Access column]

ben519 avatar Jun 16 '21 17:06 ben519

Ooooh thanks, that makes sense - so the resumable upload needs to have the same ACL permissions as the original upload, which would explain an uptick of these reports when GCS brought in bucket-level ACLs vs object-level.

Is there a change in the code that can be made to make this easier to avoid?

MarkEdmondson1234 avatar Jun 16 '21 18:06 MarkEdmondson1234

Perhaps predefinedAcl = "bucketLevel" should be the default? Not sure what the implications of this would be.
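
In the meantime, a thin wrapper in user code is one way to get that default (a sketch, not part of the package API; gcs_upload_bl is a hypothetical helper name):

# hypothetical helper: default every upload to the bucket-level ACL
gcs_upload_bl <- function(..., predefinedAcl = "bucketLevel") {
  googleCloudStorageR::gcs_upload(..., predefinedAcl = predefinedAcl)
}

gcs_upload_bl(file = "foo.rds", bucket = "mybucket")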

ben519 avatar Jun 16 '21 18:06 ben519

I finally got a situation where I could make it fail and found a bug in the retry check, so it should at least attempt a retry now.

MarkEdmondson1234 avatar Jan 03 '22 13:01 MarkEdmondson1234

Looks like I've run into this issue while using targets. Most targets were succeeding with repository = 'gcp' and the default predefined_acl = 'private', but larger files were failing unless I set predefined_acl = 'bucketLevel'.
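
In the pipeline, that setting was applied roughly like this (a sketch, assuming the predefined_acl mentioned above is passed via targets' tar_resources_gcp(); the bucket and prefix names are placeholders):

library(targets)
tar_option_set(
  repository = "gcp",
  resources  = tar_resources(
    gcp = tar_resources_gcp(
      bucket         = "my-default-bucket",
      prefix         = "targets-store",      # placeholder prefix
      predefined_acl = "bucketLevel"
    )
  )
)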

googleCloudStorageR-specific reprex included below

Setup

Standard Bucket, europe-west2-b region, Uniform access control, No public access, no versioning.

CentOS 7 in GCP
R: 4.2.0
targets: f37af16
stantargets: 4ee5367
CmdStan: 2.30.0
CmdStanR: 0.5.2.1
googleCloudStorageR: 0.7.0.9000 (updated after posting the targets issue reprex)

Reprex

readRenviron('my_gcs.env')
library(googleCloudStorageR)
#> ✔ Setting scopes to https://www.googleapis.com/auth/devstorage.full_control and https://www.googleapis.com/auth/cloud-platform
#> ✔ Successfully auto-authenticated via my-server-key.json
#> ✔ Set default bucket name to 'my-default-bucket'
my_bucket <- "my-default-bucket"
# Create 5.7MB csv file
payload<-as.data.frame(matrix(rep(1, 3e6), nrow = 1e3))
write.csv(payload, tmpfile<-tempfile())

googleCloudStorageR::gcs_upload(tmpfile, bucket = my_bucket)
#> ℹ 2022-07-22 16:09:20 > File size detected as 5.7 Mb
#> ℹ 2022-07-22 16:09:20 > Found resumeable upload URL:  https://storage.googleapis.com/upload/storage/v1/b/my-default-bucket/o/?uploadType=resumable&name=tmpfile&predefinedAcl=private&upload_id=ADPycdu_o6vVcIQm5iH3g4JtJV5g6LCGPD3b6R9F5y2aZdUl7azw6ovQb1Af9xh4qMIyCapT-GhoRuN-S5-Iep4-h95tS68RC1C7
#> ℹ 2022-07-22 16:09:21 > File upload failed, trying to resume...
#> ℹ 2022-07-22 16:09:21 > Retry 3 of 3
#> Error in value[[3L]](cond): Couldn't get upload status

Created on 2022-07-22 by the reprex package (v2.0.1)

Comments

This looks like it may be a long-standing problem, so perhaps it is tough to resolve for all use cases? What would be a reasonable resolution?

Should the targets default permissions be 'private' or 'bucketLevel'? Perhaps the success of the other files uploaded with the default acl = 'private' is actually the bug? If it's not readily resolved at this end, perhaps it would be wise to give some guidance in the targets manual @wlandau - it's unexpected for a user, since most targets complete successfully with acl = 'private', so people will end up getting frustrated when targets fail only occasionally.

stuvet avatar Jul 22 '22 16:07 stuvet

The default should be bucket level, I think, since it's by far the most convenient - the GCP interface nudges you in that direction when creating a bucket. That level of access is newer, though, which is why it wasn't the default before. There is some logic to retry the fetch with bucket-level permissions upon failure, since it's so common, so I wonder why it hasn't triggered in your case.

I'm finishing writing a book at the moment so am behind on issues.

MarkEdmondson1234 avatar Jul 22 '22 16:07 MarkEdmondson1234

Ok, the logic to retry with bucket-level permissions is only in place for getting objects, not for uploading them.
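
For reference, the kind of fallback being discussed could look roughly like this on the upload side (a user-level sketch, not the package's actual internals):

upload_with_fallback <- function(file, bucket, name = basename(file)) {
  tryCatch(
    gcs_upload(file = file, bucket = bucket, name = name),
    error = function(e) {
      message("Upload failed; retrying with predefinedAcl = 'bucketLevel'")
      gcs_upload(file = file, bucket = bucket, name = name,
                 predefinedAcl = "bucketLevel")
    }
  )
}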

MarkEdmondson1234 avatar Jul 22 '22 16:07 MarkEdmondson1234

I'll take a look at the retry logic and see if I can figure it out. It's the least I can do for all the hard work you put into the targets integration - I was previously using GCP for everything but targets, so I had to add AWS->GCP steps within the pipelines - annoying!

stuvet avatar Jul 22 '22 16:07 stuvet

Much appreciated! And glad to see the targets integration with GCP being used in the wild.

MarkEdmondson1234 avatar Jul 22 '22 17:07 MarkEdmondson1234

Using the predefinedAcl option, gcs_upload(predefinedAcl = "bucketLevel"), solves it for me.
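
For completeness, the full call looks like this (file and bucket names are placeholders):

gcs_upload(file = "foo.rds", bucket = "mybucket", predefinedAcl = "bucketLevel")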

Kvit avatar Sep 02 '22 04:09 Kvit