GCE implementation does not support user pays buckets

Open d-cameron opened this issue 4 years ago • 5 comments

I am attempting to read a BAM file stored in a GCE bucket that is configured as user pays, and I get the following error:

$ samtools view -H gs://bucket/file.bam
[E::hts_open_format] Failed to open file gs://bucket/file.bam
samtools view: failed to open "gs://bucket/file.bam" for reading: Invalid argument

Recompiling with a high level of hts_verbose (is there a better way to debug this?) dumps out:

> GET /file.bam HTTP/1.1
User-Agent: htslib/1.9 libcurl/7.29.0
Host: bucket.storage-download.googleapis.com
Accept: */*
Authorization: Bearer <GCS_OAUTH_TOKEN was here>
 
< HTTP/1.1 400 Bad Request
< X-GUploader-UploadID: <snip>
< Content-Type: application/xml; charset=UTF-8
< Content-Length: 266
< Date: Wed, 27 Nov 2019 13:09:53 GMT
< Expires: Wed, 27 Nov 2019 13:09:53 GMT
< Cache-Control: private, max-age=0
< Server: UploadServer
< Alt-Svc: quic=":443"; ma=2592000; v="46,43",h3-Q050=":443"; ma=2592000,h3-Q049=":443"; ma=2592000,h3-Q048=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000

According to the GCE documentation (https://cloud.google.com/storage/docs/json_api/v1/parameters), I need an extra header, X-Goog-User-Project. A manual curl request with this custom header added succeeds:

curl -H "X-Goog-User-Project: payingproject" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://bucket.storage-download.googleapis.com/file.bam
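
For reference, the same request can be assembled in Python. This is only a sketch of the curl command above, not something htslib does: the helper name `gcs_request_parts` is made up for illustration, and the bucket, object, project and token values are the placeholders used in this post.

```python
from urllib.parse import quote


def gcs_request_parts(bucket, object_name, token, user_project=None):
    """Return (url, headers) mirroring the manual curl request above.

    Hypothetical helper for illustration; not part of htslib or any Google SDK.
    """
    # Same host pattern as in the verbose output:
    # <bucket>.storage-download.googleapis.com
    url = f"https://{bucket}.storage-download.googleapis.com/{quote(object_name)}"
    headers = {"Authorization": f"Bearer {token}"}
    if user_project:
        # Names the project to bill when the bucket is user pays.
        headers["X-Goog-User-Project"] = user_project
    return url, headers


url, headers = gcs_request_parts("bucket", "file.bam", "TOKEN", "payingproject")
print(url)
print(headers)
```

To actually send it, the pair can be passed to `urllib.request.Request(url, headers=headers)` and opened with `urllib.request.urlopen`, with a real token in place of `TOKEN`.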

How can I inject a custom HTTP header into the htslib GCE request?

I see three potential approaches to fixing this:

  1. Adding a full GCE API to htslib. Seems overkill.
  2. Allowing arbitrary HTTP headers to be injected into the GCE request.
  3. Documenting some libcurl configuration magic that will add an HTTP header to all libcurl requests (e.g. environment variables or a .netrc equivalent for headers). I couldn't find anything, but that doesn't mean it's not there.
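
As a rough illustration of option 2, the sketch below merges user-supplied `Name: value` header lines into the headers a request already carries. It is written in Python for brevity rather than as an htslib patch; `merge_extra_headers` and its input format are assumptions made for the example, not an existing htslib interface.

```python
def merge_extra_headers(base_headers, extra_lines):
    """Return a copy of base_headers with 'Name: value' lines merged in.

    Hypothetical helper: htslib does not expose this function.
    """
    merged = dict(base_headers)
    for line in extra_lines:
        name, sep, value = line.partition(":")
        if not sep or not name.strip():
            raise ValueError(f"not a 'Name: value' header line: {line!r}")
        # A later line with the same name overrides an earlier value.
        merged[name.strip()] = value.strip()
    return merged


base = {"Authorization": "Bearer TOKEN"}
merged = merge_extra_headers(base, ["X-Goog-User-Project: payingproject"])
print(merged)
```

Rejecting malformed lines up front keeps a typo in the configuration from silently producing a request without the billing header.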

d-cameron avatar Nov 28 '19 02:11 d-cameron

Increasing hts_verbose is indeed the way to debug this; htsfile -v can do this conveniently, as can samtools --verbosity INT on develop.

See also #346 for thoughts on configuring S3 user-pays via HTSlib's S3 configuration file and/or variables. As so often, there is a branch to dust off (after the current dust has settled). IMHO it would be good for hfile_gcs.c to be able to generate user-pays headers similarly.

jmarshall avatar Nov 28 '19 08:11 jmarshall

For others encountering this issue: a workaround is to mount the bucket using gcsfuse so that it presents as a normal file system.

d-cameron avatar Nov 29 '19 21:11 d-cameron

I needed access to a Google Cloud Platform (GCP) storage bucket that contains a large number of CRAMs and has the requester pays feature enabled. I found gcsfuse sluggish on a bucket this large, so I added a GCS requester pays feature to htslib to meet my immediate needs.

I'm not sure if this meets the community standards for this feature, but I'm willing to amend it as necessary. Any feedback would be appreciated!

indraniel avatar Mar 14 '21 18:03 indraniel

@indraniel @d-cameron, I tried using the GCS_REQUESTER_PAYS_PROJECT=my-project-name option, but I still get the same error: Invalid argument. I tried both the project name and the project ID, with the same result.

The last lines when using the --verbosity option are:

authorization: Bearer < my GCS_OAUTH_TOKEN >

* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 400 
< x-guploader-uploadid: ADPycdu-ouZqyc1jYHhl__xxxxx
< content-type: application/xml; charset=UTF-8
< content-length: 266
< date: Tue, 03 Aug 2021 16:25:01 GMT
< expires: Tue, 03 Aug 2021 16:25:01 GMT
< cache-control: private, max-age=0
< server: UploadServer
< alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
<
* Connection #0 to host w01010646.storage-download.googleapis.com left intact
[main_samview] fail to read the header from "gs://w01010646/XX.cram".
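
One thing that can be checked from output like this is whether the requester pays header was sent at all. The sketch below scans captured verbose text for an `X-Goog-User-Project` line, case-insensitively because HTTP/2 output shows header names in lower case. The function name and the idea of holding the log in a string are assumptions for illustration; the header name is the one from the curl command earlier in this thread.

```python
def sent_user_project(verbose_log):
    """Return the X-Goog-User-Project value found in a verbose log, or None."""
    for raw in verbose_log.splitlines():
        # Drop the '>', '<' and '*' markers that verbose output may prepend.
        line = raw.lstrip("><* \t")
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "x-goog-user-project":
            return value.strip()
    return None


log = """authorization: Bearer TOKEN
< HTTP/2 400
< content-type: application/xml; charset=UTF-8
"""
print(sent_user_project(log))
```

A result of `None` would suggest the header never left the client, which points at configuration rather than at the bucket or the project name.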

It is also a storage bucket with the requester pays feature enabled (possibly the same one, or a similar one). Did you manage to use samtools view on it directly?

migrau avatar Aug 03 '21 16:08 migrau

@migrau, if you want us to look at the problem, could you open a new issue? This one is from 2019.

whitwham avatar Aug 04 '21 13:08 whitwham