htslib
GCS implementation does not support user-pays buckets
I am attempting to read from a BAM file stored in a GCS bucket that is configured as requester pays (user pays), and I get the following error:
$ samtools view -H gs://bucket/file.bam
[E::hts_open_format] Failed to open file gs://bucket/file.bam
samtools view: failed to open "gs://bucket/file.bam" for reading: Invalid argument
Recompiling with a higher hts_verbose level (is there a better way to debug this?) dumps out:
> GET /file.bam HTTP/1.1
User-Agent: htslib/1.9 libcurl/7.29.0
Host: bucket.storage-download.googleapis.com
Accept: */*
Authorization: Bearer <GCS_OAUTH_TOKEN was here>
< HTTP/1.1 400 Bad Request
< X-GUploader-UploadID: <snip>
< Content-Type: application/xml; charset=UTF-8
< Content-Length: 266
< Date: Wed, 27 Nov 2019 13:09:53 GMT
< Expires: Wed, 27 Nov 2019 13:09:53 GMT
< Cache-Control: private, max-age=0
< Server: UploadServer
< Alt-Svc: quic=":443"; ma=2592000; v="46,43",h3-Q050=":443"; ma=2592000,h3-Q049=":443"; ma=2592000,h3-Q048=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000
According to the GCS documentation (https://cloud.google.com/storage/docs/json_api/v1/parameters), I need an extra header. Adding this custom header to a manual curl request succeeds.
curl -H "X-Goog-User-Project: payingproject" -H "Authorization: Bearer $(gcloud auth print-access-token)" https://bucket.storage-download.googleapis.com/file.bam
How can I inject a custom HTTP header into the htslib GCS request?
I see three potential approaches to fixing this:
- Adding a full GCS API to htslib; this seems like overkill.
- Allowing arbitrary HTTP headers to be injected into the GCS request.
- Documenting some libcurl configuration magic that adds an HTTP header to all libcurl requests (e.g. environment variables, or a .netrc equivalent for headers). I couldn't find anything, but that doesn't mean it's not there.
Increasing hts_verbose is indeed the way to debug this; htsfile -v can do this conveniently, as can samtools --verbosity INT on develop.
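For reference, a sketch of both approaches; the bucket path is a placeholder, and the availability of --verbosity depends on your samtools version:

```shell
# -v is repeatable; each use raises hts_verbose by one level
# (guarded so the snippet is harmless if the tools are absent):
if command -v htsfile >/dev/null 2>&1; then
    htsfile -vvvv gs://bucket/file.bam
fi

# Newer samtools releases also accept a numeric --verbosity level:
if command -v samtools >/dev/null 2>&1; then
    samtools view --verbosity 8 -H gs://bucket/file.bam
fi
```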
See also #346 for thoughts on configuring S3 user pays via HTSlib's S3 configuration file and/or environment variables. As so often, there is a branch to dust off (once the current dust has settled). IMHO it would be good for hfile_gcs.c to be able to generate user-pays headers in a similar way.
For others encountering this issue: a workaround is to mount the bucket using gcsfuse so it presents as a normal file system.
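For a requester-pays bucket, gcsfuse also needs to be told which project to bill. Roughly (the bucket, project, and mount point below are placeholders):

```shell
# Placeholder mount point:
mkdir -p "$HOME/bucket-mnt"

# --billing-project names the project charged for requester-pays
# access (guarded in case gcsfuse is not installed):
if command -v gcsfuse >/dev/null 2>&1; then
    gcsfuse --billing-project payingproject bucket "$HOME/bucket-mnt"
fi
```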
I needed access to a Google Cloud Platform (GCP) storage bucket, containing a large number of CRAMs, that had the requester pays feature enabled. I found that gcsfuse performed sluggishly on this large bucket, so I added a GCS requester pays feature to htslib to resolve my immediate needs.
I'm not sure if this meets the community standards for this feature, but I'm willing to amend it as necessary. Any feedback would be appreciated!
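With that patch applied, usage looks roughly like this (the project, bucket, and file names are placeholders; the header name comes from the curl workaround above):

```shell
# Project to bill for requester-pays access (placeholder name):
export GCS_REQUESTER_PAYS_PROJECT=payingproject

# HTSlib then sends the X-Goog-User-Project header with its GCS
# requests (guarded in case samtools is not installed):
if command -v samtools >/dev/null 2>&1; then
    samtools view -H gs://bucket/file.cram
fi
```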
@indraniel @d-cameron, I tried using the GCS_REQUESTER_PAYS_PROJECT=my-project-name option, but I end up with the same error: Invalid argument. I tried with both the project name and the project ID, with the same result.
The last lines when using the --verbosity option are:
authorization: Bearer < my GCS_OAUTH_TOKEN >
* Connection state changed (MAX_CONCURRENT_STREAMS == 100)!
< HTTP/2 400
< x-guploader-uploadid: ADPycdu-ouZqyc1jYHhl__xxxxx
< content-type: application/xml; charset=UTF-8
< content-length: 266
< date: Tue, 03 Aug 2021 16:25:01 GMT
< expires: Tue, 03 Aug 2021 16:25:01 GMT
< cache-control: private, max-age=0
< server: UploadServer
< alt-svc: h3=":443"; ma=2592000,h3-29=":443"; ma=2592000,h3-T051=":443"; ma=2592000,h3-Q050=":443"; ma=2592000,h3-Q046=":443"; ma=2592000,h3-Q043=":443"; ma=2592000,quic=":443"; ma=2592000; v="46,43"
<
* Connection #0 to host w01010646.storage-download.googleapis.com left intact
[main_samview] fail to read the header from "gs://w01010646/XX.cram".
It is a storage bucket with the requester pays feature enabled too (I guess we may be talking about the same or a similar one). Did you manage to use samtools view directly?
@migrau: if you want us to look at the problem, could you open a new issue? This one is from 2019.