[query][qob] implement requester pays in GoogleStorageFS
This enables Query-on-Batch pipelines to read from requester pays buckets.
@tpoterba curious for your thoughts on the flag situation. I suspect this PR will induce the Australians to start including requester pays config in their pipelines. If you describe an API you like, I can implement it for this PR.
Otherwise, I think this is ready. It works and it is tested. The changes to GoogleStorageFS suck, but that's due to the reality of the GCS API.
Also, I'm gonna add parameters to hl.init and set the flags in there. That punts the interface decision down the road by slightly restricting users (you can't change the user project mid-pipeline).
OK, I don't love specifying it in hl.init, but that is what is implemented here. Assuming the tests pass, this should work. It also means that users don't have to restart their Dataproc clusters to turn on requester pays: they can hl.stop and hl.init again.
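One possible shape for the hl.init parameters (every name below is hypothetical; this is a sketch of the interface under discussion, not a settled API):

```python
import hail as hl

# Hypothetical parameter names; the actual spelling is whatever
# this PR settles on.
hl.init(
    gcs_requester_pays_project='my-billing-project',  # project billed for access
    gcs_requester_pays_buckets=['some-bucket'],       # optionally restrict to known buckets
)

# Because the setting is fixed at init time, changing the billed
# project mid-pipeline requires a restart:
hl.stop()
hl.init(gcs_requester_pays_project='other-billing-project')
```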
bump
Disregard, this still seems broken.
The layers of wtf really seem to have no end here.
Hadoop at least appears to include the configuration in the cache key for its FileSystem cache, but the configuration is actually just ignored by the constructor. Ergo, even if you stop the Hail context and start a new Hail context with a new Hadoop Configuration, you'll get a filesystem configured by the first configuration.
I'm looking for a way around this now.
OK, I added some code to empty the cache when we hl.stop.
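A toy model of the pitfall and the fix, in pure Python (names and structure are illustrative, not Hadoop's actual code; the real cache lives in org.apache.hadoop.fs.FileSystem, which can be flushed with FileSystem.closeAll()):

```python
# Toy model: a FileSystem cache whose entries outlive the configuration
# they were built under.
CURRENT_CONFIG = {}
_fs_cache = {}

class ToyFS:
    def __init__(self, scheme):
        # The constructor captures whatever configuration exists at first
        # construction; later configuration changes never reach it.
        self.scheme = scheme
        self.project = CURRENT_CONFIG.get("requester_pays_project")

def get_fs(scheme):
    # A cache hit returns the FS built under the *first* configuration.
    if scheme not in _fs_cache:
        _fs_cache[scheme] = ToyFS(scheme)
    return _fs_cache[scheme]

def stop():
    # The workaround: empty the cache on hl.stop so the next hl.init
    # constructs a fresh, correctly configured filesystem.
    _fs_cache.clear()

CURRENT_CONFIG["requester_pays_project"] = "project-a"
fs1 = get_fs("gs")
CURRENT_CONFIG["requester_pays_project"] = "project-b"
fs2 = get_fs("gs")
print(fs1.project, fs2.project)  # both "project-a": the new config was ignored
stop()
fs3 = get_fs("gs")
print(fs3.project)  # "project-b": clearing the cache picks up the new config
```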
The failure here:
E org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 2.0 failed 1 times, most recent failure: Lost task 6.0 in stage 2.0 (TID 8) (hostname-c5956f6f02 executor driver): java.io.EOFException: Invalid seek offset: position value (6) must be between 0 and 6 for 'gs://hail-services-requester-pays/hello'
E at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.validatePosition(GoogleCloudStorageReadChannel.java:665)
E at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.position(GoogleCloudStorageReadChannel.java:546)
E at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.seek(GoogleHadoopFSInputStream.java:178)
E at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65)
I hit this same error in Avro/GVS work recently -- I think the Google Hadoop connector is wrong here: it rejects seeking to the end of a file (position N, where N is the number of bytes in the file).
Truly a cursed PR. I added a new file with 10 rows so we don't have any empty partitions.
FWIW, Hadoop documents this behavior, even though it seems out of step with other implementations like Java's FileInputStream. As for what happens in, say, Linux: POSIX lseek explicitly allows setting the offset at or beyond the end of the file (reads there just return end-of-file), so it is fine there.
Seeking to the end of a file is valid in multiple other implementations (I checked Rust and CPython).
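The CPython behavior is easy to demonstrate: seeking to position N of an N-byte file succeeds, both in memory and on a real OS file, and a subsequent read just returns empty.

```python
import io
import os
import tempfile

data = b"hello\n"  # 6 bytes, like the gs://.../hello object in the trace above

# In-memory file: seeking to position 6 (== size) is legal.
buf = io.BytesIO(data)
buf.seek(len(data))
assert buf.read() == b""

# Real OS file: POSIX lseek allows offsets at (and even past) end of file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(data)
    path = f.name
with open(path, "rb") as f:
    f.seek(len(data))       # seek to EOF: no error
    assert f.read() == b""
    f.seek(len(data) + 10)  # seeking past EOF is also allowed
    assert f.read() == b""
os.remove(path)
print("seek to EOF is fine")
```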