
[query][qob] implement requester pays in GoogleStorageFS

Open danking opened this issue 1 year ago • 11 comments

This enables Query-on-Batch pipelines to read from requester pays buckets.

@tpoterba curious for your thoughts on the flag situation. I suspect this PR will induce the Australians to start including requester pays config in their pipelines. If you describe an API you like, I can implement it for this PR.

Otherwise, I think this is ready. It works and it is tested. The changes to GoogleStorageFS suck, but that's due to the reality of the GCS API.
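
For concreteness, here is one possible shape for the flag-based configuration. The flag names and bucket names below are placeholders, not a settled interface; the API question above is exactly what they should be called.

```python
import hail as hl

hl.init()

# Placeholder flag names -- treat this as a sketch, not the final interface.
hl._set_flags(
    gcs_requester_pays_project='my-billing-project',
    gcs_requester_pays_buckets='requester-pays-bucket-1,requester-pays-bucket-2',
)

# With the flags set, reads from a requester pays bucket are billed to
# 'my-billing-project'.
ht = hl.read_table('gs://requester-pays-bucket-1/some-table.ht')
```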

danking avatar Aug 25 '22 12:08 danking

Also, I'm gonna add parameters to hl.init and set the flags in there. That punts the interface decision down the road by slightly restricting users (you can't change user project mid-pipeline).

danking avatar Aug 29 '22 15:08 danking

OK, I don't love specifying it in hl.init, but that is implemented here. Assuming the tests pass, this should work, and it also means that users don't have to restart their Dataproc clusters to turn on requester pays: they can hl.stop and hl.init.
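
Roughly, the workflow this enables looks like the following. The hl.init parameter name here is a stand-in, not necessarily the final spelling.

```python
import hail as hl

# Stand-in parameter name; the point is that requester pays billing is fixed
# at init time rather than changeable mid-pipeline.
hl.init(gcs_requester_pays_configuration='project-a')
ht = hl.read_table('gs://some-requester-pays-bucket/data.ht')
ht.count()

# To change the billing project, stop and re-init -- no cluster restart needed.
hl.stop()
hl.init(gcs_requester_pays_configuration='project-b')
```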

danking avatar Aug 29 '22 16:08 danking

bump

danking avatar Sep 08 '22 21:09 danking

Disregard; this still seems broken.

danking avatar Sep 08 '22 21:09 danking

The layers of wtf really seem to have no end here.

Hadoop appears to include the Configuration in the cache key for its FileSystem cache, but the key's constructor actually ignores it. Ergo, even if you stop the Hail context and start a new Hail context with a new Hadoop Configuration, you'll get a FileSystem configured by the first Configuration.

I'm looking for a way around this now.
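
Here's a minimal demonstration of the surprise, poking at the Hadoop statics through the Spark JVM gateway. The bucket is made up, and `fs.gs.requester.pays.mode` is (as I understand it) a GCS connector setting; the details are illustrative.

```python
# Sketch: two different Configurations for the same scheme/authority hand back
# the same cached FileSystem instance, so the second Configuration never matters.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
jvm = sc._jvm

conf_a = jvm.org.apache.hadoop.conf.Configuration()
conf_a.set('fs.gs.requester.pays.mode', 'AUTO')

conf_b = jvm.org.apache.hadoop.conf.Configuration()  # no requester pays setting

uri = jvm.java.net.URI.create('gs://made-up-bucket/')
fs_a = jvm.org.apache.hadoop.fs.FileSystem.get(uri, conf_a)
fs_b = jvm.org.apache.hadoop.fs.FileSystem.get(uri, conf_b)

assert fs_a.equals(fs_b)  # same cached instance; conf_b was effectively ignored
```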

danking avatar Sep 08 '22 23:09 danking

OK, I added some code to empty the cache when we hl.stop.
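
Roughly, "empty the cache" means something like this (a sketch, not the exact code in the PR):

```python
# Sketch of the workaround, not the actual patch. From Python the same Hadoop
# statics are reachable through the Spark JVM gateway.
from pyspark import SparkContext

sc = SparkContext.getOrCreate()
jvm = sc._jvm

# Option 1: drop every cached FileSystem so the next lookup rebuilds one from
# the (new) Hadoop Configuration.
jvm.org.apache.hadoop.fs.FileSystem.closeAll()

# Option 2: bypass the cache for gs:// entirely using Hadoop's standard
# per-scheme knob.
sc._jsc.hadoopConfiguration().set('fs.gs.impl.disable.cache', 'true')
```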

danking avatar Sep 08 '22 23:09 danking

The failure here:

E           org.apache.spark.SparkException: Job aborted due to stage failure: Task 6 in stage 2.0 failed 1 times, most recent failure: Lost task 6.0 in stage 2.0 (TID 8) (hostname-c5956f6f02 executor driver): java.io.EOFException: Invalid seek offset: position value (6) must be between 0 and 6 for 'gs://hail-services-requester-pays/hello'
E           	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.validatePosition(GoogleCloudStorageReadChannel.java:665)
E           	at com.google.cloud.hadoop.repackaged.gcs.com.google.cloud.hadoop.gcsio.GoogleCloudStorageReadChannel.position(GoogleCloudStorageReadChannel.java:546)
E           	at com.google.cloud.hadoop.fs.gcs.GoogleHadoopFSInputStream.seek(GoogleHadoopFSInputStream.java:178)
E           	at org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStream.java:65)

I hit this same error in Avro/GVS work recently -- I think the Google Hadoop connector gets this wrong: it won't let you seek to the end of a file (position N, where N is the number of bytes in the file).

tpoterba avatar Sep 13 '22 15:09 tpoterba

ripping my hair out

danking avatar Sep 13 '22 15:09 danking

Truly a cursed PR. I added a new file with 10 rows so we don't have any empty partitions.

danking avatar Sep 16 '22 16:09 danking

FWIW, Hadoop documents this behavior even though it seems out of step with other FS implementations like FileInputStream. I can't find documentation on what lseek does in, say, Linux, but the internet suggests seeking to the end of a file is fine there.

danking avatar Sep 16 '22 16:09 danking

Seeking to the end of a file is valid in multiple other implementations (I checked Rust and CPython).
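
For example, in CPython, seeking to exactly the end of a file succeeds and a subsequent read just returns empty bytes:

```python
import os
import tempfile

# Write a 6-byte file, mirroring the 6-byte 'hello' object in the traceback above.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b'hello\n')
    path = f.name

size = os.path.getsize(path)  # 6
with open(path, 'rb') as f:
    f.seek(size)              # seeking to EOF is allowed
    assert f.read() == b''    # reading at EOF returns empty bytes

os.remove(path)
```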

tpoterba avatar Sep 16 '22 17:09 tpoterba