Add support for GCP, starting with GCS
This PR adds support for GCS as a datastore using the GCS Python client (no optimizations on top of the vanilla client). All of the patterns adopted here were taken directly from the S3 client, including retries, filepaths and "doneness".
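For readers skimming the diff, the overall shape is roughly the following; the class and method names here are illustrative rather than the exact code in the PR, but the pattern (vanilla client calls, blanket retries, and a "done" marker object written last) mirrors the S3 datastore:

```python
# Minimal sketch of the GCS datastore pattern, mirroring the S3 client:
# plain google-cloud-storage calls wrapped in simple retries, plus a
# "done" marker object written last to signal a completed task.
import time

from google.cloud import storage

GCS_NUM_RETRIES = 7  # illustrative; the S3 client also retries a fixed number of times


class GCSDataStore(object):
    def __init__(self, bucket_name):
        self._client = storage.Client()
        self._bucket = self._client.bucket(bucket_name)

    def _with_retries(self, fn):
        # Blanket retry with exponential backoff, as in the S3 client.
        for attempt in range(GCS_NUM_RETRIES):
            try:
                return fn()
            except Exception:
                if attempt == GCS_NUM_RETRIES - 1:
                    raise
                time.sleep(2 ** attempt)

    def save_bytes(self, path, data):
        blob = self._bucket.blob(path)
        self._with_retries(lambda: blob.upload_from_string(data))

    def load_bytes(self, path):
        blob = self._bucket.blob(path)
        return self._with_retries(blob.download_as_bytes)

    def mark_done(self, path):
        # "Doneness": an empty marker object written after all artifacts.
        self.save_bytes(path + '/DONE', b'')

    def is_done(self, path):
        return self._with_retries(self._bucket.blob(path + '/DONE').exists)
```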
I previously attempted to integrate the existing S3 boto client with GCS using HMAC, but unfortunately the boto client performs multi-part uploads, which are not natively supported by GCS, and this breaks on uploads larger than 150 MB.
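For context, the abandoned approach looked roughly like the snippet below (the HMAC credentials, bucket, and key are placeholders). boto3/s3transfer silently switch to multipart uploads above a size threshold, and GCS's XML interoperability API did not accept S3-style multipart requests at the time, which is what broke the large uploads:

```python
# Rough illustration of the abandoned boto3-over-HMAC approach (credentials are placeholders).
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client(
    's3',
    endpoint_url='https://storage.googleapis.com',  # GCS XML interoperability API
    aws_access_key_id='GOOG_HMAC_ACCESS_KEY',       # placeholder HMAC credentials
    aws_secret_access_key='GOOG_HMAC_SECRET',
)

# boto3 switches to multipart uploads above multipart_threshold; GCS did not
# accept S3-style multipart requests, so large uploads (>150 MB in practice)
# failed. Raising the threshold so multipart never kicks in is the only
# workaround, and then every upload becomes a single PUT.
config = TransferConfig(multipart_threshold=5 * 1024 ** 3)
s3.upload_file('large_artifact.bin', 'my-gcs-bucket', 'path/to/key', Config=config)
```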
Related issue: https://github.com/Netflix/metaflow/issues/22
@jaychia Thanks for the PR! Can you add to the PR testing strategy and any performance analysis? Also, do you intend to contribute the rest of the pieces around compute/orchestration for GCP?
@savingoyal
- Benchmarking: do we have existing benchmark FlowSpecs that I can run on the sandbox, if I can be granted access to one? I can benchmark this against running a job in a GCE instance.
- Testing: I don't see any unit tests for the existing S3 datastore, but I am happy to run through a suite of tests that we can agree on:
  - Running a flow end-to-end with branching, fanout and random error logic (sketched below)
  - Loading each artifact from the run, from each step and from each task
  - Loading stdout/stderr from each step
  - Injecting random error logic in the client during upload/download/authentication
  - Resuming a run after failure
Feel free to add anything that I may be missing.
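To make the first item concrete, a hypothetical test flow along these lines (the flow name, fanout width, and failure rate are placeholders, not part of this PR) would exercise foreach fanout, retries, and injected random failures against the GCS datastore:

```python
# Hypothetical test flow: fanout, retries, and random failures against the GCS datastore.
import random

from metaflow import FlowSpec, retry, step


class GCSDataStoreTestFlow(FlowSpec):

    @step
    def start(self):
        self.items = list(range(20))
        self.next(self.fanout, foreach='items')

    @retry(times=2)  # exercises retry behavior on top of the datastore
    @step
    def fanout(self):
        # Inject random failures so retries and `resume` can be exercised.
        if random.random() < 0.1:
            raise RuntimeError('injected failure in task %s' % self.input)
        self.result = self.input * 2
        self.next(self.join)

    @step
    def join(self, inputs):
        self.results = sorted(i.result for i in inputs)
        self.next(self.end)

    @step
    def end(self):
        print('results:', self.results)


if __name__ == '__main__':
    GCSDataStoreTestFlow()
```

Afterwards, the Metaflow client API (e.g. `Flow('GCSDataStoreTestFlow').latest_run`) can be used to walk each step and task and read back artifacts and stdout/stderr, which covers the remaining items above.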
- Compute/Orchestration: Currently I run everything locally on a large instance in GCP (within a Kubernetes cluster, actually).
GCP doesn't have a service that I'm convinced can match AWS Batch for getting us to a "serverless computing" level of integration. The proposed solutions are quite complicated and require chaining together a bunch of GCP services (Cloud Scheduler -> Pub/Sub -> Cloud Functions -> spin up a GCE VM -> run the batch workload -> spin the VM back down). There is also Cloud Run, but it seems to be meant for request-response workloads rather than batch jobs.
I might argue that we could start by providing batch workload operators for Kubernetes, with something like Volcano, which uses kube-batch (also used by Kubeflow) under the hood. Spinning up a Kubernetes cluster in GCP with GKE is very simple (literally a single click), and with some appropriate node pool configuration this might be the lowest-hanging fruit for getting compute up on GCP.
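To make that concrete, submitting a containerized step as a plain Kubernetes batch Job on GKE is roughly this much code with the official Python client (the image, namespace, command, and resource values are placeholders); Volcano/kube-batch would add queueing and gang scheduling on top of these same primitives:

```python
# Rough sketch: run a containerized step as a Kubernetes batch/v1 Job on GKE.
# Image, namespace, command, and resources are placeholders.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

container = client.V1Container(
    name='metaflow-step',
    image='gcr.io/my-project/my-flow:latest',
    command=['python', 'flow.py', 'step', 'start'],
    resources=client.V1ResourceRequirements(requests={'cpu': '4', 'memory': '16Gi'}),
)
job = client.V1Job(
    metadata=client.V1ObjectMeta(generate_name='metaflow-step-'),
    spec=client.V1JobSpec(
        backoff_limit=3,
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(containers=[container], restart_policy='Never')
        ),
    ),
)
client.BatchV1Api().create_namespaced_job(namespace='default', body=job)
```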
Happy to hear your thoughts also.
@jaychia: Is the datastore stable? I have built a Kubernetes plugin on my fork of Metaflow. I haven't used your plugin yet, but you can try mine out with GKE and GCP's datastore, using both our plugins. Currently, my plugin is tied to S3 at the KubeDecorator.task_finished part. We can modify that to support GCS too; it's extensible enough to support your datastore.
Also, reading through your datastore code, I noticed that you haven't changed anything in the distributed execution components, like Metaflow's Batch decorator. If you are integrating a datastore, won't you also have to integrate its support for Metaflow's distributed execution capabilities?
The datastore is stable; I've been using it for a few weeks now for fairly large workflows (~100 tasks per flow), but run locally from a large GCE instance.
Yes, my intention was to support only things within the GCP ecosystem, which is why I didn't do any AWS Batch integration. From a code perspective, I think it will be cleaner for the distributed exec components to make use of the datastore API anyway (i.e. they should be datastore-backend-agnostic).
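As a rough illustration of what "datastore-backend-agnostic" could mean here (hypothetical names, not Metaflow's actual internals), the execution decorators would resolve the backend by name rather than constructing an S3 client directly:

```python
# Hypothetical sketch (not Metaflow's actual internals): execution plugins
# look up the datastore backend by name instead of hard-coding S3, so adding
# a GCS backend requires no changes to the decorators themselves.
class S3DataStore:
    def __init__(self, root):
        self.root = root


class GCSDataStore:
    def __init__(self, root):
        self.root = root


DATASTORE_BACKENDS = {'s3': S3DataStore, 'gs': GCSDataStore}


def get_datastore(ds_type, root):
    try:
        return DATASTORE_BACKENDS[ds_type](root)
    except KeyError:
        raise ValueError('unknown datastore backend: %s' % ds_type)


# A Batch/Kubernetes decorator would call get_datastore(configured_type, root)
# rather than instantiating an S3-specific client.
```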
I'll go ahead with testing and benchmarking this sometime this week if @savingoyal can grant me sandbox access; I just requested access :)
I also intended to contribute the distributed execution functionality in a separate PR - your fork seems to be a great place to start though!
This is awesome. Just wanted to check in on the status and if this is still something you're looking to integrate in the near future. @jaychia
Yup! You can check on the progress here: https://github.com/freenome/metaflow
We've been talking with the folks at Netflix and want to push this out together with the PR for Kubernetes support which should integrate pretty seamlessly with GKE, so that we have complete GCP support from the datastore to the execution backend.
Hello guys, what is the status of the GCS integration? Are you still pushing for it?
Closing this PR since more official support is going to land in the next few weeks.
Is there a ticket or PR I can watch to get updates on GCP support? Thanks
Hi @savingoyal, are there any updates on Metaflow's support for GCP?
There is an initial PR #1135.