dataflow-runner issues

Support IAM roles

1

See https://github.com/snowplow/snowplow/issues/1263

Consider supporting spot instances

2

The situation for Spark jobs running on spot instances in EMR has been improving recently (https://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html), so it might be worth supporting them.

BenFradet

Support temporary cluster-id file for transient cluster

1

Instead of writing down the ID of the created cluster, we can do the following: 1. After the `up` command, create a `.cluster-id` file containing the cluster ID in the current directory. 2. If...

chuwy

Add jobflow name-based locking

2

This is a poor man's #17 - you use the EMR jobflow/cluster name to prevent multiple jobs from running at the same time. The jobflow will refuse to start if...

alexanderdean
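
One way the name-based lock above might look, as a sketch only. The `activeCluster` type and `checkNameLock` function are hypothetical. The list of active clusters is assumed to come from an EMR listing call that is not shown, and the exact refusal condition is truncated in the description, so "any active cluster with the same name" is an assumption.

```go
package main

import "fmt"

// activeCluster is a hypothetical, simplified record of a cluster
// that has not yet terminated.
type activeCluster struct {
	ID   string
	Name string
}

// checkNameLock returns an error if any active cluster already uses name.
// A nil result means the name is free and the jobflow may start.
func checkNameLock(active []activeCluster, name string) error {
	for _, c := range active {
		if c.Name == name {
			return fmt.Errorf("cluster %q is already running as %s; refusing to start", name, c.ID)
		}
	}
	return nil
}
```

This check is not atomic: two launches started at nearly the same time could both pass it, which fits the "poor man's" framing.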

Add ability to auto-discover cluster

3

This would be based on name and age: rather than passing the jobflow ID, you would pass the jobflow name, and dataflow-runner would select the newest cluster available that...

jbeemster

Add GitHub templates

BenFradet

Long playbook (9 steps+) run-transient mode issue

I have observed a reproducible issue when running dataflow-runner in `run-transient` mode with playbooks longer than 8 steps. The error one may observe is _400: Throughput exceeded_. Once the error...

grzegorzewald

Add EMR cluster resize command

1

This command would check whether the spec of the currently running cluster matches what is specified in the config. If not, it would trigger a resize. This would let...

jbeemster
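
A rough sketch of the comparison step described above. It assumes that "cluster spec" can be reduced to an instance count per instance group (for example "core" and "task"), which is a simplification. `resizePlan` is a hypothetical name, and the EMR call that would apply the plan is not shown.

```go
package main

// resizePlan compares running instance counts against configured ones.
// Both maps go from instance group name to instance count.
// The result holds the target count for every group that differs;
// an empty result means no resize is needed.
func resizePlan(running, configured map[string]int) map[string]int {
	plan := make(map[string]int)
	for group, want := range configured {
		have, ok := running[group]
		if !ok || have != want {
			plan[group] = want
		}
	}
	return plan
}
```

Groups present in the running cluster but absent from the config are ignored here; whether they should be scaled down is a design decision the issue does not settle.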

Migrate to goavro

1

The library we're using to handle Avro records (https://github.com/elodina/go-avro) hasn't seen any activity for almost two years. Furthermore, https://github.com/linkedin/goavro seems to be more feature-complete.

BenFradet

Explore options around specifying ways to react to step failures

11

See discussion in #11

BenFradet
