Fix submission of Dataflow jobs
Today, if you try to submit a Dataflow job with a mix of normal pipeline options and Dataflow pipeline options, the job is not submitted correctly.
This is because cfg.command is quoted, so the final command ends up looking like this after adding dataflow_flags:
'
python3 -m apache_beam.examples.wordcount --input=gs://dataflow-samples/shakespeare/kinglear.txt --output=gs://ttl-30d-us-central2/axlearn/users/remyw/dataflow/wordcount' --dataflow_service_options=enable_google_cloud_heap_sampling --dataflow_service_options=enable_secure_boot --experiments=use_network_tags=allow-internet-egress --experiments=use_runner_v2 --machine_type=n2-standard-8 --no_use_public_ips --project=abc --region=us-central1 --runner=DataflowRunner --sdk_container_image=my_container
Note the leading quote as well as the trailing quote after output=gs://ttl-30d-us-central2/axlearn/users/remyw/dataflow/wordcount' in the command above.
This breaks the processing of the command: all the subsequent dataflow_flags are ignored, so the job runs locally instead of on Dataflow.
To fix this, we just need to strip the quotes around cfg.command before adding it to our full command.
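As a rough illustration of the idea (not the actual change in this PR), here is a minimal sketch. The helper name `build_full_command` and the sample flags are hypothetical, and it assumes the quoting comes from something like `shlex.quote`; the real code may quote `cfg.command` differently.

```python
import shlex


def build_full_command(command: str, dataflow_flags: list[str]) -> str:
    """Hypothetical helper: unquote `command`, then append the Dataflow flags."""
    # Drop surrounding whitespace, then any wrapping single/double quotes,
    # then whitespace that was inside the quotes (e.g. a leading newline).
    # Caveat: this would also strip a quote that legitimately ends the command.
    unquoted = command.strip().strip("'\"").strip()
    return " ".join([unquoted, *dataflow_flags])


# Assumed example inputs, shortened from the command in the description.
quoted = shlex.quote("python3 -m apache_beam.examples.wordcount --input=a --output=b")
flags = ["--project=abc", "--runner=DataflowRunner"]

# Without stripping, the whole quoted command is a single shell word.
broken = " ".join([quoted, *flags])
fixed = build_full_command(quoted, flags)

print(len(shlex.split(broken)))  # quoted command counts as one token
print(len(shlex.split(fixed)))   # each option is its own token
```

With the quotes stripped, the pipeline options and the Dataflow flags end up as separate arguments of the same command.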
Thanks! I reformatted the code to fix the precommit check, since it looks like it was failing.
Closing this PR due to inactivity. Please re-open or file a new PR if this is still important.
I don't have permission to reopen this PR, but I think it is still valid. @Ethanlm, would you mind taking a look? It already has multiple approvals.
Let me ping @zhiyun to take a look. Thank you!
Hey @zhiyun, did you have a chance to take a look at this?
This pull request has been automatically marked as stale because it has been inactive for 60 days. It will be closed in 7 days if no further activity occurs. If you would like to continue working on this, please remove the stale label or leave a comment.
This should still be valid