
Using PipelineParameter to set the output file path in OutputFileDatasetConfig for each new run against a PublishedPipeline

Open · caitriggs opened this issue 2 years ago • 1 comment

What example? Describe it

How can we dynamically change the output file path of OutputFileDatasetConfig for a PublishedPipeline, given that we cannot set variables at the time of publishing a definition? For example, OutputFileDatasetConfig(name="processed_data", destination=(datastore, f"mypath/{today}/{output_dataset_name}")).as_upload() does not work, because there is no way to update the parameterized file path when you submit a run against that PublishedPipeline later on.
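For context, here is a minimal sketch of that static-destination pattern (the datastore name, path segments, and variable names are hypothetical). Because the f-string is evaluated once, when the pipeline is defined and published, the path is frozen into the pipeline definition and cannot change per run:

from datetime import date

from azureml.core import Datastore, Workspace
from azureml.data import OutputFileDatasetConfig

workspace = Workspace.from_config()
datastore = Datastore(workspace, name="datastore_name")  # hypothetical datastore name

# Evaluated at definition/publish time, not at each submitted run
today = date.today().strftime("%Y-%m-%d")
output_dataset_name = "processed_data"

output_data = OutputFileDatasetConfig(
    name="processed_data",
    destination=(datastore, f"mypath/{today}/{output_dataset_name}")
).as_upload()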

PipelineParameters, you say?! Well, when using a PipelineParameter to try to set the output path at runtime, we get something that looks like this issue.

So, this does not work:

from azureml.core import Datastore
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# Create the dataset output path from a pipeline param so it can be changed at runtime
data_output_path_pipeline_param = PipelineParameter(
    name="data_output_path",
    default_value='default_value'
)

# Create the output dataset, uploading to the parameterized destination
output_data = OutputFileDatasetConfig(
    name="dataset_output",
    destination=(Datastore(workspace, name='datastore_name'), data_output_path_pipeline_param)
).as_upload(overwrite=True)

### ignoring inputs here for brevity ###

### Create pipeline step ###

# Pass input dataset into step1 and upload output to data_output_path_pipeline_param destination
step1 = PythonScriptStep(
    script_name="script.py", # doesn't matter what this does
    source_directory="src/",
    name="Step 1", 
    arguments=["--dataset-name", input_dataset_name_pipeline_param],
    inputs=[tabular_ds_consumption],
    outputs=[output_data]
)

### ignoring publishing here for brevity ###

# Pass in the data output path we want to use for this run
experiment.submit(
    published_pipeline,  # the pipeline we defined and published earlier
    pipeline_parameters={
        "input_dataset_name": dataset_name,
        "data_output_path": f"base_data_pull/{dataset_name}/{today}/{dataset_name}.parquet"
    }
)

because the pipeline just "uploads" the OutputFileDataset to a bogus path on the datastore that is literally named "PipelineParameter_Name:data_output_path_Default:base_data_pull/{dataset_name}/{today}/{dataset_name}.parquet", instead of resolving the parameter.

TLDR

How should we pass a PipelineParameter into a Pipeline so that, after the pipeline is published and we submit a run against the PublishedPipeline, we can specify exactly where the output data should land on a datastore (ADLS or Blob) for that run?

If this isn't the right way to use PipelineParameter, what is the right way to get the intended behavior?
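One possible workaround (a sketch only, not an official fix; the compute target and datastore names are hypothetical, and this assumes a Blob or File datastore): skip binding the PipelineParameter to the OutputFileDatasetConfig destination entirely, pass the target path into the step as a plain string argument, and let the script upload its results to the datastore itself.

from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# Plain string parameter; PythonScriptStep resolves it to the run-time value
data_output_path_pipeline_param = PipelineParameter(
    name="data_output_path",
    default_value="base_data_pull/default"
)

step1 = PythonScriptStep(
    script_name="script.py",
    source_directory="src/",
    name="Step 1",
    arguments=["--data-output-path", data_output_path_pipeline_param],
    compute_target="cpu-cluster"  # hypothetical compute target
)

Inside src/script.py the argument arrives as a normal string, so the script can upload wherever it points (Datastore.upload works for Blob and File datastores; an ADLS Gen2 datastore may need a different upload mechanism):

import argparse

from azureml.core import Datastore, Run

parser = argparse.ArgumentParser()
parser.add_argument("--data-output-path", dest="data_output_path")
args = parser.parse_args()

run = Run.get_context()
datastore = Datastore(run.experiment.workspace, name="datastore_name")  # hypothetical name

# ... write the processed files to a local folder, e.g. ./outputs/processed ...

datastore.upload(
    src_dir="./outputs/processed",
    target_path=args.data_output_path,
    overwrite=True
)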

caitriggs · Mar 31 '22 00:03

@caitriggs have you found a workaround? I just hit the same issue.

rubberduck203 · May 17 '22 14:05