
Using PipelineParameter to set the output file path in OutputFileDatasetConfig for each new run against a PublishedPipeline

Open · caitriggs opened this issue 2 years ago • 1 comment

What example? Describe it

How can we dynamically change the output file path of OutputFileDatasetConfig for a PublishedPipeline, given that we cannot set variables at the time of publishing a definition? For example, OutputFileDatasetConfig(name="processed_data", destination=(datastore, f"mypath/{today}/{output_dataset_name}")).as_upload() does not work, because there is no way to update the parameterized file path when you submit a run against that PublishedPipeline later on.
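For context, here is a minimal sketch of that static-destination pattern (the datastore name, path segments, and variable names are hypothetical). Because the f-string is evaluated once, when the pipeline is defined and published, the path is frozen into the pipeline definition and cannot change per run:

from datetime import date

from azureml.core import Datastore, Workspace
from azureml.data import OutputFileDatasetConfig

workspace = Workspace.from_config()
datastore = Datastore(workspace, name="datastore_name")  # hypothetical datastore name

# Evaluated at definition/publish time, not at each submitted run
today = date.today().strftime("%Y-%m-%d")
output_dataset_name = "processed_data"

output_data = OutputFileDatasetConfig(
    name="processed_data",
    destination=(datastore, f"mypath/{today}/{output_dataset_name}")
).as_upload()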

PipelineParameters, you say?! Well, when using a PipelineParameter to try to set the output path at runtime, we get something that looks like this issue.

So, this does not work:

from azureml.core import Datastore
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# Create the dataset output path from a pipeline param so it can be changed at runtime
data_output_path_pipeline_param = PipelineParameter(
    name="data_output_path",
    default_value='default_value'
)

# Create the output dataset, uploading to the parameterized destination
output_data = OutputFileDatasetConfig(
    name="dataset_output",
    destination=(Datastore(workspace, name='datastore_name'), data_output_path_pipeline_param)
).as_upload(overwrite=True)

### ignoring inputs here for brevity ###

### Create pipeline step ###

# Pass input dataset into step1 and upload output to data_output_path_pipeline_param destination
step1 = PythonScriptStep(
    script_name="script.py", # doesn't matter what this does
    source_directory="src/",
    name="Step 1", 
    arguments=["--dataset-name", input_dataset_name_pipeline_param],
    inputs=[tabular_ds_consumption],
    outputs=[output_data]
)

### ignoring publishing here for brevity ###

# Pass in the data output path we want to use for this run
experiment.submit(
    published_pipeline,  # the pipeline we defined and published earlier
    pipeline_parameters={
        "input_dataset_name": dataset_name,
        "data_output_path": f"base_data_pull/{dataset_name}/{today}/{dataset_name}.parquet"
    }
)

because the pipeline just "uploads" the OutputFileDataset to a bogus path on the datastore that is literally named "PipelineParameter_Name:data_output_path_Default:base_data_pull/{dataset_name}/{today}/{dataset_name}.parquet", instead of resolving the parameter.

TLDR

How should we pass a PipelineParameter into a Pipeline so that, after the pipeline is published and we submit a run against the PublishedPipeline, we can specify exactly where the output data should land on a datastore (ADLS or Blob) for that run?

If this isn't the right way to use PipelineParameter, what is the right way to get the intended behavior?
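One possible workaround (a sketch only, not an official fix; the compute target and datastore names are hypothetical, and this assumes a Blob or File datastore): skip binding the PipelineParameter to the OutputFileDatasetConfig destination entirely, pass the target path into the step as a plain string argument, and let the script upload its results to the datastore itself.

from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# Plain string parameter; PythonScriptStep resolves it to the run-time value
data_output_path_pipeline_param = PipelineParameter(
    name="data_output_path",
    default_value="base_data_pull/default"
)

step1 = PythonScriptStep(
    script_name="script.py",
    source_directory="src/",
    name="Step 1",
    arguments=["--data-output-path", data_output_path_pipeline_param],
    compute_target="cpu-cluster"  # hypothetical compute target
)

Inside src/script.py the argument arrives as a normal string, so the script can upload wherever it points (Datastore.upload works for Blob and File datastores; an ADLS Gen2 datastore may need a different upload mechanism):

import argparse

from azureml.core import Datastore, Run

parser = argparse.ArgumentParser()
parser.add_argument("--data-output-path", dest="data_output_path")
args = parser.parse_args()

run = Run.get_context()
datastore = Datastore(run.experiment.workspace, name="datastore_name")  # hypothetical name

# ... write the processed files to a local folder, e.g. ./outputs/processed ...

datastore.upload(
    src_dir="./outputs/processed",
    target_path=args.data_output_path,
    overwrite=True
)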

caitriggs · Mar 31 '22 00:03

@caitriggs have you found a workaround? I just hit the same issue.

rubberduck203 · May 17 '22 14:05