Using PipelineParameter to set the output file path in OutputFileDatasetConfig for each new run against a PublishedPipeline
What example? Describe it
How can we dynamically change the output file path of an OutputFileDatasetConfig for a PublishedPipeline, given that we cannot set variables at the time the definition is published? E.g. this would not work: `OutputFileDatasetConfig(name="processed_data", destination=(datastore, f"mypath/{today}/{output_dataset_name}")).as_upload()` because the f-string is evaluated once, when the pipeline is defined, and there's no way to update the parameterized file path when you submit a run against that PublishedPipeline later on.
PipelineParameters, you say?! Well, when using a PipelineParameter to try to set the output path at runtime, we get something that looks like this issue.
So, this does not work:
```python
from azureml.core import Datastore
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import PipelineParameter
from azureml.pipeline.steps import PythonScriptStep

# Create the dataset output path from a pipeline param so it can be changed at runtime
data_output_path_pipeline_param = PipelineParameter(
    name="data_output_path",
    default_value="default_value"
)

# Create output dataset
output_data = OutputFileDatasetConfig(
    name="dataset_output",
    destination=(Datastore(workspace, name="datastore_name"), data_output_path_pipeline_param)
).as_upload(overwrite=True)

### ignoring inputs here for brevity ###

### Create pipeline step ###
# Pass the input dataset into step1 and upload its output to the
# data_output_path_pipeline_param destination
step1 = PythonScriptStep(
    script_name="script.py",  # doesn't matter what this does
    source_directory="src/",
    name="Step 1",
    arguments=["--dataset-name", input_dataset_name_pipeline_param],
    inputs=[tabular_ds_consumption],
    outputs=[output_data]
)

### ignoring publishing here for brevity ###

# Pass in the data output path we want to use for this run
experiment.submit(
    published_pipeline,  # use the pipeline we defined and published earlier
    pipeline_parameters={
        "input_dataset_name": dataset_name,
        "data_output_path": f"base_data_pull/{dataset_name}/{today}/{dataset_name}.parquet"
    }
)
```
because the pipeline just "uploads" the OutputFileDataset to a bogus `PipelineParameter_Name:data_output_path_Default:base_data_pull/{dataset_name}/{today}/{dataset_name}.parquet` path on the datastore: the parameter's string representation gets baked into the destination instead of its runtime value.
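For what it's worth, the built-in `{run-id}` and `{output-name}` placeholders in the `destination` string (the same ones used by the SDK's default destination) do get resolved at runtime. A minimal sketch, which only helps if run-scoped uniqueness is enough rather than an arbitrary caller-chosen path:

```python
# Sketch: per-run unique output path via the placeholders that
# OutputFileDatasetConfig resolves at runtime ({run-id}, {output-name}).
# This does NOT let the caller pick an arbitrary path at submission time.
output_data = OutputFileDatasetConfig(
    name="dataset_output",
    destination=(Datastore(workspace, name="datastore_name"),
                 "base_data_pull/{run-id}/{output-name}")
).as_upload(overwrite=True)
```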
TLDR
How should we pass a PipelineParameter into a Pipeline so that, after it's published, we can specify exactly where the output data should land on a datastore (ADLS or Blob) each time we submit a run against that PublishedPipeline?
If this isn't the right way to use PipelineParameter, what is the right way to get the intended behavior?
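One workaround sketch we've considered (not a confirmed pattern; the `--data-output-path` argument and `datastore_name` are assumed names, and it assumes a Blob-backed datastore since `datastore.upload` isn't available on every datastore type): skip `OutputFileDatasetConfig` for this output, pass the desired path to the step as a plain string `PipelineParameter` via `arguments=["--data-output-path", data_output_path_pipeline_param]`, and let the script upload the file itself, since string parameters passed as script arguments do resolve to their runtime values:

```python
# script.py -- sketch of a script-side upload workaround
import argparse
import os

from azureml.core import Datastore, Run

parser = argparse.ArgumentParser()
parser.add_argument("--data-output-path", type=str, required=True)
args = parser.parse_args()

# Resolve the workspace from the run context instead of a config file
run = Run.get_context()
datastore = Datastore.get(run.experiment.workspace, "datastore_name")  # assumed name

# Write the processed data locally first
local_dir = "processed"
os.makedirs(local_dir, exist_ok=True)
# ... write e.g. processed_data.parquet into local_dir ...

# Upload to exactly the prefix passed in at submission time;
# note that target_path is treated as a folder path on the datastore
datastore.upload(
    src_dir=local_dir,
    target_path=args.data_output_path,
    overwrite=True
)
```

The tradeoff is that the output is no longer tracked as an `OutputFileDatasetConfig` with dataset lineage; the step just writes files to the datastore directly.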
@caitriggs have you found a workaround? I just hit the same issue.