MachineLearningNotebooks
ParallelRunStep on Intermediate Partitioned File Dataset Failing
Hi,

I am attempting to use `ParallelRunStep` for a batch training job. The data being ingested for training is a partitioned file dataset, so I am using this example notebook as a template. In my script, the batch training step is preceded by a data-pulling step, so the input to the training step is of type `OutputDatasetConfig` rather than a dataset created with `Dataset.File.from_files`, which is how it is done in the example notebook. Because of this, the training step fails with the following error:
```
azureml_common.parallel_run.exception_info.Exception: Run failed. Below is the error detail:
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/jobs/sdpesp-dev-cvx/azureml/7d68a66e-aa64-4732-a159-22d477774715/wd/azureml/7d68a66e-aa64-4732-a159-22d477774715/driver/simulator.py", line 93, in main
    simulator.wait()
  File "/mnt/batch/tasks/shared/LS_root/jobs/sdpesp-dev-cvx/azureml/7d68a66e-aa64-4732-a159-22d477774715/wd/7d68a66e-aa64-4732-a159-22d477774715/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/masterless_simulator.py", line 122, in wait
    ProgressReport().save()
  File "/mnt/batch/tasks/shared/LS_root/jobs/sdpesp-dev-cvx/azureml/7d68a66e-aa64-4732-a159-22d477774715/wd/7d68a66e-aa64-4732-a159-22d477774715/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/progress_report.py", line 191, in save
    task_exporter.save()
  File "/mnt/batch/tasks/shared/LS_root/jobs/sdpesp-dev-cvx/azureml/7d68a66e-aa64-4732-a159-22d477774715/wd/7d68a66e-aa64-4732-a159-22d477774715/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/task_exporter.py", line 113, in save
    self.save_remaining()
  File "/mnt/batch/tasks/shared/LS_root/jobs/sdpesp-dev-cvx/azureml/7d68a66e-aa64-4732-a159-22d477774715/wd/7d68a66e-aa64-4732-a159-22d477774715/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/task_exporter.py", line 100, in save_remaining
    for task in total_tasks:
  File "/mnt/batch/tasks/shared/LS_root/jobs/sdpesp-dev-cvx/azureml/7d68a66e-aa64-4732-a159-22d477774715/wd/7d68a66e-aa64-4732-a159-22d477774715/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/partition_by_keys_provider.py", line 70, in get_tasks
    DatasetHelper().save(dataset)
  File "/mnt/batch/tasks/shared/LS_root/jobs/sdpesp-dev-cvx/azureml/7d68a66e-aa64-4732-a159-22d477774715/wd/azureml/7d68a66e-aa64-4732-a159-22d477774715/driver/azureml_common/parallel_run/dataset_helper.py", line 29, in save
    self.logger.info("Dump dataset {} as preppy files to local directory.".format(dataset.id))
AttributeError: 'str' object has no attribute 'id'
```
Could this possibly be an issue with `ParallelRunStep` being unable to accept a partitioned `OutputDatasetConfig` as an input? Thanks in advance!
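For reference, a sketch contrasting the two input styles in question (the helper name, paths, and glob are made up for illustration; the azureml import is deferred so the sketch reads standalone):

```python
def build_training_input(datastore, data_pull_step_output=None):
    """Hypothetical helper contrasting the two ways the ParallelRunStep
    input gets built.

    The example notebook feeds ParallelRunStep a FileDataset created
    directly from stored files; this script instead passes along the
    output of a preceding data-pulling step (an OutputDatasetConfig),
    which is the case that fails.
    """
    if data_pull_step_output is not None:
        # This script's style: intermediate output of a previous step,
        # promoted to an input for the next step.
        return data_pull_step_output.as_input()

    # Example-notebook style: dataset created straight from stored files.
    # (Import deferred so the branch above works without azureml installed.)
    from azureml.core import Dataset
    return Dataset.File.from_files(path=(datastore, "training_data/**"))
```

Both branches produce something a pipeline step can consume, but only the `FileDataset` branch matches what the example notebook partitions up front.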
Document Details
⚠ Do not edit this section. It is required for docs.microsoft.com ➟ GitHub issue linking.
- ID: f69044d5-213e-a764-31dd-24f8368212b7
- Version Independent ID: 23d38b1c-974a-b2fc-332a-70d7500e1751
- Content: azureml.pipeline.steps.ParallelRunStep class - Azure Machine Learning Python
- Content Source: AzureML-Docset/stable/docs-ref-autogen/azureml-pipeline-steps/azureml.pipeline.steps.ParallelRunStep.yml
- Service: machine-learning
- Sub-service: core
- GitHub Login: @DebFro
- Microsoft Alias: debfro
I'm having the same issue when creating a pipeline with a `ParallelRunStep`: I cannot use an output from a previous step as the input to the `ParallelRunStep`, as documented in this issue.
@jtisbell4 what work around did you end up using?
To consume the output as an input in subsequent pipeline steps, you may need to call `as_input()`: https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.data.output_dataset_config.outputdatasetconfig?view=azure-ml-py
That isn't the problem in this case. Even if you add `.as_input()` to the `OutputTabularDatasetConfig` when passing it into the subsequent `ParallelRunStep`, you still need to pass it in as an already-partitioned dataset (`TabularDataset.partition_by()`), but you cannot call `.partition_by()` on an output-config object:
```python
inputs=[combined_scored_dataset.read_delimited_files().partition_by(
    partition_keys=['model_name'],
    target=DataPath(default_aml_env.get_datastore(), "partitioned_datasets"))],
```

```
AttributeError: 'OutputTabularDatasetConfig' object has no attribute 'partition_by'
```
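One possible workaround (a sketch under assumptions, not a verified fix): `partition_by()` exists only on a materialized `TabularDataset`, so if the upstream step registers its output (e.g. via `OutputTabularDatasetConfig.register_on_complete()`), a later script can fetch the registered dataset by name and partition that. `make_partitioned_input` below is a hypothetical helper; the azureml imports are deferred so the sketch stands alone:

```python
def make_partitioned_input(workspace, dataset_name, partition_keys,
                           datastore, target_path):
    """Hypothetical helper: fetch a registered intermediate dataset and
    partition it so the result can serve as a partitioned input.

    Assumes the upstream step registered its output, e.g.:
        combined = output_config.register_on_complete(name=dataset_name)
    """
    # Imports deferred so this sketch can be read/imported without azureml.
    from azureml.core import Dataset
    from azureml.data.datapath import DataPath

    # A registered TabularDataset, unlike an OutputTabularDatasetConfig,
    # does expose partition_by().
    tabular = Dataset.get_by_name(workspace, name=dataset_name)
    return tabular.partition_by(
        partition_keys=partition_keys,
        target=DataPath(datastore, target_path),
    )
```

Two caveats: `partition_by` writes a partitioned copy to the target datastore path, so this adds an extra materialization; and since `register_on_complete` only registers the dataset at run time, the fetch-and-partition would likely have to run inside an intermediate step's script (after the upstream run completes), not at pipeline construction time.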