
ParallelRunStep on Intermediate Partitioned File Dataset Failing

Open jtisbell4 opened this issue 3 years ago • 4 comments

Hi,

I am attempting to use ParallelRunStep for a batch training job. The data that is being ingested for training is a partitioned file dataset, so I am using this example notebook as a template.

In my pipeline, the batch training step is preceded by a data-pulling step, so the input to the training step is of type OutputDatasetConfig rather than a dataset created with Dataset.File.from_files, which is how it is done in the example notebook. Because of this, the training step fails with the following error:

```
azureml_common.parallel_run.exception_info.Exception: Run failed. Below is the error detail:
Traceback (most recent call last):
  File "/mnt/batch/tasks/shared/LS_root/jobs/sdpesp-dev-cvx/azureml/7d68a66e-aa64-4732-a159-22d477774715/wd/azureml/7d68a66e-aa64-4732-a159-22d477774715/driver/simulator.py", line 93, in main
    simulator.wait()
  File "/mnt/batch/tasks/shared/LS_root/jobs/sdpesp-dev-cvx/azureml/7d68a66e-aa64-4732-a159-22d477774715/wd/7d68a66e-aa64-4732-a159-22d477774715/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/masterless_simulator.py", line 122, in wait
    ProgressReport().save()
  File "/mnt/batch/tasks/shared/LS_root/jobs/sdpesp-dev-cvx/azureml/7d68a66e-aa64-4732-a159-22d477774715/wd/7d68a66e-aa64-4732-a159-22d477774715/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/progress_report.py", line 191, in save
    task_exporter.save()
  File "/mnt/batch/tasks/shared/LS_root/jobs/sdpesp-dev-cvx/azureml/7d68a66e-aa64-4732-a159-22d477774715/wd/7d68a66e-aa64-4732-a159-22d477774715/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/task_exporter.py", line 113, in save
    self.save_remaining()
  File "/mnt/batch/tasks/shared/LS_root/jobs/sdpesp-dev-cvx/azureml/7d68a66e-aa64-4732-a159-22d477774715/wd/7d68a66e-aa64-4732-a159-22d477774715/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/task_exporter.py", line 100, in save_remaining
    for task in total_tasks:
  File "/mnt/batch/tasks/shared/LS_root/jobs/sdpesp-dev-cvx/azureml/7d68a66e-aa64-4732-a159-22d477774715/wd/7d68a66e-aa64-4732-a159-22d477774715/prs_prod/lib/python3.8/site-packages/azureml_sys/parallel_run/partition_by_keys_provider.py", line 70, in get_tasks
    DatasetHelper().save(dataset)
  File "/mnt/batch/tasks/shared/LS_root/jobs/sdpesp-dev-cvx/azureml/7d68a66e-aa64-4732-a159-22d477774715/wd/azureml/7d68a66e-aa64-4732-a159-22d477774715/driver/azureml_common/parallel_run/dataset_helper.py", line 29, in save
    self.logger.info("Dump dataset {} as preppy files to local directory.".format(dataset.id))
AttributeError: 'str' object has no attribute 'id'
```

Could this possibly be an issue with ParallelRunStep being unable to accept a partitioned OutputDatasetConfig as an input? Thanks in advance!
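For context, a minimal sketch of the setup described above. This is not the poster's actual code; the workspace, compute target, environment, script names, and partition key are all hypothetical, and azureml-core / azureml-pipeline-steps (SDK v1) are assumed.

```python
# Hypothetical sketch: a data-pull step whose OutputDatasetConfig output
# feeds a ParallelRunStep with a partitioned-dataset configuration.
# Assumes `ws` (Workspace), `compute_target`, and `env` already exist.
from azureml.core import Datastore
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import (ParallelRunConfig, ParallelRunStep,
                                    PythonScriptStep)

datastore = Datastore.get(ws, "workspaceblobstore")

# Step 1: pull data. Its output is an OutputDatasetConfig, not a
# registered/saved Dataset object.
pulled_data = OutputFileDatasetConfig(
    name="pulled_data",
    destination=(datastore, "pulled_data"),
)
pull_step = PythonScriptStep(
    name="pull-data",
    script_name="pull_data.py",        # hypothetical script
    arguments=["--output", pulled_data],
    compute_target=compute_target,
)

# Step 2: ParallelRunStep consuming the intermediate output. With
# partition_keys set, PRS expects a partitioned Dataset here, and the
# run fails as in the traceback above.
parallel_run_config = ParallelRunConfig(
    source_directory=".",
    entry_script="train.py",           # hypothetical script
    partition_keys=["model_name"],     # hypothetical partition key
    compute_target=compute_target,
    environment=env,
    output_action="append_row",
    error_threshold=-1,
    node_count=2,
)
train_step = ParallelRunStep(
    name="batch-train",
    parallel_run_config=parallel_run_config,
    inputs=[pulled_data.as_input("training_data")],
)

pipeline = Pipeline(workspace=ws, steps=[pull_step, train_step])
```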



jtisbell4 avatar Nov 29 '21 15:11 jtisbell4

Having the same issue when creating a pipeline with a ParallelRunStep: an output from a previous step cannot be used as the input to the ParallelRunStep, as documented in this issue.

caitriggs avatar Dec 01 '22 22:12 caitriggs

@jtisbell4 what work around did you end up using?

caitriggs avatar Dec 02 '22 17:12 caitriggs

To consume the output as an input in subsequent pipeline steps, you may need to call as_input() https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.data.output_dataset_config.outputdatasetconfig?view=azure-ml-py

shift202 avatar Jan 05 '23 12:01 shift202
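For illustration, the suggestion above would look roughly like this. It is only a sketch: the step and script names are hypothetical, and `compute_target` is assumed to exist.

```python
# Hypothetical: pass a previous step's OutputFileDatasetConfig to a
# later step as a named input by calling as_input().
from azureml.data import OutputFileDatasetConfig
from azureml.pipeline.steps import PythonScriptStep

prepared = OutputFileDatasetConfig(name="prepared_data")

consume_step = PythonScriptStep(
    name="consume-data",
    script_name="consume.py",                       # hypothetical script
    arguments=["--input", prepared.as_input("prepared_data")],
    compute_target=compute_target,                  # assumed to exist
)
```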

> To consume the output as an input in subsequent pipeline steps, you may need to call as_input() https://learn.microsoft.com/en-us/python/api/azureml-core/azureml.data.output_dataset_config.outputdatasetconfig?view=azure-ml-py

This isn't the problem in this case. Even if you add .as_input() to the OutputTabularDatasetConfig when passing it into the subsequent ParallelRunStep, that input still needs to be an already partitioned dataset (via TabularDataset.partition_by()), but you cannot call .partition_by() on an OutputConfig-type object:


```python
inputs=[combined_scored_dataset.read_delimited_files().partition_by(
    partition_keys=['model_name'],
    target=DataPath(default_aml_env.get_datastore(), "partitioned_datasets"))],
```

```
AttributeError: 'OutputTabularDatasetConfig' object has no attribute 'partition_by'
```
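One possible workaround, not taken from this thread but consistent with the constraint that partition_by() exists only on TabularDataset: split the work into two submissions, register the first pipeline's output as a dataset, then build the ParallelRunStep pipeline against the registered TabularDataset, where partition_by() is available. All names below are hypothetical and a live Workspace `ws` is assumed.

```python
# Hypothetical two-stage workaround sketch (Azure ML SDK v1).
from azureml.core import Dataset, Datastore
from azureml.data import OutputFileDatasetConfig
from azureml.data.datapath import DataPath

datastore = Datastore.get(ws, "workspaceblobstore")

# Pipeline 1: the data-pull step registers its tabular output as a
# dataset when the run completes.
pulled = OutputFileDatasetConfig(destination=(datastore, "pulled_data"))
registered_output = pulled.read_delimited_files().register_on_complete(
    name="combined_scored")
# ... build and submit pipeline 1 with this output, then wait for it ...

# Pipeline 2: the data now exists as a registered TabularDataset, so
# partition_by() can be called before handing it to ParallelRunStep.
combined = Dataset.get_by_name(ws, "combined_scored")
partitioned = combined.partition_by(
    partition_keys=["model_name"],
    target=DataPath(datastore, "partitioned_datasets"),
)
# `partitioned` is a partitioned TabularDataset that ParallelRunStep can
# consume together with ParallelRunConfig(partition_keys=["model_name"]).
```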

caitriggs avatar Feb 11 '23 18:02 caitriggs