input data to spark job is not passed to job
Operating System
Linux
Version Information
az --version ml 2.28.0
Steps to reproduce
- I am trying to submit a Spark job as shown in https://github.com/Azure/azureml-examples/tree/main/cli/jobs/spark
- The data has been uploaded as described here: https://github.com/Azure/azureml-examples/blob/main/cli/jobs/spark/data/README.md
- This is the yml definition (partially shown):
$schema: https://azuremlschemas.azureedge.net/latest/sparkJob.schema.json
resources:
  instance_type: standard_e4s_v3
  runtime_version: "3.3"
type: spark
conf:
  spark.driver.cores: 1
  spark.driver.memory: 2g
  spark.executor.cores: 2
  spark.executor.memory: 2g
  spark.executor.instances: 2
inputs:
  input_data_step1:
    type: uri_file
    path: azureml://datastores/workspaceblobstorex/paths/data/titanic.csv
    mode: direct
args: >-
  --input_data_step1 ${{inputs.input_data_step1}}
- The overview in AML Studio shows the correct input data, and I can navigate to it.
- Now the issue: inside my job, which runs without problems, the input data argument is expanded to something like: /mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1724844698277_0001/container_1724844698277_0001_01_000001/azureml:/subscriptions/xxxxxx/resourcegroups/weg-aml-v2/workspaces/weg-aml-v2/datastores/workspaceblobstore/paths/data/titanic.csv
  However, this path does not exist. There is no indication in the logs of what failed.
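For anyone reproducing this, a minimal sketch of how the entry script could log the raw value it receives, so the expansion shows up in the driver log. The argument name `--input_data_step1` comes from the yml above; the function names `parse_args` and `describe_argument` are my own and only for illustration.

```python
import argparse


def parse_args(argv=None):
    """Parse the single argument that the job's `args` line passes in."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--input_data_step1", required=True)
    return parser.parse_args(argv)


def describe_argument(value):
    """Return a one-line summary of the received value for the driver log.

    The URI check is a rough heuristic (an assumption, not an Azure ML rule):
    anything containing "://" or starting with "azureml:" is called a URI.
    """
    looks_like_uri = "://" in value or value.startswith("azureml:")
    kind = "URI" if looks_like_uri else "local path"
    return f"input_data_step1 received as {kind}: {value!r}"
```

Calling `print(describe_argument(parse_args().input_data_step1))` at the top of the entry script should be enough to see what the job was actually given.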
Expected behavior
The input data should be passed to the spark application, such that it can be accessed.
Actual behavior
The input data is not passed to the Spark application.
Inside the Spark application, which runs without problems, the input data argument is expanded to something like: /mnt/var/hadoop/tmp/nm-local-dir/usercache/trusted-service-user/appcache/application_1724844698277_0001/container_1724844698277_0001_01_000001/azureml:/subscriptions/xxxxxx/resourcegroups/weg-aml-v2/workspaces/weg-aml-v2/datastores/workspaceblobstore/paths/data/titanic.csv
However, this path does not exist. There is no indication in the logs of what failed.
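The expanded value looks like an `azureml:` URI appended to the container's working directory, which is what I would expect if the URI were resolved as a relative path. That is only an inference from the shape of the path, not something the logs confirm. A small diagnostic sketch that separates the two parts (the helper name `split_embedded_uri` is hypothetical, and the example path is shortened by hand):

```python
def split_embedded_uri(value, scheme="azureml:"):
    """Split `value` into (local_prefix, embedded_uri).

    Returns (value, None) when the scheme does not occur at all, and
    ("", value) when the value already starts with the scheme.
    """
    index = value.find(scheme)
    if index == -1:
        return value, None
    return value[:index], value[index:]


# Shortened, hand-written stand-in for the path reported above.
example = (
    "/mnt/var/hadoop/tmp/container_01/"
    "azureml:/subscriptions/xxxxxx/datastores/workspaceblobstore/paths/data/titanic.csv"
)
prefix, embedded = split_embedded_uri(example)
```

This does not fix anything. It only shows which part of the argument is the local working directory and which part is the original URI, and I have not checked whether Spark can read the recovered URI.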
Additional information
No response
Please note that in the yml above I am pointing to 'workspaceblobstorex' (x at the end!).
This datastore does not even exist, but there are no complaints and the job runs through.
With the correct and existing datastore 'workspaceblobstore' it behaves the same as reported above.
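Since submission apparently does not reject an unknown datastore name, a local check before submitting might catch the typo. The sketch below only parses the short-form URI used in the yml above. The set of existing datastore names is an assumption that the caller has to supply (for example, copied from the output of `az ml datastore list`), and `parse_datastore_uri` is a hypothetical helper name.

```python
import re

# Short-form pattern as used in the yml: azureml://datastores/<name>/paths/<path>
_SHORT_FORM = re.compile(
    r"^azureml://datastores/(?P<datastore>[^/]+)/paths/(?P<path>.+)$"
)


def parse_datastore_uri(uri):
    """Return (datastore_name, relative_path) for a short-form azureml URI."""
    match = _SHORT_FORM.match(uri)
    if match is None:
        raise ValueError(f"not a short-form azureml datastore URI: {uri!r}")
    return match.group("datastore"), match.group("path")


# Assumed to be filled in by hand from the workspace; not queried here.
known_datastores = {"workspaceblobstore"}

name, path = parse_datastore_uri(
    "azureml://datastores/workspaceblobstorex/paths/data/titanic.csv"
)
is_known = name in known_datastores
```

With the yml from this report, `is_known` comes out `False`, which is the complaint I would have expected from the service itself.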