
ML Pipelines DatabricksStep doesn't support Run.get_context()

Open RothNRK opened this issue 4 years ago • 18 comments

While using a DatabricksStep I want to get the appropriate Run context so that I can log information to Azure ML.

I thought this might work:

  1. Authenticate with AzureMLTokenAuthentication
  2. Get an authenticated Workspace -> Experiment -> Run
  3. Use the Run.get_context()

However, it turns out I can't use AzureMLTokenAuthentication to authenticate the Workspace.
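
Roughly, the failing attempt looked like this (a sketch, assuming the documented AzureMLTokenAuthentication.create signature; the AZUREML_* values come from the arguments the DatabricksStep passes to the script):

import os
from azureml.core import Workspace
from azureml.core.authentication import AzureMLTokenAuthentication

# Values come from the AZUREML_* script arguments, copied into os.environ.
auth = AzureMLTokenAuthentication.create(
    azureml_access_token=os.environ["AZUREML_RUN_TOKEN"],
    expiry_time=None,
    host=os.environ["AZUREML_SERVICE_ENDPOINT"],
    subscription_id=os.environ["AZUREML_ARM_SUBSCRIPTION"],
    resource_group_name=os.environ["AZUREML_ARM_RESOURCEGROUP"],
    workspace_name=os.environ["AZUREML_ARM_WORKSPACE_NAME"],
    experiment_name=os.environ["AZUREML_ARM_PROJECT_NAME"],
    run_id=os.environ["AZUREML_RUN_ID"],
)

# Step 2 is where it breaks: Workspace.get rejects this auth object.
ws = Workspace.get(
    name=os.environ["AZUREML_ARM_WORKSPACE_NAME"],
    subscription_id=os.environ["AZUREML_ARM_SUBSCRIPTION"],
    resource_group=os.environ["AZUREML_ARM_RESOURCEGROUP"],
    auth=auth,
)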

Can you please clarify how Azure intends AzureMLTokenAuthentication to be used?

The only way I've been able to get a run_context from within Databricks during a DatabricksStep is:

import os
from azureml.core import Run

# The AZUREML_* values below arrive as arguments that the DatabricksStep
# passes to the script; copying them into the environment lets
# Run.get_context() find the scope it needs.
os.environ['AZUREML_RUN_TOKEN'] = AZUREML_RUN_TOKEN
os.environ['AZUREML_RUN_TOKEN_EXPIRY'] = AZUREML_RUN_TOKEN_EXPIRY
os.environ['AZUREML_RUN_ID'] = AZUREML_RUN_ID
os.environ['AZUREML_ARM_SUBSCRIPTION'] = AZUREML_ARM_SUBSCRIPTION
os.environ['AZUREML_ARM_RESOURCEGROUP'] = AZUREML_ARM_RESOURCEGROUP
os.environ['AZUREML_ARM_WORKSPACE_NAME'] = AZUREML_ARM_WORKSPACE_NAME
os.environ['AZUREML_ARM_PROJECT_NAME'] = AZUREML_ARM_PROJECT_NAME
os.environ['AZUREML_SERVICE_ENDPOINT'] = AZUREML_SERVICE_ENDPOINT

run = Run.get_context(allow_offline=False)

This feels like a hack. How does Azure suggest this should be done?

Any help is appreciated.

Noel



RothNRK avatar Jun 17 '20 06:06 RothNRK

@RothNRK The sample notebook for the authentication steps that can be used is available here. CLI or service principal authentication can be used to get your workspace details.

@aashishb Could you please advise?

RohitMungi-MSFT avatar Jun 17 '20 10:06 RohitMungi-MSFT

@RohitMungi-MSFT Thank you for your response.

CLI Auth doesn't look like it meets my needs.

Using a Service Principal would work but it requires:

  1. A Service Principal.
  2. A Key Vault mounted as a secret scope in Databricks (see the sketch after this list). This would be fine if it were my only option, but Azure ML already sends enough information to make a secure connection and log to Azure ML using the method I outlined above (which is a bit of a hack, but certainly less effort than the service principal method).
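
For reference, the service principal route would look roughly like this (a sketch; all scope, secret, and workspace names are placeholders, and dbutils is the ambient Databricks utility object):

from azureml.core import Workspace
from azureml.core.authentication import ServicePrincipalAuthentication

# Secret scope "azureml-scope" is assumed to be backed by the Key Vault.
sp_auth = ServicePrincipalAuthentication(
    tenant_id=dbutils.secrets.get("azureml-scope", "tenant-id"),
    service_principal_id=dbutils.secrets.get("azureml-scope", "sp-client-id"),
    service_principal_password=dbutils.secrets.get("azureml-scope", "sp-client-secret"),
)
ws = Workspace.get(
    name="my-workspace",
    subscription_id=dbutils.secrets.get("azureml-scope", "subscription-id"),
    resource_group="my-resource-group",
    auth=sp_auth,
)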

How do we meaningfully use AzureMLTokenAuthentication? (The notebook you linked doesn't include an example of it.)

Thank you

RothNRK avatar Jun 17 '20 12:06 RothNRK

@RothNRK AzureMLTokenAuthentication is intended as an internal Azure ML API.

Run.get_context() should simply work when you're within the context of an Azure ML run, without any extra authentication required.
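
For example, inside a normally submitted run (say, a PythonScriptStep) this is all that's needed:

from azureml.core import Run

# Inside a submitted Azure ML run this resolves to the live run;
# outside of one it falls back to an _OfflineRun.
run = Run.get_context()
run.log("my_metric", 1.0)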

rastala avatar Jun 19 '20 00:06 rastala

@rastala Thank you. That is what I expected, but inside a DatabricksStep it doesn't seem to simply work.

Run.get_context(allow_offline=False) produces:

KeyError                                  Traceback (most recent call last)
/databricks/python/lib/python3.7/site-packages/azureml/core/run.py in _load_scope(cls)
    215             # Load authentication scope environment variables
--> 216             subscription_id = os.environ['AZUREML_ARM_SUBSCRIPTION']
    217             run_id = os.environ["AZUREML_RUN_ID"]

/local_disk0/pythonVirtualEnvDirs/virtualEnv-6c0c6c97-5068-4f34-91f1-ded12db18057/lib/python3.7/os.py in __getitem__(self, key)
    677             # raise KeyError with the original key value
--> 678             raise KeyError(key) from None
    679         return self.decodevalue(value)

KeyError: 'AZUREML_ARM_SUBSCRIPTION'

The above exception was the direct cause of the following exception:

RunEnvironmentException                   Traceback (most recent call last)
/databricks/python/lib/python3.7/site-packages/azureml/core/run.py in get_context(cls, allow_offline, used_for_context_manager, **kwargs)
    291         try:
--> 292             experiment, run_id = cls._load_scope()
    293 

/databricks/python/lib/python3.7/site-packages/azureml/core/run.py in _load_scope(cls)
    232         except KeyError as key_error:
--> 233             raise_from(RunEnvironmentException(), key_error)
    234         else:

/databricks/python/lib/python3.7/site-packages/six.py in raise_from(value, from_value)

RunEnvironmentException: RunEnvironmentException:
	Message: Could not load a submitted run, if outside of an execution context, use experiment.start_logging to initialize an azureml.core.Run.
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "Could not load a submitted run, if outside of an execution context, use experiment.start_logging to initialize an azureml.core.Run."
    }
}

During handling of the above exception, another exception occurred:

RunEnvironmentException                   Traceback (most recent call last)
<command--1> in <module>
     13 
     14 with open(filename, "rb") as f:
---> 15   exec(f.read())
     16 

<string> in <module>

<string> in main()

<string> in _main(args)

<string> in _get_run_context(args)

<string> in _get_run()

/databricks/python/lib/python3.7/site-packages/azureml/core/run.py in get_context(cls, allow_offline, used_for_context_manager, **kwargs)
    303             else:
    304                 module_logger.debug("Could not load the run context and allow_offline set to False")
--> 305                 raise RunEnvironmentException(inner_exception=ex)
    306 
    307     @classmethod

RunEnvironmentException: RunEnvironmentException:
	Message: Could not load a submitted run, if outside of an execution context, use experiment.start_logging to initialize an azureml.core.Run.
	InnerException RunEnvironmentException:
	Message: Could not load a submitted run, if outside of an execution context, use experiment.start_logging to initialize an azureml.core.Run.
	InnerException None
	ErrorResponse 
{
    "error": {
        "message": "Could not load a submitted run, if outside of an execution context, use experiment.start_logging to initialize an azureml.core.Run."
    }
}
	ErrorResponse 
{
    "error": {
        "message": "Could not load a submitted run, if outside of an execution context, use experiment.start_logging to initialize an azureml.core.Run."
    }
}

So it looks like it's looking for environment variables that aren't set. If I run it with Run.get_context() I get an error related to the _OfflineRun (which makes sense, since I need a Run, not an _OfflineRun).

RothNRK avatar Jun 23 '20 09:06 RothNRK

I am facing the same issue as @RothNRK describes above. Further, I need the parent pipeline run ID from the Run. If I run it with Run.get_context() I get an error with the _OfflineRun. Could you please point to an example of this?

Thanks!

swathi-intelligent avatar Jun 28 '20 06:06 swathi-intelligent

Just a friendly reminder that this ticket still exists.

RothNRK avatar Jul 23 '20 15:07 RothNRK

@rastala, please check and create an ICM for the pipelines team if needed. Thanks!

aashishb avatar Jul 23 '20 15:07 aashishb

I'm facing the exact same problem as the original poster of this issue. Still couldn't get it to work.

brunocous avatar Aug 11 '20 10:08 brunocous

@RothNRK Thank you for pointing this out. I have created a work item to investigate the issue. We will update you shortly.

shbijlan avatar Sep 02 '20 20:09 shbijlan

@RothNRK you will need to set up environment variables from the parameters passed to your script before you can call Run.get_context(). This is not a hack but the way this is designed to work.
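
In other words, something along these lines (a compact sketch, not the exact generated code):

import argparse
import os
from azureml.core import Run

# Each AZUREML_* value arrives as a --AZUREML_* script argument.
parser = argparse.ArgumentParser()
for name in ('AZUREML_RUN_TOKEN', 'AZUREML_RUN_TOKEN_EXPIRY', 'AZUREML_RUN_ID',
             'AZUREML_ARM_SUBSCRIPTION', 'AZUREML_ARM_RESOURCEGROUP',
             'AZUREML_ARM_WORKSPACE_NAME', 'AZUREML_ARM_PROJECT_NAME',
             'AZUREML_SERVICE_ENDPOINT'):
    parser.add_argument(f'--{name}')
args, _ = parser.parse_known_args()  # tolerate any extra arguments

# Copy the parsed values into the environment so _load_scope can find them.
for name, value in vars(args).items():
    if value is not None:
        os.environ[name] = value

run = Run.get_context(allow_offline=False)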

fahdkmsft avatar Sep 12 '20 00:09 fahdkmsft

@fahdkmsft It's unclear that this is the intended usage, given that the only documentation related to these variables is under DatabricksStep -> python_script_name, and it makes no mention of setting them in order to get a Run through Run.get_context().

I apologize if I missed some docs that explains this but if I have can you please post a link to them?

I'd also like to point out that no one has answered my question about the intended use of AzureMLTokenAuthentication. A DatabricksStep provides all of the information needed to instantiate this class, but you can't authenticate a Workspace with it.
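
For concreteness, the pipeline side of my setup looks roughly like this (a sketch; all names and the cluster id are placeholders, and ws and databricks_compute are assumed to exist already):

from azureml.core import Experiment
from azureml.pipeline.core import Pipeline
from azureml.pipeline.steps import DatabricksStep

db_step = DatabricksStep(
    name="my_databricks_step",
    compute_target=databricks_compute,        # an attached DatabricksCompute
    python_script_name="train.py",            # triggers the AZUREML_* arguments
    source_directory="./scripts",
    existing_cluster_id="0000-000000-abcdef", # or num_workers=... for a job cluster
    allow_reuse=False,
)
pipeline = Pipeline(workspace=ws, steps=[db_step])
pipeline_run = Experiment(ws, "databricks-step-demo").submit(pipeline)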

RothNRK avatar Sep 18 '20 13:09 RothNRK

@RothNRK The documentation for the Run class is common to all types of runs, including pipeline runs, HyperDrive runs, and AutoML runs. Please refer to: https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.run(class)?view=azure-ml-py

For AzureMLTokenAuthentication, please refer to https://docs.microsoft.com/en-us/python/api/azureml-core/azureml.core.authentication.azuremltokenauthentication?view=azure-ml-py

shbijlan avatar Nov 11 '20 18:11 shbijlan

I am having the same issue as @RothNRK. I've created a pipeline with a DatabricksStep and I am unable to use Run.get_context() to even instantiate the run.

I'm using the following snippet of code, similar to RothNRK's, and I can verify that each parameter is being set.

import os

# This argument parsing is necessary for getting the run context.
# dbutils is available implicitly in a Databricks notebook.
parameters = ['AZUREML_RUN_TOKEN', 'AZUREML_RUN_TOKEN_EXPIRY', 'AZUREML_RUN_ID', 'AZUREML_ARM_SUBSCRIPTION',
              'AZUREML_ARM_RESOURCEGROUP', 'AZUREML_ARM_WORKSPACE_NAME', 'AZUREML_ARM_PROJECT_NAME',
              'AZUREML_SERVICE_ENDPOINT']
for param in parameters:
    temp_value = dbutils.widgets.get(f"--{param}")
    print(f"Working on: {param} with value {temp_value}")
    os.environ[param] = temp_value
    print(f"Checking that it's set: {os.environ[param]}")

However, I'm noticing that the _load_scope method of the Run class also reads an experiment id (via os.environ.get, so a missing value only triggers a warning). Is this required, and should it be passed in as part of the parameters / widgets provided by Azure ML?

@classmethod
def _load_scope(cls):
    """Load the current context from the environment.

    :return: experiment, run_id, url
    :rtype: azureml.core.Experiment, str, str
    """
    from .authentication import AzureMLTokenAuthentication
    from .experiment import Experiment
    from .workspace import Workspace

    try:
        # Load authentication scope environment variables
        subscription_id = os.environ['AZUREML_ARM_SUBSCRIPTION']
        run_id = os.environ["AZUREML_RUN_ID"]
        resource_group = os.environ["AZUREML_ARM_RESOURCEGROUP"]
        workspace_name = os.environ["AZUREML_ARM_WORKSPACE_NAME"]
        experiment_name = os.environ["AZUREML_ARM_PROJECT_NAME"]
        experiment_id = os.environ.get("AZUREML_EXPERIMENT_ID")
        workspace_id = os.environ.get("AZUREML_WORKSPACE_ID")

        if experiment_id is None:
            module_logger.warning("experiment_id cannot be found in env variable.")
        # Initialize an AMLToken auth, authorized for the current run
        token, token_expiry_time = AzureMLTokenAuthentication._get_initial_token_and_expiry()

Unfortunately, this is a poorly documented feature :-(

wjohnson avatar Nov 28 '20 15:11 wjohnson

Is there any update on this issue, @lostmygithubaccount?

anirbansaha96 avatar Dec 15 '20 15:12 anirbansaha96

@shbijlan from the Pipelines team is discussing this with engineering and will update this thread soon.

Longer term, we are looking to have consistent support for Spark jobs across Databricks, Synapse, and others, but this will not be available for a while.

lostmygithubaccount avatar Dec 15 '20 22:12 lostmygithubaccount

Sorry for the delayed response. The solution is to parse the script arguments and set the corresponding environment variables to access the run context from within Databricks. Here is a code sample:

from azureml.core import Run
import argparse
import os

def populate_environ():
    parser = argparse.ArgumentParser(description='Process arguments passed to script')
    parser.add_argument('--AZUREML_SCRIPT_DIRECTORY_NAME')
    parser.add_argument('--AZUREML_RUN_TOKEN')
    parser.add_argument('--AZUREML_RUN_TOKEN_EXPIRY')
    parser.add_argument('--AZUREML_RUN_ID')
    parser.add_argument('--AZUREML_ARM_SUBSCRIPTION')
    parser.add_argument('--AZUREML_ARM_RESOURCEGROUP')
    parser.add_argument('--AZUREML_ARM_WORKSPACE_NAME')
    parser.add_argument('--AZUREML_ARM_PROJECT_NAME')
    parser.add_argument('--AZUREML_SERVICE_ENDPOINT')

    args = parser.parse_args()
    os.environ['AZUREML_SCRIPT_DIRECTORY_NAME'] = args.AZUREML_SCRIPT_DIRECTORY_NAME
    os.environ['AZUREML_RUN_TOKEN'] = args.AZUREML_RUN_TOKEN
    os.environ['AZUREML_RUN_TOKEN_EXPIRY'] = args.AZUREML_RUN_TOKEN_EXPIRY
    os.environ['AZUREML_RUN_ID'] = args.AZUREML_RUN_ID
    os.environ['AZUREML_ARM_SUBSCRIPTION'] = args.AZUREML_ARM_SUBSCRIPTION
    os.environ['AZUREML_ARM_RESOURCEGROUP'] = args.AZUREML_ARM_RESOURCEGROUP
    os.environ['AZUREML_ARM_WORKSPACE_NAME'] = args.AZUREML_ARM_WORKSPACE_NAME
    os.environ['AZUREML_ARM_PROJECT_NAME'] = args.AZUREML_ARM_PROJECT_NAME
    os.environ['AZUREML_SERVICE_ENDPOINT'] = args.AZUREML_SERVICE_ENDPOINT

populate_environ()
run = Run.get_context(allow_offline=False)
print(run._run_dto["parent_run_id"])
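
Once the run resolves, the usual Run APIs work; for example:

run.log("my_metric", 0.5)  # logs to the step's run in Azure ML studio
print(run.parent.id)       # run.parent should give the parent pipeline run
                           # as a Run object, equivalent to the DTO field above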

irwang avatar Sep 10 '21 18:09 irwang

Hi to all,

I have just followed your suggestion using the populate_environ() block, running it on an interactive cluster:

run = Run.get_context(allow_offline=False)

but Databricks is not enjoying this part, failing with the following error:

/tmp/tmpk0eno6d8.py in <module>
     52 populate_environ()
     53
---> 54 run = Run.get_context(allow_offline=False)
     55
     56 #print(run._run_dto["parent_run_id"])

/databricks/python/lib/python3.8/site-packages/azureml/core/run.py in get_context(cls, allow_offline, used_for_context_manager, **kwargs)
    365             if used_for_context_manager:
    366                 return _SubmittedRun(experiment, run_id, **kwargs)
--> 367             return _SubmittedRun._get_instance(experiment, run_id, **kwargs)
    368         except RunEnvironmentException as ex:
    369             module_logger.debug("Could not load run context %s, switching offline: %s", ex, allow_offline)

/databricks/python/lib/python3.8/site-packages/azureml/core/run.py in _get_instance(experiment, run_id, **kwargs)
   2290         run = _SubmittedRun.__instances.get(arm_scope_with_run_id)
   2291         if run is None:
-> 2292             run = _SubmittedRun(experiment, run_id, **kwargs)
   2293             _SubmittedRun.__instances[arm_scope_with_run_id] = run
   2294         return run

/databricks/python/lib/python3.8/site-packages/azureml/core/run.py in __init__(self, *args, **kwargs)
   2295
   2296     def __init__(self, *args, **kwargs):
-> 2297         super(_SubmittedRun, self).__init__(*args, **kwargs)
   2298         self._input_datasets = None
   2299         self._output_datasets = None

/databricks/python/lib/python3.8/site-packages/azureml/core/run.py in __init__(self, experiment, run_id, outputs, **kwargs)
    171
    172         """
--> 173         super(Run, self).__init__(experiment, run_id, outputs=outputs, **kwargs)
    174         self._parent_run = None
    175

/databricks/python/lib/python3.8/site-packages/azureml/_run_impl/run_base.py in __init__(self, experiment, run_id, outputs, logs, _run_dto, _worker_pool, _user_agent, _ident, _batch_upload_metrics, py_wd, deny_list, flush_eager, redirect_output_stream, **kwargs)
     81             raise
     82
---> 83         py_wd = get_py_wd() if py_wd is None else py_wd
     84
     85         self._client = RunHistoryFacade(self._experiment, self._run_id, RUN_ORIGIN, run_dto=_run_dto,

/databricks/python/lib/python3.8/site-packages/azureml/history/_tracking.py in get_py_wd()
    302
    303 def get_py_wd():
--> 304     return PythonWorkingDirectory.get()
    305
    306

/databricks/python/lib/python3.8/site-packages/azureml/history/_tracking.py in get(cls)
    284                 logger.debug("Adding SparkDFS")
    285                 from azureml._history.utils.filesystem import SparkDFS
--> 286                 spark_dfs = SparkDFS("spark_dfs", logger)
    287                 fs_list.append(spark_dfs)
    288                 logger.debug("Added SparkDFS")

/databricks/python/lib/python3.8/site-packages/azureml/_history/utils/filesystem.py in __init__(self, ident, logger)
    112
    113         self.spark = SparkSession.builder.getOrCreate()
--> 114         config = self.spark._sc._jsc.hadoopConfiguration()
    115
    116         dfs_cwd = self.spark._sc._gateway.jvm.org.apache.hadoop.fs.Path(".")

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1302
   1303         answer = self.gateway_client.send_command(command)
-> 1304         return_value = get_return_value(
   1305             answer, self.gateway_client, self.target_id, self.name)
   1306

/databricks/spark/python/pyspark/sql/utils.py in deco(*a, **kw)
    108     def deco(*a, **kw):
    109         try:
--> 110             return f(*a, **kw)
    111         except py4j.protocol.Py4JJavaError as e:
    112             converted = convert_exception(e.java_exception)

/databricks/spark/python/lib/py4j-0.10.9-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    328                 format(target_id, ".", name), value)
    329             else:
--> 330                 raise Py4JError(
    331                     "An error occurred while calling {0}{1}{2}. Trace:\n{3}\n".
    332                     format(target_id, ".", name, value))

Py4JError: An error occurred while calling o238.hadoopConfiguration. Trace:
py4j.security.Py4JSecurityException: Method public org.apache.hadoop.conf.Configuration org.apache.spark.api.java.JavaSparkContext.hadoopConfiguration() is not whitelisted on class class org.apache.spark.api.java.JavaSparkContext
	at py4j.security.WhitelistingPy4JSecurityManager.checkCall(WhitelistingPy4JSecurityManager.java:473)
	at py4j.Gateway.invoke(Gateway.java:294)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:251)
	at java.lang.Thread.run(Thread.java:748)

Any Idea ?

I am working with a DatabricksStep running a local Python script.

kamakay avatar Nov 13 '21 14:11 kamakay

Hi all, is there any update on this issue? I followed the suggestion, but it does not work from a Databricks job cluster launched by a DatabricksStep running a local Python script.

@kamakay or @RothNRK, did you guys find any option to use the Run class from the python script in Databricks?

Thanks in advance. BR.

cfespinoza avatar Dec 12 '23 04:12 cfespinoza