DeepSpeed-MII
AML deployment error due to missing az cli arguments
When trying to run the AML example, e.g. bloom aml, it tries to run get_acr_name() but fails because it's missing the resource group name argument. Is there a way to pass in user arguments such as the resource group, subscription, etc.? It would also be nice to expose more arguments for the AML online endpoints such as auth_mode, e.g. we aren't allowed to use keys, only aml_tokens, in production environments. I can also imagine other deployment attributes/arguments being useful, such as instance_count or type.
[2022-12-08 10:53:37,253] [INFO] [deployment.py:87:deploy] ************* MII is using DeepSpeed Optimizations to accelerate your model *************
ERROR: the following arguments are required: --resource-group/-g, --name/-n
Examples from AI knowledge base:
https://aka.ms/cli_ref
Read more about the command in reference docs
------------------------------
Unable to obtain ACR name from Azure-CLI. Please verify that you:
- Have Azure-CLI installed (https://learn.microsoft.com/en-us/cli/azure/install-azure-cli)
- Are logged in to an active account on Azure-CLI ($az login)
- Have Azure-CLI ML plugin installed ($az extension add --name ml)
------------------------------
Traceback (most recent call last):
File "/mnt/c/Users/davidaponte/Documents/CS677-DeepLearning/deeplearning/deeplearning/deep_learning/text_to_image/deepspeed_mii/bloom560m-aml.py", line 7, in <module>
mii.deploy(task='text-generation',
File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/deployment.py", line 112, in deploy
_deploy_aml(deployment_name=deployment_name, model_name=model, version=version)
File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/deployment.py", line 124, in _deploy_aml
acr_name = mii.aml_related.utils.get_acr_name()
File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/aml_related/utils.py", line 31, in get_acr_name
raise (e)
File "/home/bambam/.pyenv/versions/deeplearning/lib/python3.9/site-packages/mii/aml_related/utils.py", line 13, in get_acr_name
acr_name = subprocess.check_output(
File "/home/bambam/.pyenv/versions/3.9.0/lib/python3.9/subprocess.py", line 420, in check_output
return run(*popenargs, stdout=PIPE, timeout=timeout, check=True,
File "/home/bambam/.pyenv/versions/3.9.0/lib/python3.9/subprocess.py", line 524, in run
raise CalledProcessError(retcode, process.args,
subprocess.CalledProcessError: Command '['az', 'ml', 'workspace', 'show', '--query', 'container_registry']' returned non-zero exit status 2.
Setup: deepspeed==0.7.6 deepspeed-mii==0.0.4 py3.9.0 Ubuntu 20.04.4 LTS (Focal Fossa)
@aponte411 currently we expect the user to set resource group and subscription from the Azure-CLI like so:
az account set --subscription "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
I agree that exposing more options and expanding the AML deployment capabilities would be nice. Let me know if you have some time to help test/debug/expand these capabilities!
Hi, I resolved this. I had an issue with torch (had to pip uninstall nvidia_cublas_cu11) and I also wasn't on a GPU VM. I managed to build the folder with deploy.sh and am deploying to a managed endpoint now.
@aponte411 - did you make any progress? I'm getting the same error in a Jupyter notebook.
deepspeed==0.8.2 deepspeed-mii==0.05+unknown python==3.8.0 Ubuntu==20.04.1
@buswrecker I can run deepspeed mii from the gpu vm but I still can't deploy, I get the same error:
subprocess.CalledProcessError: Command '['az', 'ml', 'workspace', 'show', '--query', 'container_registry']' returned non-zero exit status 2.
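The exit status in the CalledProcessError does not say what az objected to, and in a notebook the CLI's stderr may not be visible. A minimal sketch for surfacing it, assuming you are free to run the command yourself outside of MII; `run_and_report` is a hypothetical helper, not part of MII:

```python
import subprocess


def run_and_report(cmd):
    """Run cmd and return its stdout; on failure, raise with the captured stderr.

    Hypothetical helper for debugging; it is not part of MII.
    """
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Put the CLI's own error text in the exception, not just the status.
        raise RuntimeError(
            f"{' '.join(cmd)} exited with status {result.returncode}:\n"
            f"{result.stderr.strip()}"
        )
    return result.stdout


if __name__ == "__main__":
    # The same command MII runs, taken from the traceback above.
    print(run_and_report(
        ["az", "ml", "workspace", "show", "--query", "container_registry"]
    ))
```

If the message matches the first report in this thread (missing --resource-group/-g and --name/-n), the CLI has no default resource group or workspace to fall back on.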
I also could not get this working after following the instructions in the readme. The only way I could use AML was after I overrode the get_acr_name() function to return my ACR name instead of calling the az cli command. Is there a way to set a default --name argument for this command so this can be fixed and it returns the correct ACR name?
The command I'm talking about is:
["az",
"ml",
"workspace",
"show",
"--query",
"container_registry"],
Note, I also tried putting --name myworkspacename in as an argument and it just returned ------
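The override described above could be sketched as follows. This is an assumption-laden sketch, not MII's implementation: it passes the --resource-group and --name flags that the az error message asks for, and it assumes container_registry comes back as a full Azure resource ID whose last path segment is the registry name. The function names and the "my-resource-group" / "my-workspace" values are placeholders. It does not explain the "------" output mentioned above.

```python
import subprocess


def build_workspace_show_cmd(resource_group, workspace_name):
    """Build the az command with the workspace named explicitly."""
    return [
        "az", "ml", "workspace", "show",
        "--resource-group", resource_group,
        "--name", workspace_name,
        "--query", "container_registry",
        "--output", "tsv",
    ]


def parse_acr_name(container_registry_id):
    """Take the last path segment of the registry ID (assumed format)."""
    return container_registry_id.strip().strip('"').rsplit("/", 1)[-1]


def get_acr_name(resource_group, workspace_name):
    """Hypothetical replacement that takes the workspace details as arguments."""
    output = subprocess.check_output(
        build_workspace_show_cmd(resource_group, workspace_name), text=True
    )
    return parse_acr_name(output)


if __name__ == "__main__":
    # Placeholder values; substitute your own resource group and workspace.
    print(get_acr_name("my-resource-group", "my-workspace"))
```

Taking the resource group and workspace as parameters avoids depending on whatever defaults happen to be configured in the Azure CLI on the machine running the deployment.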
I'm facing the same problem on a GPU VM.
Maybe adding shell=True will resolve this problem?
acr_name = subprocess.check_output(
["az",
"ml",
"workspace",
"show",
"--query",
"container_registry"],
text=True, shell=True)
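One caveat with this suggestion: on POSIX, when shell=True is combined with a list, only the first item is treated as the command string and the remaining items become arguments to the shell itself. The call above would therefore run a bare az with none of its arguments. A small sketch of that behaviour, using echo as a stand-in for az and assuming a POSIX shell (Windows behaves differently):

```python
import subprocess


def run_without_shell(args):
    """Every list item is passed to the program as an argument."""
    return subprocess.check_output(args, text=True)


def run_with_shell_list(args):
    """With shell=True on POSIX, only args[0] is run as the command;
    the remaining items become the shell's own positional parameters."""
    return subprocess.check_output(args, text=True, shell=True)


if __name__ == "__main__":
    # repr() makes the difference in output visible, including newlines.
    print(repr(run_without_shell(["echo", "hello"])))
    print(repr(run_with_shell_list(["echo", "hello"])))
```

To use shell=True, the whole command would have to be passed as a single string, and it would still need the resource group and workspace name.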