azure-sdk-for-python
getting a notification when a training job is done that was started natively inside an Azure Compute VM
I created a VM using the Azure Compute menu. I am running long-running Python training code in tmux directly inside the VM, which I connect to with SSH -i -X. However, since these VMs are expensive ($4 an hour), I would like the node to:
1- send me an email notification when the job is done
2- stop (not delete) the VM so I won't be charged for it.
I have turned on all the notifications as you see below:

However, I have not received any email/text notifications for any of my Python jobs that have finished. I only get an email notification when an mlflow-related script is run, like below.

Thanks a lot for any guidance.
Hi @monajalal, thank you for opening an issue! Is there a particular package from the Azure SDK that you're using in your workflow? If not, you may get a more reliable response by asking at a VM-specific forum like this Q&A page, if you haven't already.
I'll tag the ML team either way in case someone would know how to help, but I would recommend opening an issue at the forum I linked. @luigiw @azureml-github
Hello @monajalal. Have you tried using compute clusters in AzureML? AzureML compute clusters automatically scale up and down based on your training job requirements, and you can subscribe to events that fire when training finishes.
Here are some references.
AzureML compute cluster, https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-compute-cluster?tabs=python
Subscribe to when an AzureML training job finishes, https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-event-grid
@luigiw for example, my job was killed because the VM ran out of space, and the VM then didn't shut down even though I have enabled "auto shutdown" in "Preview Features" in Azure Compute.
Also, I cannot stop the VM because its state is shown as "Unusable". I was able to shut down the VM by ssh-ing into it, but am I billed for the time the VM was Unusable? Also, how do I bring the VM back to a usable state?

@monajalal to clarify, is this an AzureML compute instance you're using? Looks like you need a bigger VM. You can also consider creating a dedicated training cluster. https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-compute-cluster?tabs=python
Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!
@xiangyan99 for this particular case, I decided to run all my jobs as a pipeline on a compute cluster rather than on a compute instance node. Since I am running them as jobs, I natively get an email when a job finishes or fails. However, it remains true that a stand-alone training job (python train.py) inside an Azure compute instance does not produce any email on successful finish or on failure.
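For anyone who still needs the stand-alone case, one possible workaround is to wrap the training command in a small script that emails the result and then deallocates the VM. The following is only a minimal sketch under several assumptions that are not from this thread: the script name (notify_and_stop.py), the environment variable names, and the SMTP server are all hypothetical; the machine is assumed to be a plain Azure VM with the Azure CLI installed and already logged in with permission to deallocate itself. An AzureML compute instance would need a different stop step.

```python
import os
import shlex
import smtplib
import subprocess
import sys
from email.message import EmailMessage


def run_job(command):
    """Run the training command, wait for it, and return its exit code."""
    return subprocess.run(command).returncode


def build_message(command, returncode, sender, recipient):
    """Build a plain-text email describing how the job ended."""
    status = "succeeded" if returncode == 0 else f"failed (exit code {returncode})"
    msg = EmailMessage()
    msg["Subject"] = f"Training job {status}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(f"Command: {shlex.join(command)}\nResult: {status}\n")
    return msg


def send_message(msg, host, port, user, password):
    """Send the email through an SMTP server that supports STARTTLS."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(user, password)
        smtp.send_message(msg)


def deallocate_vm(resource_group, vm_name):
    """Deallocate (stop without deleting) the VM via the Azure CLI."""
    subprocess.run(
        ["az", "vm", "deallocate",
         "--resource-group", resource_group,
         "--name", vm_name,
         "--no-wait"],
        check=True,
    )


def main(command):
    env = os.environ  # all variable names below are hypothetical
    returncode = run_job(command)
    try:
        msg = build_message(command, returncode,
                            env["NOTIFY_FROM"], env["NOTIFY_TO"])
        send_message(msg, env["SMTP_HOST"], int(env.get("SMTP_PORT", "587")),
                     env["SMTP_USER"], env["SMTP_PASSWORD"])
    finally:
        # Deallocate even if the email step fails, so billing stops.
        deallocate_vm(env["AZ_RESOURCE_GROUP"], env["AZ_VM_NAME"])
    return returncode


# Usage (inside tmux): python notify_and_stop.py python train.py
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv[1:]))
```

Because the wrapper reports the exit code, a job that is killed (for example after running out of disk space) would still trigger the email and the deallocation, as long as the wrapper process itself survives.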