azure-sdk-for-python

Getting a notification when a training job started natively inside an Azure Compute VM is done

Open monajalal opened this issue 2 years ago • 3 comments

I have created a VM using the Azure Compute menu. I am running a long-running Python training script in tmux directly inside the VM, which I reach with SSH -i -X. However, since these VMs are expensive ($4 an hour), I would like the node to:

1- send me an email notification when the job is done
2- stop (not delete) the VM so I won't be charged for it.

I have turned on all the notifications as you see below: Screenshot from 2022-12-21 12-03-04

However, I have not been getting any email/text notifications for any of my Python jobs that have finished. I only get an email notification when an MLflow-related script is run, like the one below.

Screenshot from 2022-12-21 12-02-23

Thanks a lot for any guidance.

monajalal avatar Dec 21 '22 17:12 monajalal

Hi @monajalal, thank you for opening an issue! Is there a particular package from the Azure SDK that you're using in your workflow? If not, you may get a more reliable response by asking at a VM-specific forum like this Q&A page, if you haven't already.

I'll tag the ML team either way in case someone would know how to help, but I would recommend opening an issue at the forum I linked. @luigiw @azureml-github

mccoyp avatar Dec 27 '22 19:12 mccoyp

Hello @monajalal. Have you tried using compute clusters in AzureML? AzureML compute clusters automatically scale up and down based on your training job requirements, and you can subscribe to events that are raised when training finishes.

Here are some references.

AzureML compute cluster, https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-compute-cluster?tabs=python

Subscribe to when an AzureML training job finishes, https://learn.microsoft.com/en-us/azure/machine-learning/how-to-use-event-grid

luigiw avatar Dec 28 '22 18:12 luigiw
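
For readers following the Event Grid suggestion above, here is a minimal sketch of what a subscriber might do with a delivered payload. It assumes the events arrive as a JSON array in the Event Grid schema, that finished runs are reported with the event type `Microsoft.MachineLearningServices.RunCompleted`, and that failures show up as `Microsoft.MachineLearningServices.RunStatusChanged` with a `runStatus` of `Failed`. The field names `runId`, `experimentName`, and `runStatus` reflect my reading of that schema and should be checked against the linked documentation; `summarize_run_events` is a hypothetical helper, not part of any SDK.

```python
import json

# Event type names as assumed from the Azure ML Event Grid documentation.
RUN_COMPLETED = "Microsoft.MachineLearningServices.RunCompleted"
RUN_STATUS_CHANGED = "Microsoft.MachineLearningServices.RunStatusChanged"


def summarize_run_events(body):
    """Return one summary line per finished or failed run in an Event Grid delivery."""
    events = json.loads(body)
    if isinstance(events, dict):  # tolerate a single event instead of an array
        events = [events]

    summaries = []
    for event in events:
        data = event.get("data") or {}
        event_type = event.get("eventType")
        if event_type == RUN_COMPLETED:
            status = "Completed"
        elif event_type == RUN_STATUS_CHANGED and data.get("runStatus") == "Failed":
            status = "Failed"
        else:
            continue  # ignore events that do not mark the end of a run
        run_id = data.get("runId", "?")
        experiment = data.get("experimentName", "?")
        summaries.append(f"Run {run_id} ({experiment}): {status}")
    return summaries
```

Each returned line could then be forwarded by whatever channel the subscriber uses (email, chat webhook, and so on); that part is deliberately left out here.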

@luigiw for example, my job was killed because the VM ran out of disk space, and the VM then didn't shut down even though I have enabled "auto shutdown" under "Preview Features" in Azure Compute.

Also, I cannot stop the VM because its state is shown as "Unusable". I was able to shut down the VM by SSH-ing into it, but am I billed for the time the VM was Unusable? And how do I bring the VM back to a usable state?

MicrosoftTeams-image (12)

Screenshot from 2023-01-03 10-07-18

monajalal avatar Jan 03 '23 15:01 monajalal
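
Since the job in the comment above died when the disk filled up, a cheap guard is to check free space before launching training (or periodically between checkpoints). The following is only a sketch: `free_gib` and `has_free_space` are hypothetical helper names, and the threshold is something you would pick for your own checkpoints and logs.

```python
import shutil


def free_gib(path="."):
    """Free space, in GiB, on the filesystem that contains path."""
    return shutil.disk_usage(path).free / 2**30


def has_free_space(path, min_free_gib):
    """True if the filesystem holding path has at least min_free_gib GiB free."""
    return free_gib(path) >= min_free_gib


# Example use before starting a long run (50 GiB is an arbitrary placeholder):
# if not has_free_space("/", 50):
#     raise SystemExit("Not enough disk space to start training")
```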

@monajalal to clarify, is this an AzureML compute instance you're using? Looks like you need a bigger VM. You can also consider creating a dedicated training cluster. https://learn.microsoft.com/en-us/azure/machine-learning/how-to-create-attach-compute-cluster?tabs=python

luigiw avatar Jan 04 '23 21:01 luigiw

Hi, we're sending this friendly reminder because we haven't heard back from you in a while. We need more information about this issue to help address it. Please be sure to give us your input within the next 7 days. If we don't hear back from you within 14 days of this comment the issue will be automatically closed. Thank you!

ghost avatar Jan 13 '23 20:01 ghost

@xiangyan99 for this special case, I decided to run all my jobs as a pipeline on a compute cluster rather than on a compute instance node. Since I am running them as jobs, I natively get an email when a job finishes or fails. However, it remains true that a stand-alone training job (python train.py) inside an Azure compute instance does not yield any email on successful completion or failure.

monajalal avatar Jan 13 '23 20:01 monajalal
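
For the stand-alone case discussed in this thread (python train.py inside a VM, with no job service to emit notifications), one possible workaround is a small wrapper that runs the training command, emails the result, and then deallocates the VM. This is a sketch under several assumptions, not a supported Azure feature: the function names are hypothetical, the SMTP host, credentials, resource group, and VM name are placeholders, and `deallocate_vm` assumes the Azure CLI is installed and signed in on the VM with permission to run `az vm deallocate`. It deallocates rather than doing an OS-level shutdown because a VM that is stopped but still allocated generally keeps accruing compute charges. An Azure ML compute instance is stopped through a different command, so that function would need to be swapped out there.

```python
import smtplib
import subprocess
import sys
import time
from email.message import EmailMessage


def build_report(cmd, returncode, seconds, sender, recipient):
    """Build the notification email for a finished training command."""
    status = "succeeded" if returncode == 0 else f"failed (exit code {returncode})"
    msg = EmailMessage()
    msg["Subject"] = f"Training job {status}"
    msg["From"] = sender
    msg["To"] = recipient
    msg.set_content(
        f"Command: {' '.join(cmd)}\nStatus: {status}\nElapsed: {seconds:.0f} s\n"
    )
    return msg


def send_via_smtp(msg, host, port, user, password):
    """Send msg through an SMTP relay using STARTTLS (relay details are assumptions)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.starttls()
        smtp.login(user, password)
        smtp.send_message(msg)


def deallocate_vm(resource_group, vm_name):
    """Stop and deallocate the VM through the Azure CLI."""
    subprocess.run(
        ["az", "vm", "deallocate",
         "--resource-group", resource_group,
         "--name", vm_name,
         "--no-wait"],
        check=True,
    )


def run_and_notify(cmd, notify, stop_vm, sender, recipient):
    """Run cmd, email the outcome, then stop the VM even if the email fails."""
    start = time.monotonic()
    returncode = subprocess.run(cmd).returncode
    msg = build_report(cmd, returncode, time.monotonic() - start, sender, recipient)
    try:
        notify(msg)
    finally:
        stop_vm()
    return returncode


# Example wiring (all values are placeholders):
# run_and_notify(
#     [sys.executable, "train.py"],
#     notify=lambda m: send_via_smtp(m, "smtp.example.com", 587, "user", "password"),
#     stop_vm=lambda: deallocate_vm("my-resource-group", "my-vm"),
#     sender="vm@example.com",
#     recipient="me@example.com",
# )
```

Passing `notify` and `stop_vm` as callables keeps the wrapper easy to dry-run: you can substitute `print` and a no-op while testing, so a typo does not deallocate the machine mid-experiment. Running the wrapper inside tmux, as in the original setup, keeps it alive after the SSH session ends.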