cluster-api-provider-azure Enable configuration and disabling of boot diagnostics

/kind feature

Describe the solution you'd like [A clear and concise description of what you want to happen.]

Currently, as per the implementation from #901, boot diagnostics are enabled by default and use an Azure-managed storage account. This storage account limits the boot logs to 1 GB per machine and costs $0.05 per GB per month.

You can also use your own storage account with boot diagnostics, which allows you to set custom retention policies for the data and gives you more access to the underlying files being created, should you need it.

I would like to propose that we add configuration options so that users can configure their own storage account if they need to, and can also disable the diagnostics if they are not interested in them.

In large deployments, e.g. a managed service running thousands of Machines, this data is costly and likely not always required (how often do we expect boot failures?), so I think it would be good to have the choice to disable the diagnostics to save money.

I would expect an API something along the lines of the following, but I'm happy to discuss if others have suggestions:

diagnostics:
  boot:
    storageAccountType: AzureManaged | CustomerManaged | Disabled # defaults to AzureManaged for backwards compatibility
    customerManaged: # This is only valid to be set when the account type is CustomerManaged (a discriminated union)
      storageAccountName: <blah>
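
As a rough illustration of the discriminated union proposed above, the following Go sketch models the YAML as types plus a validation function. All type, constant, and function names here are hypothetical and are not taken from the CAPZ codebase; treating an empty storageAccountType as AzureManaged is an assumption based on the stated backwards-compatible default.

```go
package main

import (
	"errors"
	"fmt"
)

// BootDiagnosticsStorageAccountType is a hypothetical discriminator for the union.
type BootDiagnosticsStorageAccountType string

const (
	AzureManagedStorage    BootDiagnosticsStorageAccountType = "AzureManaged"
	CustomerManagedStorage BootDiagnosticsStorageAccountType = "CustomerManaged"
	DisabledStorage        BootDiagnosticsStorageAccountType = "Disabled"
)

// CustomerManagedBootDiagnostics holds the settings that only apply to the
// CustomerManaged variant (hypothetical shape, mirroring the YAML above).
type CustomerManagedBootDiagnostics struct {
	StorageAccountName string `json:"storageAccountName"`
}

// BootDiagnostics mirrors the proposed diagnostics.boot block.
type BootDiagnostics struct {
	StorageAccountType BootDiagnosticsStorageAccountType `json:"storageAccountType,omitempty"`
	CustomerManaged    *CustomerManagedBootDiagnostics   `json:"customerManaged,omitempty"`
}

// ValidateBootDiagnostics enforces the union rule: customerManaged must be set
// if and only if storageAccountType is CustomerManaged.
func ValidateBootDiagnostics(b BootDiagnostics) error {
	switch b.StorageAccountType {
	case "", AzureManagedStorage, DisabledStorage:
		// An empty type is assumed to default to AzureManaged.
		if b.CustomerManaged != nil {
			return errors.New("customerManaged may only be set when storageAccountType is CustomerManaged")
		}
	case CustomerManagedStorage:
		if b.CustomerManaged == nil || b.CustomerManaged.StorageAccountName == "" {
			return errors.New("customerManaged.storageAccountName is required when storageAccountType is CustomerManaged")
		}
	default:
		return fmt.Errorf("unsupported storageAccountType %q", b.StorageAccountType)
	}
	return nil
}

func main() {
	ok := BootDiagnostics{
		StorageAccountType: CustomerManagedStorage,
		CustomerManaged:    &CustomerManagedBootDiagnostics{StorageAccountName: "mybootdiag"},
	}
	bad := BootDiagnostics{StorageAccountType: DisabledStorage, CustomerManaged: ok.CustomerManaged}
	fmt.Println("valid config error:", ValidateBootDiagnostics(ok))
	fmt.Println("invalid config error:", ValidateBootDiagnostics(bad))
}
```

In a real provider this check would more likely live in an admission webhook or CRD validation rule; the plain function is only meant to show the rule itself.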

Happy to volunteer to implement this if this is something maintainers feel comfortable supporting.

Anything else you would like to add: [Miscellaneous information that will assist in solving the issue.]

Environment:

cluster-api-provider-azure version:
Kubernetes version: (use kubectl version):
OS (e.g. from /etc/os-release):

Jun 20 '22 12:06 JoelSpeed

No objections on my part to making this configurable and enabled by default to preserve backward compatibility.

For the API, we might want to have storageUri instead of storage account name since that's what the Azure APIs use. I would also avoid the terms "CustomerManaged" and "AzureManaged" and instead stick with the official terms, i.e. Managed and UserManaged (see https://docs.microsoft.com/en-us/azure/virtual-machines/boot-diagnostics).

The way the Azure API works is as follows, for reference:

// DiagnosticsProfile specifies the boot diagnostic settings state. <br><br>Minimum api-version:
// 2015-06-15.
type DiagnosticsProfile struct {
	// BootDiagnostics - Boot Diagnostics is a debugging feature which allows you to view Console Output and Screenshot to diagnose VM status. <br>**NOTE**: If storageUri is being specified then ensure that the storage account is in the same region and subscription as the VM. <br><br> You can easily view the output of your console log. <br><br> Azure also enables you to see a screenshot of the VM from the hypervisor.
	BootDiagnostics *BootDiagnostics `json:"bootDiagnostics,omitempty"`
}

// BootDiagnostics boot Diagnostics is a debugging feature which allows you to view Console Output and
// Screenshot to diagnose VM status. <br><br> You can easily view the output of your console log. <br><br>
// Azure also enables you to see a screenshot of the VM from the hypervisor.
type BootDiagnostics struct {
	// Enabled - Whether boot diagnostics should be enabled on the Virtual Machine.
	Enabled *bool `json:"enabled,omitempty"`
	// StorageURI - Uri of the storage account to use for placing the console output and screenshot. <br><br>If storageUri is not specified while enabling boot diagnostics, managed storage will be used.
	StorageURI *string `json:"storageUri,omitempty"`
}
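
To make the relationship between the three proposed modes and these two Azure fields concrete, here is a minimal Go sketch. The `azureBootDiagnostics` type is a local stand-in for the SDK struct quoted above, and `toAzureBootDiagnostics` and its string mode values are hypothetical; the mapping relies only on the documented behavior that omitting storageUri while enabled selects managed storage.

```go
package main

import (
	"errors"
	"fmt"
)

// azureBootDiagnostics is a local stand-in for the SDK's BootDiagnostics struct.
type azureBootDiagnostics struct {
	Enabled    *bool
	StorageURI *string
}

// toAzureBootDiagnostics maps a hypothetical mode ("Managed", "UserManaged",
// "Disabled") and an optional storage URI onto the Azure API shape.
func toAzureBootDiagnostics(mode, storageURI string) (*azureBootDiagnostics, error) {
	enabled := true
	switch mode {
	case "", "Managed":
		// Enabled with no storage URI: Azure falls back to managed storage.
		return &azureBootDiagnostics{Enabled: &enabled}, nil
	case "UserManaged":
		if storageURI == "" {
			return nil, errors.New("storageURI is required for UserManaged boot diagnostics")
		}
		return &azureBootDiagnostics{Enabled: &enabled, StorageURI: &storageURI}, nil
	case "Disabled":
		enabled = false
		return &azureBootDiagnostics{Enabled: &enabled}, nil
	default:
		return nil, fmt.Errorf("unknown boot diagnostics mode %q", mode)
	}
}

func main() {
	for _, mode := range []string{"Managed", "UserManaged", "Disabled"} {
		d, err := toAzureBootDiagnostics(mode, "https://example.blob.core.windows.net/")
		if err != nil {
			fmt.Println(err)
			continue
		}
		fmt.Printf("%s: enabled=%t hasStorageURI=%t\n", mode, *d.Enabled, d.StorageURI != nil)
	}
}
```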

Jun 20 '22 15:06 CecileRobertMichon

That makes sense to me. On the StorageURI front, the only question I have is this: since https://<account-name>.blob.core.windows.net/ is a common format, I was thinking we could make it easier for users by doing that string substitution for them, rather than having them provide the full URI. Is that not something you would normally do in CAPZ?
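
The substitution being asked about could look roughly like the Go sketch below. The function name `defaultBlobEndpoint` is hypothetical, and the sketch assumes the default public-cloud blob endpoint format quoted above, so it would not cover any other endpoint format. The name check reflects the Azure rule that storage account names are 3 to 24 lowercase letters and digits.

```go
package main

import (
	"fmt"
	"regexp"
)

// accountNameRE encodes the storage account naming rule:
// 3 to 24 characters, lowercase letters and digits only.
var accountNameRE = regexp.MustCompile(`^[a-z0-9]{3,24}$`)

// defaultBlobEndpoint builds the blob endpoint URI for an account name,
// assuming the default https://<account-name>.blob.core.windows.net/ format.
func defaultBlobEndpoint(accountName string) (string, error) {
	if !accountNameRE.MatchString(accountName) {
		return "", fmt.Errorf("invalid storage account name %q", accountName)
	}
	return fmt.Sprintf("https://%s.blob.core.windows.net/", accountName), nil
}

func main() {
	uri, err := defaultBlobEndpoint("mybootdiag")
	fmt.Println(uri, err)
}
```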

Jun 20 '22 15:06 JoelSpeed

It would be something we could do if we were confident no other formats were possible, but given https://docs.microsoft.com/en-us/azure/storage/common/storage-account-overview#azure-dns-zone-endpoints-preview, that may already not be true.

Jun 20 '22 15:06 CecileRobertMichon

Cool, I'll look into working up a PR over the next few days

Jun 20 '22 16:06 JoelSpeed

/assign @JoelSpeed

Jun 21 '22 22:06 CecileRobertMichon

Following up from the discussion, here is a PR that implements the agreed design: https://github.com/kubernetes-sigs/cluster-api-provider-azure/pull/2528

Aug 02 '22 17:08 damdo

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

After 90d of inactivity, lifecycle/stale is applied
After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

Mark this issue or PR as fresh with /remove-lifecycle stale
Mark this issue or PR as rotten with /lifecycle rotten
Close this issue or PR with /close
Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

Oct 31 '22 18:10 k8s-triage-robot

/remove-lifecycle stale

This is waiting on review

Nov 01 '22 20:11 JoelSpeed
