Alerting at Scale in Azure (Again)

So, you want to alert at scale in Azure. Come grab a seat and let me regale you with the pitfalls of resource types, subscription & resource group scopes, and Log Analytics.

I have previously talked about alerting at scale in this post, and how to do it with PowerShell in this post. Quick recap: alerting at scale means being able to create an alert *for all resource types* at the subscription level, regardless of region. Once you scope it any narrower than that, it becomes unwieldy, because you end up setting alerts at the resource group level, for every resource group.

IaaS Alerting

Log Analytics

Let us talk a minute about Log Analytics. Log Analytics is still my recommended way to create alerts at scale for IaaS. For other resource types, it depends. For IaaS we can surface most of the metrics most companies want to alert on. In general the ones I always get asked for are:

  • High CPU
  • High RAM usage
  • Low Disk Space
  • Heartbeat/Availability

We can do all of these by using the Log Analytics workspace as the resource type and selecting Computer as our dimension. This means we can create 4 alerts to cover all VMs that report to that workspace.
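As a rough illustration of what those four alerts might look like, here is a sketch that builds one log query per signal, each summarized by Computer so a single rule covers every VM reporting to the workspace. It assumes the classic Windows performance counters are landing in the Perf table and that agents report to the Heartbeat table; the thresholds and the `perf_query` helper are made up for the example, not recommendations.

```python
# Illustrative only: thresholds and helper names are assumptions, not values from this post.
# Counter names assume classic Windows performance counters collected into the Perf table.

def perf_query(object_name: str, counter_name: str, comparison: str, threshold: float) -> str:
    """Build a KQL query that averages one counter per Computer in 5-minute bins."""
    return (
        "Perf\n"
        f'| where ObjectName == "{object_name}" and CounterName == "{counter_name}"\n'
        "| summarize AggregatedValue = avg(CounterValue) by Computer, bin(TimeGenerated, 5m)\n"
        f"| where AggregatedValue {comparison} {threshold}"
    )

# One query per alert; every query splits by Computer, so 4 rules cover all VMs.
# (In practice the disk query would usually also split by InstanceName, one row per drive.)
ALERT_QUERIES = {
    "High CPU": perf_query("Processor", "% Processor Time", ">", 90),
    "High RAM usage": perf_query("Memory", "% Committed Bytes In Use", ">", 90),
    "Low Disk Space": perf_query("LogicalDisk", "% Free Space", "<", 10),
    "Heartbeat/Availability": (
        "Heartbeat\n"
        "| summarize LastHeartbeat = max(TimeGenerated) by Computer\n"
        "| where LastHeartbeat < ago(5m)"
    ),
}

if __name__ == "__main__":
    for name, query in ALERT_QUERIES.items():
        print(f"--- {name} ---\n{query}\n")
```

If you collect Linux counters or use a different table, the object and counter names will differ, but the "by Computer" shape stays the same.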

Yes, there are drawbacks to using Log Analytics. You have to install an agent, and Log Analytics can be pricey. However, if you are spending thousands or hundreds of thousands of dollars a month running IaaS VMs, what is a few dollars per machine to send the data to Log Analytics? Now with Data Collection Rules, you can get super granular with your data collection. And do you really need to collect that metric at 10 second intervals? Probably not.

Host Platform Metrics

Many customers want to use the host platform metrics, but at present there are several gotchas. Metric alerts for IaaS VMs can now be deployed at the subscription level, a marked improvement over the resource group level, but they are still limited by region. (Note: when creating such an alert you need to click the region in the top right and set it to anything other than “all”, or the portal will tell you that the available condition types are limited, removing the option for metric alerts. This is very confusing.)

Which brings me to my next point: we should be able to set a single alert for all regions and be done with it. Instead, if you have VMs in 4 regions and you want to set up 10 alerts for your VMs, you now need to create 40 alerts, 10 in each region.
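To make the multiplication concrete, here is a tiny sketch that enumerates the rules you would end up with. The region names and the `metric-N` placeholders are hypothetical stand-ins, not a real alert set.

```python
from itertools import product

# Hypothetical regions and metric placeholders, used only to show the multiplication.
regions = ["eastus", "westus2", "northeurope", "southeastasia"]
metrics = [f"metric-{n}" for n in range(1, 11)]  # stand-ins for 10 VM metric alerts

# One subscription-scoped rule is needed per (region, metric) pair.
rules = [f"vm-{metric}-{region}" for region, metric in product(regions, metrics)]

print(len(rules))  # 40
```

Add a fifth region and the same 10 alerts become 50 rules to create and keep in sync.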

There is another problem as well. A customer recently wanted Network In Total and Network Out Total alerts. When deploying them in ARM I received this error message: Error: Code=BadRequest; Message=For the resource type microsoft.compute/virtualmachines and metrics Network In Total, selecting only one resource is supported.

I tried this at the resource group level as well and got the same message, both in the portal and when deploying with ARM. So even using Host Platform Metrics we cannot deploy alerts to all VMs with certain counters. I have not tested all available counters, but it would not surprise me in the least to find more that cannot be deployed in this way.

 

PaaS Resources

PaaS resources are still a total crapshoot. Unfortunately, no PaaS resource metrics are surfaced to Log Analytics in the same way that IaaS metrics are, so doing metric alerts against Log Analytics as the resource type is not possible. You would have to do custom Log Search alerts against the metrics collected as logs. (Yes, that sentence makes sense.)
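For a sense of what "log search alerts against metrics collected as logs" could look like, here is a minimal sketch. It assumes the resources have diagnostic settings sending their platform metrics to the workspace, where they land in the AzureMetrics table; the provider, metric name, threshold and the `metrics_as_logs_query` helper are all chosen purely for illustration.

```python
# Sketch only: the provider, metric name and threshold are assumptions for illustration.

def metrics_as_logs_query(provider: str, metric_name: str, threshold: float) -> str:
    """Build a KQL query over AzureMetrics, averaged per resource in 5-minute bins."""
    return (
        "AzureMetrics\n"
        # =~ is a case-insensitive match, since provider casing in the table may vary.
        f'| where ResourceProvider =~ "{provider}" and MetricName == "{metric_name}"\n'
        "| summarize AggregatedValue = avg(Average) by Resource, bin(TimeGenerated, 5m)\n"
        f"| where AggregatedValue > {threshold}"
    )

query = metrics_as_logs_query("Microsoft.Web", "CpuPercentage", 80)
print(query)
```

Splitting by Resource plays the same role here that Computer does for the IaaS alerts: one rule, many resources.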

SQL PaaS metric alerts can be set up at the subscription level, much like IaaS VMs, with the limit being the region again. SQL PaaS is the only PaaS resource I am aware of that can be set up this way.

Meanwhile, over in App Services and Cosmos DB there are literally no metric alerts you can deploy… until you have actually deployed that resource type. The same goes for many other resources, such as networking resources like Express Route, VNet Gateways, etc. Typically when you see this, it means you can’t even deploy alerts at the resource group level; your alert has to be against that actual single resource. This becomes quite cumbersome when you have “cloud scale” resources in the hundreds or thousands. It also means you cannot proactively deploy alerts, because the resources have to exist. For example, when building a landing zone we set up the subscription and deploy the base networking, RBAC, policies, etc., but we cannot deploy alerts for many resource types until all those resources are deployed.
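Here is a rough sketch of how the rule count diverges between the two scoping models described above. The inventory, the `Resource` type and the `rules_needed` helper are hypothetical, and the set of subscription-scoped types simply encodes this post's observations (VMs and SQL PaaS, with the exact SQL type string being my assumption), not an official list.

```python
from collections import Counter
from typing import NamedTuple

class Resource(NamedTuple):
    name: str
    resource_type: str
    region: str

# Assumption based on this post: only these types take subscription-level
# metric alerts, and even then one region at a time.
SUBSCRIPTION_SCOPED_TYPES = {
    "microsoft.compute/virtualmachines",
    "microsoft.sql/servers/databases",
}

def rules_needed(resources, metrics_per_type):
    """Count alert rules per type: one scope per (type, region) where
    subscription scoping works, otherwise one scope per single resource."""
    scopes = set()
    for r in resources:
        rtype = r.resource_type.lower()
        if rtype in SUBSCRIPTION_SCOPED_TYPES:
            scopes.add((rtype, r.region))
        else:
            scopes.add((rtype, r.name))
    per_type = Counter(rtype for rtype, _ in scopes)
    return {rtype: count * metrics_per_type for rtype, count in per_type.items()}

# Hypothetical inventory: 200 VMs across 2 regions, 50 App Service apps in 1 region.
inventory = (
    [Resource(f"vm{i}", "Microsoft.Compute/virtualMachines", "eastus" if i % 2 else "westus2")
     for i in range(200)]
    + [Resource(f"app{i}", "Microsoft.Web/sites", "eastus") for i in range(50)]
)

print(rules_needed(inventory, metrics_per_type=3))
```

With 3 metrics per type, the 200 VMs need 6 rules while the 50 apps need 150, and none of those 150 can be created until each app exists.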

 

Conclusion

So what was the point of all this? Hopefully it brings a little more clarity about what is and is not possible when alerting at scale, and helps inform your decisions around metric alerts vs log alerts and Log Analytics design for all resource types. And hopefully, maybe, it brings more awareness of how alerting should be able to work within Azure and Azure Monitor.