Cloud Monitoring: What and How Much Information to Collect
Network and system utilization monitoring are essential pieces of any cloud environment, helping engineers ensure consistent performance and spot threats to availability, whether resource or security related, before they impact users.
A variety of platforms can collect data on your cloud environment. Depending on your cloud provider, some of them might be included in your contract. If you require specific features or integrations, you might add a third-party monitoring platform. Some can even monitor across different public and hybrid clouds on different virtualization platforms.
Once you’ve settled on a monitoring tool, you have to decide what data to collect and how much of it to store and review. In a very large-scale environment, this may even be a dedicated role for an employee. Some cloud environments will generate constant data that must be reviewed in order to meet internal SLAs or guarantee availability of your platform to the public. Other environments will only generate data rarely. In either case, the more information you can afford to store and review, the better you’ll be able to prevent and troubleshoot any problems with your virtual machines.
There are three main categories of metrics to monitor: work, resource, and change.
Work metrics are focused around the output and processes running in your cloud environment. The goal is to measure the effectiveness of your applications rather than the performance of the system itself – more a measure of the work being done than the effort required to do it.
Work metrics are measured in throughput, success, errors, and performance. Throughput is the amount of work done in a set amount of time, success is the percentage of that work that completed correctly, errors are the percentage that failed, and performance measures how efficiently each unit of work was done, such as the time it took.
For example, a datastore might be monitored based on its queries. The throughput would report the number of queries per second. Success would report the percentage of queries that were executed without error, while error would be the percentage of queries that failed for one reason or another. Different errors might be configured – for a datastore, you might report on exceptions as well as old data. Performance for the datastore might be monitored in terms of the query time in seconds.
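As a rough illustration of the datastore example, the four work metrics can be computed from a list of observed queries. This is a minimal sketch, not any particular monitoring tool's API: the `QueryRecord` shape, the `work_metrics` function, and the result keys are all hypothetical names chosen for this example.

```python
from dataclasses import dataclass


@dataclass
class QueryRecord:
    """One observed query (hypothetical record shape for this sketch)."""
    duration_s: float  # how long the query took, in seconds
    ok: bool           # True if the query executed without error


def work_metrics(records, window_s):
    """Summarize throughput, success, errors, and performance for one window.

    records  -- QueryRecord objects observed during the window
    window_s -- length of the observation window, in seconds
    """
    total = len(records)
    if total == 0 or window_s <= 0:
        # Nothing to measure: report zeros instead of dividing by zero.
        return {"throughput_qps": 0.0, "success_pct": 0.0,
                "error_pct": 0.0, "avg_query_time_s": 0.0}
    succeeded = sum(1 for r in records if r.ok)
    return {
        "throughput_qps": total / window_s,                # queries per second
        "success_pct": 100.0 * succeeded / total,          # ran without error
        "error_pct": 100.0 * (total - succeeded) / total,  # failed for any reason
        "avg_query_time_s": sum(r.duration_s for r in records) / total,
    }
```

Four queries observed over a two-second window, one of which failed, would report a throughput of 2 queries per second, 75% success, and 25% errors. A real setup would likely split errors into types, such as exceptions versus old data, as described above.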
Resource metrics are what many people think of first when it comes to cloud monitoring. They’re the actual consumption of your cloud resources and are therefore some of the most vital to keep an eye on, so you can enforce your SLA, make sure you’re scaling appropriately, and keep monthly expenses in check.
Resource metrics you’ll want to measure include utilization, saturation, errors, and availability. Utilization refers to the percentage of the resource that is currently being used. Saturation is the amount of work waiting to be completed by the resource – usually this will only occur when utilization is at or near 100%. Errors are faults specific to the resource at hand, such as storage device errors. Availability is the percentage of time that the resource has been responding to the monitoring tool or other requests.
You’ll want to monitor resource metrics for your storage, CPU, and memory at the least. Other areas that might be worth monitoring are microservices and databases. For a storage disk, utilization would measure the amount of time the device was working, saturation would measure the length of the wait queue to write or read, errors would report any disk problems, and availability would be the percent of time the device has been available to write.
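The storage disk example can be sketched the same way. The code below assumes a monitoring agent that polls the disk at a fixed interval and records one sample per poll. The `DiskSample` fields and the `disk_resource_metrics` function are hypothetical and do not correspond to a specific tool.

```python
from dataclasses import dataclass


@dataclass
class DiskSample:
    """One poll of a disk (hypothetical shape for this sketch)."""
    busy_fraction: float  # share of the interval the device was working, 0.0-1.0
    queue_length: int     # read/write requests waiting at poll time
    error_count: int      # disk errors reported since the previous poll
    responded: bool       # False if the disk did not answer the poll


def disk_resource_metrics(samples):
    """Summarize utilization, saturation, errors, and availability."""
    total = len(samples)
    answered = [s for s in samples if s.responded]
    if total == 0 or not answered:
        # No usable samples: everything is zero, including availability.
        return {"utilization_pct": 0.0, "avg_queue_length": 0.0,
                "errors": 0, "availability_pct": 0.0}
    return {
        # Utilization and saturation only make sense for polls that answered.
        "utilization_pct": 100.0 * sum(s.busy_fraction for s in answered) / len(answered),
        "avg_queue_length": sum(s.queue_length for s in answered) / len(answered),
        "errors": sum(s.error_count for s in answered),
        # Availability counts every poll, including the ones that failed.
        "availability_pct": 100.0 * len(answered) / total,
    }
```

Keeping availability separate from utilization matters here: a disk that never responds would otherwise look idle rather than down.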
Change metrics are focused around specific events happening in your environment. Many monitoring tools can be configured to check on these events, which are not ongoing but can offer valuable insight if you are trying to track down why a system is not performing adequately.
Change metrics can include alerts for specific events, records of updates or code builds, and notices of new VMs and resources being added. Usually they exist simply as a log of the event, the time, and additional preconfigured information, such as the logs of a failed job.
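To show how such a log helps when tracking down a problem, here is a minimal sketch that stores each change as an event with a time and some attached details, then lists the changes that happened shortly before an incident. The `ChangeEvent` fields and the `changes_before` helper are hypothetical names for illustration only.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class ChangeEvent:
    """One logged change (hypothetical shape for this sketch)."""
    kind: str        # e.g. "alert", "deploy", "vm_added"
    time: datetime   # when the event happened
    summary: str     # short human-readable description
    details: dict = field(default_factory=dict)  # e.g. logs of a failed job


def changes_before(events, incident_time, lookback):
    """Return events within `lookback` before the incident, newest first."""
    start = incident_time - lookback
    recent = [e for e in events if start <= e.time <= incident_time]
    return sorted(recent, key=lambda e: e.time, reverse=True)
```

Calling `changes_before(events, incident_time, timedelta(hours=1))` would return only the alerts, builds, and new resources from the hour leading up to the incident, which is usually the first place to look.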
Deciding How Much Data to Gather
Having too much information is rarely a problem, but it can make your life harder when you are trying to track down a known issue and resolve it. That means you need to be able to parse and understand your information. Organize your logs in a sensible manner and name them appropriately.
Granularity is a key concept when it comes to monitoring, and it refers to how often data is collected from your cloud environment. If you don’t pull data frequently enough, you might miss the event that took down your system and be unable to troubleshoot it. This becomes a fine balance between reporting often enough and drowning in too many data points. Averaging your data over time can also lead to trouble, because sudden resource spikes get blended into your usual consumption and disappear.
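The effect of averaging is easy to see with a small, made-up series. The sketch below rolls fine-grained samples up into coarser buckets using a chosen reducer. The `downsample` and `mean` helpers are hypothetical names written for this example.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of numbers."""
    return sum(values) / len(values)


def downsample(values, bucket_size, reducer):
    """Collapse consecutive samples into buckets of `bucket_size`.

    Each bucket is reduced to a single number by `reducer`
    (for example `mean` or the built-in `max`).
    """
    if bucket_size <= 0:
        raise ValueError("bucket_size must be positive")
    return [reducer(values[i:i + bucket_size])
            for i in range(0, len(values), bucket_size)]
```

Take sixty one-second CPU readings of 20% with a single one-second spike to 100%. `downsample(cpu, 60, mean)` yields one point of about 21.3%, which looks like an ordinary minute, while `downsample(cpu, 60, max)` yields 100% and preserves the spike. Storing a maximum alongside the average is one way to keep coarse data from hiding short events.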
Recording data constantly, or even every second, can itself lower performance. If you’re not testing something specific, a slightly longer collection interval is probably acceptable. You also need to store your logs, so make a plan for archiving as needed. Retention varies by application: for some, you may need to keep data for a year or several years, while for others past performance matters less and a shorter window will do. Seasonal spikes or sudden changes, like a major system update, often make historical comparisons more valuable.