Cloud Monitoring: What and How Much Information to Collect

Last updated: 12.18.2020

Network and system utilization monitoring is an essential piece of any cloud environment. It helps engineers ensure consistent performance and spot threats to availability, whether resource or security related, before they impact users.

There are a variety of platforms for collecting data on your cloud environment. Depending on your cloud provider, some of them might be included in your contract. If you require specific features or integrations, you might add a third-party monitoring platform. Some can even monitor across different public and hybrid clouds running on different virtualization platforms.

Once you’ve settled on a monitoring tool, you have to decide what data to collect and how much of it to store and review. In a very large environment, this may even be a dedicated role. Some cloud environments generate a constant stream of data that must be reviewed to meet internal SLAs or guarantee availability of your platform to the public; others generate data only rarely. In either case, the more information you can afford to store and review, the better you’ll be able to prevent and troubleshoot problems with your virtual machines.

There are three main categories of metrics to monitor: work, resource, and change.

Work metrics focus on the output of the processes running in your cloud environment. The goal is to measure the effectiveness of your applications rather than the performance of the system itself – more a measure of the work being done than the effort required to do it.

Work metrics are measured as throughput, success, errors, and performance. Throughput is the amount of work done in a set amount of time, success is the percentage of that work completed without error, errors are the percentage of operations that failed, and performance measures how efficiently the work was done.

For example, a datastore might be monitored based on its queries. The throughput would report the number of queries per second. Success would report the percentage of queries that were executed without error, while error would be the percentage of queries that failed for one reason or another. Different errors might be configured – for a datastore, you might report on exceptions as well as old data. Performance for the datastore might be monitored in terms of the query time in seconds.
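To make the datastore example concrete, here is a minimal sketch that computes the four work metrics from a batch of query records. The `QueryRecord` type and its field names are hypothetical, not drawn from any particular monitoring tool:

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class QueryRecord:
    duration_s: float     # query execution time in seconds (performance)
    error: Optional[str]  # None on success, otherwise an error category

def work_metrics(records, window_s):
    """Summarize throughput, success, errors, and performance for one window."""
    total = len(records)
    failed = sum(1 for r in records if r.error is not None)
    return {
        "throughput_qps": total / window_s,               # queries per second
        "success_pct": 100.0 * (total - failed) / total,  # % executed without error
        "error_pct": 100.0 * failed / total,              # % that failed
        "avg_query_time_s": sum(r.duration_s for r in records) / total,
    }
```

For instance, four queries in a two-second window with one timeout would report 2 queries per second with 75% success.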

Resource metrics are what many people think of first when it comes to cloud monitoring. They’re the actual consumption of your cloud resources and are therefore some of the most vital to keep an eye on, so you can enforce your SLA, make sure you’re scaling appropriately, and keep monthly expenses in check.

Resource metrics you’ll want to measure include utilization, saturation, errors, and availability. Utilization refers to the percentage of the resource that is currently being used. Saturation is the amount of work waiting to be completed by the resource – usually this will only occur when utilization is at or near 100%. Errors are related to the resource at hand – like storage device errors. And availability is the percent of time that the resource has been responding to the monitoring tool or other requests.

You’ll want to monitor resource metrics for your storage, CPU, and memory at the least. Other areas that might be worth monitoring are microservices and databases. For a storage disk, utilization would measure the amount of time the device was working, saturation would measure the length of the wait queue to write or read, errors would report any disk problems, and availability would be the percent of time the device has been available to write.
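As a sketch of the disk example, the four resource metrics might be derived from raw counters like this. The counter names are assumptions, loosely modeled on the kind of statistics disk-monitoring tools expose, not a specific tool's API:

```python
def disk_resource_metrics(busy_s, interval_s, queue_len, io_errors,
                          ok_checks, total_checks):
    """Compute utilization, saturation, errors, and availability for a disk."""
    return {
        "utilization_pct": 100.0 * busy_s / interval_s,  # % of time the device was working
        "saturation": queue_len,                         # reads/writes waiting in the queue
        "errors": io_errors,                             # disk problems this interval
        "availability_pct": 100.0 * ok_checks / total_checks,  # % of checks answered
    }
```

A disk that was busy 30 seconds out of a 60-second interval, with four requests queued and every health check answered, would report 50% utilization, a saturation of 4, and 100% availability.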

Change metrics focus on specific events happening in your environment. Many monitoring tools can be configured to check on these events, which are not ongoing but can offer valuable insight if you are trying to track down why a system is not performing adequately.

Change metrics can include alerts for specific events, they can record updates or code builds, and they can also monitor the addition of new VMs and resources. Usually they just exist as a log of the event, the time, and additional preconfigured information like logs of a failed job.
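Change metrics can be as simple as structured log entries. A minimal sketch, with field names that are illustrative rather than any specific tool's schema:

```python
import json
from datetime import datetime, timezone

def change_event(event_type, detail):
    """Build one change-metric entry: the event, the time, and extra context."""
    entry = {
        "event": event_type,                      # e.g. "code_deploy", "vm_added"
        "time": datetime.now(timezone.utc).isoformat(),
        "detail": detail,                         # preconfigured info, e.g. failed-job logs
    }
    return json.dumps(entry)                      # one JSON line per event
```

Calling `change_event("vm_added", {"vm": "web-03", "size": "4vcpu"})` would produce one line ready to append to an event log.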

Deciding How Much Data to Gather

Having too much information is rarely a problem in itself, but an unorganized mass of data can make your life harder when you are trying to track down a known issue and resolve it. You need to be able to parse and understand your information: organize your logs in a sensible manner and name them appropriately.

Granularity is a key concept when it comes to monitoring, and it refers to how often data is collected from your cloud environment. If you don’t pull data frequently enough, you might miss the event that took down your system and be totally unable to troubleshoot. This becomes a fine balance between reporting often enough and avoiding drowning in too many data points. Averaging your data over time can also lead to trouble, as you’ll miss sudden resource spikes as they are combined with your usual consumption.
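A quick illustration of the averaging pitfall: suppose CPU is sampled once per second and a two-second spike occurs inside a one-minute window. The per-minute average looks healthy while the spike is plainly visible in the raw samples. The numbers here are invented for illustration:

```python
# 58 seconds of normal load plus a 2-second spike, sampled once per second
samples = [20.0] * 58 + [95.0, 97.0]

minute_avg = sum(samples) / len(samples)  # what a 1-minute rollup reports
peak = max(samples)                       # what the raw 1-second data shows

print(f"average: {minute_avg:.1f}%  peak: {peak:.1f}%")
# the roughly 22% average hides a spike that nearly saturated the CPU
```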

Recording data constantly, or even every second, can itself lead to lower performance. If you’re not testing something specific, a slightly longer timeframe is probably acceptable. You also need to store your logs, so make a plan for archiving as needed. This varies by application: for some, you may need to store logs for a year or several years, since referencing past performance can matter. Seasonal spikes or sudden changes, like a major system update, often make historical comparisons more valuable.
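One simple way to sketch an archiving plan is a retention table per log category. The categories and windows below are assumptions for illustration; tune them to your own SLA and compliance needs:

```python
from datetime import datetime, timedelta

# hypothetical retention windows per log category
RETENTION_DAYS = {
    "raw_metrics": 30,      # full-granularity data: short-lived
    "hourly_rollups": 365,  # long enough for seasonal comparisons
    "change_events": 730,   # deploys and updates: keep for historical context
}

def should_archive(category, log_date, now=None):
    """True once a log is older than its category's retention window."""
    now = now or datetime.now()
    return now - log_date > timedelta(days=RETENTION_DAYS[category])
```

A 61-day-old raw-metrics log would be flagged for archiving, while an hourly rollup of the same age would still be retained.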
