Cloud Monitoring Part II: More on Granularity
We introduced some key concepts of cloud monitoring in our blog post earlier this week, namely the three types of data you need to collect to keep an eye on your cloud infrastructure. Today we’ll dive a little deeper into another factor we mentioned: granularity.
The granularity of your cloud monitoring data is how often you record the state of each metric. It can significantly change your visibility into your environment: data that isn’t granular enough averages potential problems away until they no longer raise a red flag during a troubleshooting review.
Granularity becomes a careful balancing act: collecting too much data taxes your system and negatively affects performance, while not taking the pulse often enough leads to ineffective cloud monitoring.
Polling Frequency and Data Retention
Almost any cloud monitoring platform you choose will query vCenter through its API, collecting information on system performance, tasks, events, inventory, or any other metric you have configured. That data is stored at a set granularity, which goes hand in hand with how long you retain it.
If you run a reasonably large virtual data center, storing very granular monitoring data can hurt system performance and incur storage costs that may not be favorable to your budget, a problem that compounds as your datastores grow.
By default, your system might poll performance data every 10 or 15 minutes. If possible, find a monitoring platform that can be configured to poll more often than this. You won’t always want a more frequent setting, but the data is valuable to have. vSphere’s shortest default polling interval is often five minutes, but the vSphere API also exposes 20-second data points for the previous five minutes, which works out to 15 data points.
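A quick way to sanity-check those numbers, and to see how fast samples accumulate at higher polling frequencies, is simple arithmetic. A minimal sketch in plain Python (no vSphere connection assumed):

```python
# How many data points a given polling interval yields over a window.
def data_points(window_seconds: int, interval_seconds: int) -> int:
    """Number of samples collected in the window at the given interval."""
    return window_seconds // interval_seconds

# vSphere-style real-time stats: 20-second samples over the past five minutes.
print(data_points(5 * 60, 20))        # 15 samples

# Sustain that rate for a full day, for a single metric:
print(data_points(24 * 60 * 60, 20))  # 4320 samples per metric per day
```

Multiply that last figure by a few dozen metrics per VM and the storage math behind the retention trade-off becomes obvious.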
That doesn’t sound like much data to store, but if your environment is even mildly complex, you will want to monitor anywhere from ten to dozens of metrics. For practical purposes, after a few days you’ll likely roll the more granular data up into a report that summarizes it by the hour or day.
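To make that rollup concrete, here is a minimal sketch in plain Python (the sample data and bucket width are invented for illustration) that collapses 20-second samples into hourly averages:

```python
from collections import defaultdict

def roll_up(samples, bucket_seconds=3600):
    """Average (timestamp, value) samples into fixed-width time buckets.

    samples: iterable of (epoch_seconds, value) pairs.
    Returns {bucket_start_epoch: average_value}.
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for ts, value in samples:
        bucket = ts - (ts % bucket_seconds)  # start of the bucket this sample falls in
        sums[bucket] += value
        counts[bucket] += 1
    return {bucket: sums[bucket] / counts[bucket] for bucket in sums}

# One hour of 20-second CPU samples: mostly ~50%, one brief spike to 110%.
samples = [(t, 110.0 if t == 600 else 50.0) for t in range(0, 3600, 20)]
hourly = roll_up(samples)
print(hourly)  # the brief spike all but disappears in the hourly average
```

The rollup saves storage, but note what happens to the spike: it is exactly the averaging-away problem described below.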
This can cause problems down the road when you try to analyze, say, a resource utilization spike. If your granularity is set at five-minute intervals, a CPU utilization spike might appear at 110% during a specific five-minute interval. Switch to an hourly view, and that same spike now appears as 60%, and you only know that it occurred at some point during a given hour. Once you are at day-level granularity, it can be very difficult to pinpoint where and why a system began to perform poorly.
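You can reproduce that dilution in a few lines of plain Python (the utilization numbers are invented to match the scenario above; utilization over 100% is possible in vSphere when a VM bursts beyond its nominal allocation):

```python
# Twelve five-minute CPU readings covering one hour: eleven quiet
# intervals around 55% and one spike to 110%.
five_minute = [55.0] * 11 + [110.0]

hourly = sum(five_minute) / len(five_minute)
print(f"5-minute view: max {max(five_minute):.0f}%")  # spike clearly visible
print(f"hourly view:   avg {hourly:.0f}%")            # spike averaged down to ~60%
```

At the five-minute granularity the spike stands out immediately; at the hourly granularity it blends into a figure that looks only mildly busy.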
Experiment in a test environment to see at what granularity performance starts to suffer. If your application is not mission-critical, you may not need data that granular. If it runs in a steady state, you’re less likely to see strange spikes and would likely still catch an anomaly over a longer window. If you’re planning change management in the near future, you might want to dial up the granularity so you can see exactly when your changes affect the metrics. Ultimately, the optimal granularity is an individual choice based on the state of your virtual machine(s), the application at hand, and your available resources and budget.