October 12, 2020
NUMA architectures allow for greater scalability, which is of course great for building cloud data centers. But if your virtual machines aren’t configured correctly, NUMA can cause performance degradation in VMware virtualized servers.
Here’s an overview of what NUMA is, why it’s useful for cloud computing, and how to address it when configuring your VMware cloud server.
NUMA, or non-uniform memory access, is a compute architecture that allows for greater scalability in multiprocessing. It evolved in contrast to SMP, or symmetric multiprocessing, where it is difficult to scale past 8–12 CPUs because every CPU contends for the same shared memory. NUMA-enabled servers reduce the number of CPUs with direct access to any given memory bus, improving performance by limiting competition for that memory.
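To make the idea concrete, here is a minimal sketch that enumerates a host's NUMA nodes on Linux, where the kernel exposes each node's CPUs and local memory under /sys/devices/system/node. (This assumes a Linux host; on a single-node machine you'll only see node0.)

```python
# Sketch: enumerate NUMA nodes on a Linux host via sysfs.
# Assumes a Linux kernel; paths under /sys/devices/system/node are standard.
from pathlib import Path

def list_numa_nodes():
    nodes = sorted(Path("/sys/devices/system/node").glob("node[0-9]*"))
    for node in nodes:
        cpus = (node / "cpulist").read_text().strip()
        # The first line of meminfo reports this node's total local memory in kB
        mem_kb = (node / "meminfo").read_text().splitlines()[0].split()[-2]
        print(f"{node.name}: CPUs {cpus}, {int(mem_kb) // 1024} MB local memory")

if __name__ == "__main__":
    list_numa_nodes()
```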
Say you’re building a planter at home. That’s analogous to a process being computed by a server under NUMA. You have most of the parts: nails, wood, and so on. But you’re missing a few screws, so you go to the hardware store for them. In this analogy, the parts you have at home are the local memory node, while the screws are remote memory. It makes sense to store as many parts locally as you can because it’s more efficient, but occasionally you need to go down the street to get what you need.
Here is a diagram of NUMA memory access courtesy of Frank Denneman, whose excellent blog post can give you additional context.
Cloud technology is built on virtualization: computing technology that abstracts the resources of an array of servers (all of the CPU power, memory, and storage available) and uses them as a single pool to deploy many virtual servers. This means a single physical server could be hosting dozens of VMs.
At this scale, servers are often chosen because they are built with many processing cores. As described above, NUMA can improve performance on each of those servers; without it, you could have hundreds of CPUs attempting to access one shared memory bus. Each NUMA node (essentially the section of memory closest to a group of CPUs) has its own memory controller, and a high-speed interconnect passes work between the nodes and Input/Output.
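The relative cost of crossing that interconnect is something the firmware reports and the Linux kernel exposes as a distance matrix: a node's distance to itself (typically 10) is the baseline, and larger numbers mean slower remote access. A quick sketch to print it, again assuming a Linux host:

```python
# Sketch: print the NUMA distance matrix the Linux kernel exposes via sysfs.
# Each row is one node; larger values mean costlier access over the interconnect.
from pathlib import Path

nodes = sorted(Path("/sys/devices/system/node").glob("node[0-9]*"))
print("        " + "  ".join(n.name for n in nodes))
for node in nodes:
    distances = (node / "distance").read_text().split()
    print(f"{node.name}   " + "     ".join(distances))
```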
However, an application must be NUMA-aware in order to avoid performance issues. VMware and Hyper-V have both been designed with NUMA in mind. More on that in a moment.
Generally speaking, NUMA improves performance. But occasionally a task requires more memory than is available on its local node. When that happens, performance can degrade: the CPU must request memory from a neighboring NUMA node, that node prioritizes its own local CPUs, and every trip across the interconnect adds latency. Software that isn’t written with NUMA in mind should be expected to perform poorly. The code and design of the application itself should minimize remote memory access, using memory on local nodes almost exclusively.
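For software you can't rewrite, one common workaround on Linux is to pin a process's CPUs and memory allocations to a single node with the numactl utility. A hedged sketch (worker.py is a hypothetical workload, and numactl must be installed on the host):

```python
# Sketch: pin a worker process's CPUs and memory allocations to NUMA node 0
# using the Linux numactl utility. worker.py is a placeholder workload.
import subprocess

# --cpunodebind restricts scheduling to node 0's CPUs;
# --membind forces allocations to come from node 0's local memory,
# so the process never pays remote-access latency.
subprocess.run(
    ["numactl", "--cpunodebind=0", "--membind=0", "python3", "worker.py"],
    check=True,
)
```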
In vSphere, when a VM is sized larger than a single NUMA node, a virtual NUMA topology (vNUMA) is generated, which gives the workload the performance benefits of NUMA while still supporting vMotion and other vSphere features. Be aware that enabling hot add for CPU or memory disables virtual NUMA. Other hypervisors, such as Hyper-V, also offer a virtual NUMA feature to mimic physical NUMA CPU/memory groupings.
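Since hot add silently switches vNUMA off, it's worth auditing your inventory for it. Here's an illustrative sketch using the pyVmomi SDK; the vCenter hostname and credentials are placeholders, and the check simply reads each VM's cpuHotAddEnabled and memoryHotAddEnabled config flags:

```python
# Sketch: flag VMs whose CPU/memory hot add setting disables vNUMA.
# Uses the pyVmomi SDK (pip install pyvmomi); host and credentials are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab only; verify certificates in production
si = SmartConnect(host="vcenter.example.com", user="admin", pwd="...", sslContext=ctx)
try:
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], recursive=True
    )
    for vm in view.view:
        cfg = vm.config
        if cfg and (cfg.cpuHotAddEnabled or cfg.memoryHotAddEnabled):
            print(f"{vm.name}: hot add enabled -- vNUMA is disabled for this VM")
    view.Destroy()
finally:
    Disconnect(si)
```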
Ultimately, this allows bigger, more reliable, and higher-performing virtualized workloads compared to apps running on physical servers, even those with NUMA hardware. You should also be able to toggle memory spanning on or off if you don’t want machines to reach into neighboring memory on another NUMA node.
vSphere’s NUMA scheduler balances these workloads, assigning each VM a local “home” node and serving memory requests from it whenever possible. The kernel can even dynamically move a VM to a different home node if required.
When it comes to VMware virtualization on top of NUMA-enabled servers, here are some factors to keep in mind.
You should not enable node interleaving in the server BIOS. By default it is likely disabled, which simply means NUMA is exposed and working.
Your host servers should have an equal amount of memory available for each NUMA node. In other words, a 16-core processor with 64 GB of RAM can be divided into four NUMA nodes with 4 cores and 16 GB of memory each. If you run a NUMA-aware application on top of an unbalanced NUMA node configuration, the app will try to find the best-performing route and will likely unbalance several nodes, maxing them out while leaving others relatively underutilized. This can happen as a system grows over time.
Say you need to expand your SQL server and provision some more vCPU cores. Suddenly you have nine NUMA nodes with varying numbers of cores attached. It might have made sense at the time, but rearchitecting this into a balanced number of NUMA nodes – even bringing it down to two – will likely result in shorter queries and better overall performance.
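The arithmetic here is easy to check programmatically. A toy sketch (the figures mirror the 16-core / 64 GB example above):

```python
# Sketch: sanity-check that a host's cores and memory divide evenly across
# its NUMA nodes. Values mirror the 16-core / 64 GB example in the text.
def check_numa_balance(total_cores: int, total_mem_gb: int, numa_nodes: int) -> None:
    if total_cores % numa_nodes or total_mem_gb % numa_nodes:
        print(f"Unbalanced: {total_cores} cores / {total_mem_gb} GB "
              f"do not split evenly across {numa_nodes} nodes")
        return
    print(f"{numa_nodes} nodes x {total_cores // numa_nodes} cores "
          f"and {total_mem_gb // numa_nodes} GB each")

check_numa_balance(16, 64, 4)   # 4 nodes x 4 cores and 16 GB each
check_numa_balance(18, 64, 4)   # unbalanced: 18 cores don't split across 4 nodes
```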
Always assign a VM no more vCPUs than the total number of cores in a single socket. Try to keep individual VMs from spanning multiple NUMA nodes if possible; if a workload needs more resources, provision the VM on a host with larger NUMA nodes instead.
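As a quick illustration of that rule, here's a toy check (cores_per_socket and the VM list are hypothetical values, assuming one NUMA node per socket):

```python
# Sketch: flag VM sizes that would span NUMA nodes, assuming one NUMA node
# per socket. cores_per_socket and the VM list are illustrative values.
cores_per_socket = 8  # physical cores in one socket / NUMA node (assumption)
vms = {"sql01": 8, "web01": 4, "etl01": 12}  # hypothetical VM -> vCPU counts

for name, vcpus in vms.items():
    if vcpus > cores_per_socket:
        print(f"{name}: {vcpus} vCPUs > {cores_per_socket} cores per socket; "
              f"this VM will span NUMA nodes")
    else:
        print(f"{name}: fits within a single NUMA node")
```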
While we haven’t yet moved our entire cloud platform to vSphere 6.5, it does bring some changes to how VMware handles NUMA and vNUMA settings and defaults. For more on those changes and recommended best practices, check out this blog post.
And of course, here's the classic meme that inspired our title. Couldn't resist.