Multi-GHz processors need maximum memory bandwidth to reach their full potential, and so does SQL Server: both need to access memory as fast, and with as much throughput, as possible.
A bottleneck develops when processors have to compete for bandwidth over a single system bus (the wired traces on the circuit board that connect the CPU chip(s) to the memory chips, and the CPU sockets to one another).
Manufacturers have attempted to remediate this bottleneck in one of two ways: either by employing a very high-speed data bus (expensive), or by using several “nodes” linked together across a high-speed “NUMA” interconnect (cheaper). Each NUMA node contains processors and memory, much like a small SMP system. An advanced memory controller allows a node to use memory on all other nodes, thus creating a single system image.
This means that when a processor accesses remote memory (memory that does not lie within its own node), the data must be transferred across the NUMA interconnect, which is much slower than accessing local memory. Because of this, memory access times depend greatly on where the memory sits relative to the node that is accessing it.
Crossing NUMA nodes is very expensive, and can become a bottleneck for SQL Server’s high memory-bandwidth demands. Care should be taken when presenting sockets and cores to guest machines so that the layout aligns correctly with SQL Server’s requirements. Otherwise, the high latency of remote memory access will leave processors under-utilized, constantly waiting for data to be transferred to the local node, and users left waiting for their result sets to be returned.
On such a system, query performance can be highly variable, and tempdb performance horrific under load.
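As a quick sanity check, a query along these lines (a minimal sketch against the sys.dm_os_nodes and sys.dm_os_memory_nodes DMVs) shows how many NUMA nodes SQL Server has detected inside the guest and how schedulers and memory are spread across them:

```sql
-- Sketch: how many NUMA nodes does SQL Server see, and how are schedulers
-- and memory distributed across them?  The hidden DAC node is filtered out.
SELECT  n.node_id,
        n.node_state_desc,
        n.memory_node_id,
        n.online_scheduler_count,
        n.active_worker_count,
        m.virtual_address_space_reserved_kb
FROM    sys.dm_os_nodes        AS n
JOIN    sys.dm_os_memory_nodes AS m
        ON m.memory_node_id = n.memory_node_id
WHERE   n.node_state_desc NOT LIKE '%DAC%';
```

If this returns a single node on a VM you expected to span two, or an uneven scheduler count per node, the virtual topology is worth a second look.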
Thinking It Through
If requested data happens to be placed in memory on node 1 during one benchmarking run, but a subsequent run places that same workload on node 2, then the working set must first be transferred to the new node now handling that work. That is still faster than going back to disk, but an eternity compared to local NUMA node access. This phenomenon can make troubleshooting SQL Server query performance extremely difficult, or just plain impossible.
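To see where a workload actually lands from run to run, a rough sketch like the following maps each currently executing request to the NUMA node its tasks are scheduled on, by joining sys.dm_exec_requests to sys.dm_os_tasks and sys.dm_os_schedulers:

```sql
-- Sketch: which NUMA node is each currently executing request running on?
-- parent_node_id on the scheduler identifies the node.
SELECT  r.session_id,
        r.status,
        s.scheduler_id,
        s.parent_node_id AS numa_node_id
FROM    sys.dm_exec_requests AS r
JOIN    sys.dm_os_tasks      AS t ON t.session_id   = r.session_id
JOIN    sys.dm_os_schedulers AS s ON s.scheduler_id = t.scheduler_id
WHERE   r.session_id > 50;   -- rough filter to skip most system sessions
```

Comparing this output across benchmarking runs makes it obvious when the same workload is bouncing between nodes.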
Example
Picture a physical host with 2 sockets and 16 cores in total.
We wish to provision a VM guest with 8 cores for a future SQL instance.
In the above example, 2 sockets with 4 cores each would best align with the hardware NUMA layout, and SQL Server will be happiest with the 8 cores you just provisioned for it.
If the VM were instead provisioned with 8 sockets of 1 core each, the virtual machine architecture would not match the underlying sockets, NUMA nodes would be crossed constantly, tempdb performance would suffer, and query performance would be inconsistent at best.
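Once the guest is built, you can confirm from inside SQL Server whether the topology the VM presents matches what you intended. A minimal sketch using sys.dm_os_sys_info (note that the socket_count, cores_per_socket, and numa_node_count columns only exist on recent versions, roughly SQL Server 2016 SP1 and later):

```sql
-- Sketch: what CPU and NUMA topology does SQL Server think it is running on?
-- socket_count / cores_per_socket / numa_node_count require a recent build.
SELECT  cpu_count,
        hyperthread_ratio,
        socket_count,
        cores_per_socket,
        numa_node_count
FROM    sys.dm_os_sys_info;
```

For the example above you would hope to see 8 CPUs across 2 sockets, 4 cores per socket, and 2 NUMA nodes.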
Rule of thumb:
Add sockets to match your hardware; add Cores in Fours, evenly across those sockets.
Disable CPU HOT-ADD!
The virtual NUMA topology is based on the NUMA topology of the underlying physical host. Once a virtual machine’s virtual NUMA topology is initialized, it does not change unless the number of vCPUs in that virtual machine is changed, which can throw the NUMA layout out of balance. A SQL Server’s NUMA configuration should not deviate once Windows is powered on, so CPU hot-add should be disabled on the VM guest.
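To confirm that the guest’s NUMA layout is actually being picked up (and has not quietly collapsed into a single node), one option is to look at the node-configuration messages SQL Server writes to the error log at startup. A minimal sketch using the undocumented xp_readerrorlog procedure, so treat it as a convenience check only:

```sql
-- Sketch: read the startup node-configuration messages from the current
-- SQL Server error log.  xp_readerrorlog is undocumented/unsupported.
EXEC sys.xp_readerrorlog
     0,                       -- 0 = current error log
     1,                       -- 1 = SQL Server error log
     N'Node configuration';   -- search string
```

One “Node configuration” line per NUMA node at startup is what you want to see; if that changes between reboots, revisit the vCPU layout and the hot-add setting.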