Flameman/numa

For more interesting projects done by Flameman, be sure to checkout his project index

= NUMA =

What is NUMA?
What is NUMA? It is an architecture where the memory access times for different regions of memory from a given processor varies according to the "distance" of the memory region from the processor. Each region of memory to which access times are the same from any cpu, is called a node. On such architectures, it is beneficial if the kernel tries to minimize inter node communications. Schemes for this range from kernel text and read-only data replication across nodes, and trying to house all the data structures that key components of the kernel need on memory on that node.

What does NUMA stand for?
NUMA stands for Non-Uniform Memory Access.

OK, So what does Non-Uniform Memory Access really mean to me?
Non-Uniform Memory Access means that it will take longer to access some regions of memory than others. This is due to the fact that some regions of memory are on physically different busses from other regions. For a more visual description, please refer to the section on NUMA architeture implementations. Also, see the real-world analogy for the NUMA architecture. This can result in some programs that are not NUMA-aware performing poorly. It also introduces the concept of local and remote memory.

What is the difference between NUMA and SMP?
The NUMA architecture was designed to surpass the scalability limits of the SMP architecture. With SMP, which stands for Symmetric Multi-Processing, all memory access are posted to the same shared memory bus. This works fine for a relatively small number of CPUs, but the problem with the shared bus appears when you have dozens, even hundreds, of CPUs competing for access to the shared memory bus. NUMA alleviates these bottlenecks by limiting the number of CPUs on any one memory bus, and connecting the various nodes by means of a high speed interconnect.

What is the difference between NUMA and ccNUMA?
The difference is almost nonexistent at this point. ccNUMA stands for Cache-Coherent NUMA, but NUMA and ccNUMA have really come to be synonymous. The applications for non-cache coherent NUMA machines are almost non-existent, and they are a real pain to program for, so unless specifically stated otherwise, NUMA actually means ccNUMA.

What is a node?
One of the problems with describing NUMA is that there are many different ways to implement this technology. This has led to a plethora of "definintions" for node. A fairly technically correct and also fairly ugly definition of a node is: a region of memory in which every byte has the same distance from each CPU. A more common definition is: a block of memory and the CPUs, I/O, etc. physically on the same bus as the memory. Some architectures do not have memory, CPUs, and I/O all on the same physical bus, so the second definition does not truly hold. In many cases, the less technical definition should be sufficient, but often the technical definition is more correct.

What is meant by local and remote memory?
The terms local memory and remote memory are typically used in reference to a currently running process. That said, local memory is typically defined to be the memory that is on the same node as the CPU currently running the process. Any memory that does not belong to the node on which the process is currently running is then, by that definition, remote. Local and remote memory can also be used in reference to things other than the currently running process. When in interrupt context, there technically is no currently executing process, but memory on the node containing the CPU handling the interrupt is still called local memory. Also, you could use local and remote memory in terms of a disk. For example if there was a disk (attatched to node 1) doing a DMA, the memory it is reading or writing would be called remote if it were located on another node (ie: node 0).

What do you mean by distance?
NUMA-based architectures necessarily introduce a notion of distance between system components (ie: CPUs, memory, I/O busses, etc). The metric used to determine a distance varies, but hops is a popular metric, along with latency and bandwidth. These terms all mean essentially the same thing that they do when used in a networking context (mostly because a NUMA machine is not all that different from a very tightly coupled cluster). So when used to describe a node, we could say that a particular range of memory is 2 hops (busses) from CPUs 0..3 and SCSI Controller 0. Thus, CPUs 0..3 and the SCSI Controller are a part of the same node.

Could you give a real-world analogy of the NUMA architecture to help understand all these terms?
Imagine that you are baking a cake. You have a group of ingredients (=memory pages) that you need to complete the recipe(=process). Some of the ingredients you may have in your cabinet(=local memory), but some of the ingredients you might not have, and have to ask a neighbor for(=remote memory). The general idea is to try and have as many of the ingredients in your own cabinet as possible, since this reduces your time and effort in making the cake. You also have to remember that your cabinets can only hold a fixed amount of ingredients(=physical nodal memory). If you try and buy more, but you have no room to store it, you may have to ask your neighbor to keep it in his/her cabinet until you need it(=local memory full, so allocate pages remotely). A bit of a strange example, I'll admit, but I think it works. If you have a better analogy, I'm all ears! ;)

Why should I use NUMA? What are the benefits of NUMA?
The main benefit of NUMA is, as mentioned above, scalability. It is extremely difficult to scale SMP past 8-12 CPUs. At that number of CPUs, the memory bus is under heavy contention. NUMA is one way of reducing the number of CPUs competing for access to a shared memory bus. This is accomplished by having several memory busses and only having a small number of CPUs on each of those busses. There are other ways of building massively multiprocessor machines, but this is a NUMA FAQ, so we'll leave the discussion of other methods to other FAQs.

What are the peculiarities of NUMA
CPU and/or node caches can result in NUMA effects. For example, the CPUs on a particular node will have a higher bandwidth and/or a lower latency to access the memory and CPUs on that same node. Due to this, you can see things like lock starvation under high contention. This is because if CPU x in the node requests a lock already held by another CPU y in the node, it's request will tend to beat out a request from a remote CPU z.

What are some alternatives to NUMA?
Also, splitting memory up and (possibly arbitrarily) assigning it to groups of CPUs can give some performance benefits similar to actual NUMA. A setup like this would be like a regular NUMA machine where the line between local and remote memory is blurred, since all the memory is actually on the same bus. The PowerPC Regatta system is an example of this. You can achieve some NUMA-like performance by using clusters as well. A cluster is very similar to a NUMA machine, where each individual machine in the cluster becomes a node in our virtual NUMA machine. The only real difference is the nodal latency. In a clustered environment, the latency and bandwidth on the internodal links are likely to be much worse.