Central processing unit
Design and implementation
Integer range
The way a CPU represents numbers is a design choice that affects the most basic ways in which the device functions. Some early digital computers used an electrical model of the common decimal (base ten) numeral system to represent numbers internally. A few other computers have used more exotic numeral systems like ternary (base three). Nearly all modern CPUs represent numbers in binary form, with each digit being represented by some two-valued physical quantity such as a "high" or "low" voltage.
Related to number representation is the size and precision of numbers that a CPU can represent. In the case of a binary CPU, a bit refers to one significant place in the numbers a CPU deals with. The number of bits (or numeral places) a CPU uses to represent numbers is often called "word size", "bit width", "data path width", or "integer precision" when dealing with strictly integer numbers (as opposed to floating point). This number differs between architectures, and often within different parts of the very same CPU. For example, an 8-bit CPU deals with a range of numbers that can be represented by eight binary digits (each digit having two possible values), that is, 2^8 or 256 discrete numbers. In effect, integer size sets a hardware limit on the range of integers the software run by the CPU can utilize.
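The arithmetic behind the 8-bit example can be sketched in a few lines of Python. This is an illustration only: the function names `unsigned_range` and `wrap_add` are invented here, and the sketch assumes plain unsigned integers that silently discard overflow, which is one common hardware behavior but not the only one.

```python
def unsigned_range(bits: int) -> tuple[int, int]:
    """Return (number of distinct values, largest value) for an unsigned integer."""
    count = 2 ** bits          # each added bit doubles the number of bit patterns
    return count, count - 1    # values run from 0 to count - 1


def wrap_add(a: int, b: int, bits: int) -> int:
    """Add two unsigned values and keep only the low `bits` bits of the sum."""
    mask = (1 << bits) - 1     # e.g. 0xFF for 8 bits
    return (a + b) & mask


if __name__ == "__main__":
    print(unsigned_range(8))      # (256, 255)
    print(wrap_add(250, 10, 8))   # 4, because 260 does not fit in 8 bits
```

Software that needs larger values on such a CPU has to combine several words per number, which costs additional instructions.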
Integer range can also affect the number of locations in memory the CPU can address (locate). For example, if a binary CPU uses 32 bits to represent a memory address, and each memory address represents one octet (8 bits), the maximum quantity of memory that CPU can address is 2^32 octets, or 4 GiB. This is a very simple view of CPU address space, and many modern designs use much more complex addressing methods like paging in order to locate more memory than their integer range would allow with a flat address space.
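The flat address space calculation can be written out as a short sketch. The helper name `addressable_octets` is hypothetical, and the model assumes the simple case described above: every address names exactly one fixed-size unit, with no paging or segmentation.

```python
GIB = 2 ** 30  # one gibibyte, in octets


def addressable_octets(address_bits: int, octets_per_address: int = 1) -> int:
    """Return the amount of memory reachable with a flat address space."""
    return (2 ** address_bits) * octets_per_address


if __name__ == "__main__":
    print(addressable_octets(32) // GIB)   # 4 (GiB) for 32-bit, octet-addressed memory
    print(addressable_octets(16))          # 65536 octets for a 16-bit address
```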
Higher levels of integer range require more structures to deal with the additional digits, and therefore more complexity, size, power usage, and, generally, expense. It is not at all uncommon, therefore, to see 4- or 8-bit microcontrollers used in modern applications, even though CPUs with much higher range (such as 16-, 32-, 64-, or even 128-bit) are available. The simpler microcontrollers are usually cheaper and use less power, and therefore dissipate less heat, all of which can be major design considerations for electronic devices. However, in higher-end applications, the benefits afforded by the extra range (most often the additional address space) are more significant and often affect design choices. To gain some of the advantages afforded by both lower and higher bit lengths, many CPUs are designed with different bit widths for different portions of the device. For example, the IBM System/370 used a CPU that was primarily 32-bit, but it used 128-bit precision inside its floating point units to facilitate greater accuracy and range in floating point numbers (Amdahl et al. 1964). Many later CPU designs use similar mixed bit widths, especially when the processor is meant for general-purpose usage where a reasonable balance of integer and floating point capability is required.
Clock rate

Most CPUs, and indeed most sequential logic devices, are synchronous
in nature. That is, they are designed around, and operate on assumptions
about, a synchronization signal. This signal, known as a clock
signal, usually takes the form of a periodic square wave. By calculating
the maximum time that electrical signals can take to move through the
various branches of a CPU's many circuits, the designers can select an
appropriate period for the clock signal.
This period must be longer than the amount of time it takes for a signal to move, or propagate, in the worst-case scenario. In setting the clock period to a value well above the worst-case propagation delay, it is possible to design the entire CPU and the way it moves data around the "edges" of the rising and falling clock signal. This has the advantage of simplifying the CPU significantly, both from a design perspective and a component-count perspective. However, it also carries the disadvantage that the entire CPU must wait on its slowest elements, even though some portions of it are much faster. This limitation has largely been compensated for by various methods of increasing CPU parallelism (see below).
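The relationship between worst-case propagation delay and clock period can be illustrated with a minimal sketch. The path names, the delay figures, and the 10% safety margin are all assumptions chosen for illustration; real timing analysis accounts for many more effects.

```python
def min_clock_period_ns(path_delays_ns, margin=0.10):
    """Return a clock period that exceeds the slowest path by a safety margin."""
    worst_case = max(path_delays_ns.values())   # the slowest path sets the limit
    return worst_case * (1.0 + margin)


def frequency_mhz(period_ns):
    """Convert a clock period in nanoseconds to a frequency in megahertz."""
    return 1000.0 / period_ns


if __name__ == "__main__":
    # Hypothetical propagation delays, in nanoseconds, for three circuit paths.
    delays = {"decoder": 1.5, "register_file": 2.5, "alu": 4.0}
    period = min_clock_period_ns(delays)
    print(f"period >= {period:.2f} ns, frequency <= {frequency_mhz(period):.1f} MHz")
```

In this model the faster paths (the decoder and register file) finish early and sit idle for the rest of each cycle, which is the cost of a single global clock noted above.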
Architectural improvements alone do not solve all of the drawbacks of globally synchronous CPUs, however. For example, a clock signal is subject to the delays of any other electrical signal. Higher clock rates in increasingly complex CPUs make it more difficult to keep the clock signal in phase (synchronized) throughout the entire unit. This has led many modern CPUs to require multiple identical clock signals to be provided in order to avoid delaying a single signal significantly enough to cause the CPU to malfunction. Another major issue as clock rates increase dramatically is the amount of heat that is dissipated by the CPU. The constantly changing clock causes many components to switch regardless of whether they are being used at that time. In general, a component that is switching uses more energy than an element in a static state. Therefore, as clock rate increases, so does heat dissipation, causing the CPU to require more effective cooling solutions.
One method of dealing with the switching of unneeded components
is called clock gating, which involves turning off the clock signal
to unneeded components (effectively disabling them). However,
this is often regarded as difficult to implement and therefore
does not see common usage outside of very low-power designs. Another
method of addressing some of the problems with a global clock
signal is the removal of the clock signal altogether. While removing
the global clock signal makes the design process considerably
more complex in many ways, asynchronous (or clockless) designs
carry marked advantages in power consumption and heat dissipation
in comparison with similar synchronous designs. While somewhat
uncommon, entire CPUs have been built without utilizing a global
clock signal. Two notable examples of this are the ARM-compliant
AMULET and the MIPS R3000-compatible MiniMIPS. Rather than totally
removing the clock signal, some CPU designs allow certain portions
of the device to be asynchronous, such as using asynchronous ALUs
in conjunction with superscalar pipelining to achieve some arithmetic
performance gains. While it is not altogether clear whether totally
asynchronous designs can perform at a comparable or better level
than their synchronous counterparts, it is evident that they do
at least excel in simpler math operations. This, combined with
their excellent power consumption and heat dissipation properties,
makes them very suitable for embedded computers (Garside et al.
1999).
Parallelism

The description of the basic operation of a CPU offered in the previous section describes the simplest form that a CPU can take. This type of CPU, usually referred to as subscalar, executes one instruction at a time, operating on one or two pieces of data.
This process gives rise to an inherent inefficiency in subscalar CPUs. Since only one instruction is executed at a time, the entire CPU must wait for that instruction to complete before proceeding to the next instruction. As a result the subscalar CPU gets "hung up" on instructions which take more than one clock cycle to complete execution. Even adding a second execution unit (see below) does not improve performance much; rather than one pathway being hung up, now two pathways are hung up and the number of unused transistors is increased. This design, wherein the CPU's execution resources can operate on only one instruction at a time, can only possibly reach scalar performance (one instruction per clock). However, the performance is nearly always subscalar (less than one instruction per cycle).
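A minimal sketch makes the subscalar limit concrete. It assumes a CPU that starts each instruction only after the previous one has finished; the function name `ipc_sequential` and the cycle counts in the example are hypothetical.

```python
def ipc_sequential(cycles_per_instruction):
    """Instructions per cycle when instructions run strictly one after another."""
    if not cycles_per_instruction:
        raise ValueError("need at least one instruction")
    total_cycles = sum(cycles_per_instruction)   # no overlap between instructions
    return len(cycles_per_instruction) / total_cycles


if __name__ == "__main__":
    # Four instructions taking 1, 3, 1, and 5 cycles: 4 instructions in 10 cycles.
    print(ipc_sequential([1, 3, 1, 5]))   # 0.4
    # Scalar performance is reached only if every instruction takes one cycle.
    print(ipc_sequential([1, 1, 1, 1]))   # 1.0
```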
Attempts to achieve scalar and better performance have resulted in a variety of design methodologies that cause the CPU to behave less linearly and more in parallel. When referring to parallelism in CPUs, two terms are generally used to classify these design techniques. Instruction level parallelism (ILP) seeks to increase the rate at which instructions are executed within a CPU (that is, to increase the utilization of on-die execution resources), and thread level parallelism (TLP) aims to increase the number of threads (effectively individual programs) that a CPU can execute simultaneously. The two methodologies differ both in the ways in which they are implemented and in the relative effectiveness they afford in increasing the CPU's performance for an application.
ILP: Instruction pipelining and superscalar architecture

One of the simplest methods used to accomplish increased parallelism is to begin the first steps of instruction fetching and decoding before the prior instruction finishes executing. This is the simplest form of a technique known as instruction pipelining, and is utilized in almost all modern general-purpose CPUs. Pipelining allows more than one instruction to be executed at any given time by breaking down the execution pathway into discrete stages. This separation can be compared to an assembly line, in which an instruction is made more complete at each stage until it exits the execution pipeline and is retired.
Pipelining does, however, introduce the possibility of a situation where the result of the previous operation is needed to complete the next operation, a condition often termed a data dependency conflict. To cope with this, additional care must be taken to check for these sorts of conditions and delay a portion of the instruction pipeline if this occurs. Naturally, accomplishing this requires additional circuitry, so pipelined processors are more complex than subscalar ones (though not significantly so). A pipelined processor can become very nearly scalar, inhibited only by pipeline stalls (an instruction spending more than one clock cycle in a stage).
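The effect of stalls on a pipelined processor can be estimated with the usual textbook cycle count. The sketch below assumes an idealized pipeline in which every stage takes one cycle and each stall adds exactly one cycle; the function names and the example figures are hypothetical.

```python
def pipeline_cycles(n_instructions, n_stages, stall_cycles=0):
    """Cycles needed to retire n instructions from an idealized pipeline."""
    if n_instructions == 0:
        return 0
    # The first instruction needs n_stages cycles to pass through the pipeline;
    # each later one retires one cycle after its predecessor, plus any stalls.
    return n_stages + (n_instructions - 1) + stall_cycles


def pipeline_ipc(n_instructions, n_stages, stall_cycles=0):
    """Average instructions retired per cycle."""
    return n_instructions / pipeline_cycles(n_instructions, n_stages, stall_cycles)


if __name__ == "__main__":
    print(pipeline_cycles(100, 5))                    # 104 cycles, no stalls
    print(pipeline_cycles(100, 5, stall_cycles=20))   # 124 cycles with 20 stall cycles
    print(round(pipeline_ipc(100, 5), 3))             # close to, but below, 1.0
```

Without stalls the rate approaches one instruction per cycle as the instruction count grows, which is why such a design is described as very nearly scalar.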

Further improvement upon the idea of instruction pipelining led to the development of a method that decreases the idle time of CPU components even further. Designs that are said to be superscalar include a long instruction pipeline and multiple identical execution units. In a superscalar pipeline, multiple instructions are read and passed to a dispatcher, which decides whether or not the instructions can be executed in parallel (simultaneously). If so, they are dispatched to available execution units, so that several instructions execute simultaneously. In general, the more instructions a superscalar CPU is able to dispatch simultaneously to waiting execution units, the more instructions will be completed in a given cycle.
Most of the difficulty in the design of a superscalar CPU architecture lies in creating an effective dispatcher. The dispatcher needs to be able to quickly and correctly determine whether instructions can be executed in parallel, as well as dispatch them in such a way as to keep as many execution units busy as possible. This requires that the instruction pipeline is filled as often as possible and gives rise to the need in superscalar architectures for significant amounts of CPU cache. It also makes hazard-avoiding techniques like branch prediction, speculative execution, and out-of-order execution crucial to maintaining high levels of performance. By attempting to predict which branch (or path) a conditional instruction will take, the CPU can minimize the number of times that the entire pipeline must wait until a conditional instruction is completed. Speculative execution often provides modest performance increases by executing portions of code that may or may not be needed after a conditional operation completes. Out-of-order execution somewhat rearranges the order in which instructions are executed to reduce delays due to data dependencies.
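The dispatcher's core decision, whether neighboring instructions may run in parallel, can be sketched as a simple register dependency check. This is a deliberately simplified, in-order model and not a description of any real CPU: the `Instr` type, the register names, and the rule of closing an issue group as soon as an instruction reads or overwrites a register written earlier in the same group are all assumptions made for illustration.

```python
from typing import NamedTuple


class Instr(NamedTuple):
    dest: str                # register written by the instruction
    srcs: tuple[str, ...]    # registers read by the instruction


def issue_groups(program: list[Instr], width: int = 2) -> list[list[Instr]]:
    """Split a program, in order, into groups that can be dispatched together."""
    groups: list[list[Instr]] = []
    current: list[Instr] = []
    written: set[str] = set()    # registers written by the group being built
    for ins in program:
        # A conflict exists if the instruction reads or rewrites a register
        # that an earlier instruction in the same group writes.
        conflict = any(s in written for s in ins.srcs) or ins.dest in written
        if current and (conflict or len(current) == width):
            groups.append(current)          # close the group, start a new one
            current, written = [], set()
        current.append(ins)
        written.add(ins.dest)
    if current:
        groups.append(current)
    return groups


if __name__ == "__main__":
    program = [
        Instr("r1", ("r2", "r3")),
        Instr("r4", ("r5", "r6")),   # independent of the first instruction
        Instr("r7", ("r1", "r4")),   # needs both earlier results
        Instr("r8", ("r7", "r2")),   # needs the previous result
    ]
    print([len(g) for g in issue_groups(program)])   # [2, 1, 1]
```

Only the first two instructions can be dispatched together here; the dependent ones each occupy a cycle alone. Out-of-order execution, mentioned above, attempts to recover such lost slots by looking further ahead for independent work.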
In the case where a portion of the CPU is superscalar and part is not, the part which is not suffers a performance penalty due to scheduling stalls. The original Intel Pentium (P5) had two superscalar ALUs which could accept one instruction per clock each, but its FPU could not accept one instruction per clock. Thus the P5 was integer superscalar but not floating point superscalar. Intel's successor to the Pentium architecture, P6, added superscalar capabilities to its floating point features, and therefore afforded a significant increase in floating point instruction performance.
Both simple pipelining and superscalar design increase a CPU's
ILP: pipelining allows a single processor to approach a rate of one
instruction per cycle (IPC), and superscalar design allows it to
surpass that rate. Most modern CPU designs are at least somewhat
superscalar, and nearly all general-purpose CPUs designed in the
last decade are superscalar.
In later years some of the emphasis in designing high-ILP computers
has been moved out of the CPU's hardware and into its software
interface, or ISA. The strategy of the very long instruction word
(VLIW) causes some ILP to become implied directly by the software,
reducing the amount of work the CPU must perform to boost ILP
and thereby reducing the design's complexity.
TLP: Simultaneous thread execution
Another strategy commonly used to increase the parallelism of
CPUs is to include the ability to run multiple threads (programs)
at the same time. In general, high-TLP CPUs have been in use much
longer than high-ILP ones. Many of the designs pioneered by Cray
during the late 1970s and 1980s concentrated on TLP as their primary
method of enabling enormous (for the time) computing capability.
In fact, TLP in the form of multiple thread execution improvements
was in use as early as the 1950s (Smotherman 2005). In the context
of single processor design, the two main methodologies used to
accomplish TLP are chip-level multiprocessing (CMP) and simultaneous
multithreading (SMT). On a higher level, it is very common to
build computers with multiple totally independent CPUs in arrangements
like symmetric multiprocessing (SMP) and non-uniform memory access
(NUMA). While using very different means, all of these techniques
accomplish the same goal: increasing the number of threads that
the CPU(s) can run in parallel.
The CMP and SMP methods of parallelism are similar to one another
and the most straightforward. These involve little more conceptually
than the utilization of two or more complete and independent CPUs.
In the case of CMP, multiple processor "cores" are included in
the same package, sometimes on the very same integrated circuit.
SMP, on the other hand, includes multiple independent packages.
NUMA is somewhat similar to SMP but uses a nonuniform memory access
model. This is important for computers with many CPUs because
the bandwidth of SMP's shared memory is quickly exhausted as
processors are added, resulting in significant delays as CPUs
wait for memory. Therefore, NUMA is considered a much
more scalable model, allowing many more CPUs to be
used in one computer than SMP can feasibly support.
SMT differs somewhat from other TLP improvements in that it attempts
to duplicate as few portions of the CPU as possible. While considered a TLP
strategy, its implementation actually more resembles superscalar
design, and indeed is often used in superscalar microprocessors
(such as IBM's POWER5). Rather than duplicating the entire CPU,
SMT designs only duplicate parts needed for instruction fetching,
decoding, and dispatch, as well as things like general-purpose
registers. This allows an SMT CPU to keep its execution units
busy more often by providing them instructions from two different
software threads. Again, this is very similar to the ILP superscalar
method, but simultaneously executes instructions from multiple
threads rather than executing multiple instructions from the same
thread concurrently.
Vector processors and SIMD
A less common but increasingly important paradigm of CPUs (and
indeed, computing in general) deals with vectors. The processors
discussed earlier are all referred to as some type of scalar device.
As the name implies, vector processors deal with multiple pieces
of data in the context of one instruction. This contrasts with
scalar processors, which deal with one piece of data for every
instruction. The scalar and vector schemes of dealing with data are
generally referred to as SISD (single instruction, single data) and SIMD
(single instruction, multiple data), respectively. The great utility
in creating CPUs that deal with vectors of data lies in optimizing
tasks that tend to require the same operation (for example, a
sum or a dot product) to be performed on a large set of data.
Some classic examples of these types of tasks are multimedia applications
(images, video, and sound), as well as many types of scientific
and engineering tasks. Whereas a scalar CPU must complete the
entire process of fetching, decoding, and executing each instruction
and value in a set of data, a vector CPU can perform a single
operation on a comparatively large set of data with one instruction.
Of course, this is only possible when the application tends to
require many steps which apply one operation to a large set of
data.
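The difference in instruction count between the two schemes can be illustrated with a small model. This is a sketch under stated assumptions: the functions `scalar_add` and `vector_add` are hypothetical, the four-lane vector width is arbitrary, and the operation counter stands in for instructions issued rather than measuring real hardware.

```python
def scalar_add(a, b):
    """Add two sequences element by element, one 'instruction' per element."""
    result, instructions = [], 0
    for x, y in zip(a, b):
        result.append(x + y)
        instructions += 1
    return result, instructions


def vector_add(a, b, lanes=4):
    """Add two sequences, with one 'instruction' handling up to `lanes` elements."""
    result, instructions = [], 0
    for i in range(0, len(a), lanes):
        chunk_a, chunk_b = a[i:i + lanes], b[i:i + lanes]
        result.extend(x + y for x, y in zip(chunk_a, chunk_b))
        instructions += 1    # the whole chunk counts as a single operation
    return result, instructions


if __name__ == "__main__":
    a = list(range(10))
    b = [10] * 10
    print(scalar_add(a, b)[1])   # 10 operations
    print(vector_add(a, b)[1])   # 3 operations (4 + 4 + 2 elements)
```

Both functions produce the same sums; the saving appears only because the same operation is applied across the whole data set.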
Most early vector CPUs, such as the Cray-1, were associated almost exclusively with scientific research and cryptography applications. However, as multimedia has largely shifted to digital media, the need for some form of SIMD in general-purpose CPUs has become significant. Shortly after it became commonplace to include floating point execution units in general-purpose processors, specifications for and implementations of SIMD execution units also began to appear for general-purpose CPUs. Some of these early SIMD specifications, like Intel's MMX, were integer-only. This proved to be a significant impediment for some software developers, since many of the applications that benefit from SIMD primarily deal with floating point numbers. Progressively, these early designs were refined and remade into some of the common, modern SIMD specifications, which are usually associated with one ISA. Some notable modern examples are Intel's SSE and the PowerPC-related AltiVec (also known as VMX).