Ken Shirriff's blog

Notes on the Intel 8086 processor's arithmetic-logic unit

In 1978, Intel introduced the 8086 processor, a revolutionary chip that led to the modern x86 architecture. Unlike modern 64-bit processors, however, the 8086 is a 16-bit chip. Its arithmetic/logic unit (ALU) operates on 16-bit values, performing arithmetic operations such as addition and subtraction, as well as logic operations including bitwise AND, OR, and XOR. The 8086's ALU is a complicated part of the chip, performing 28 operations in total.1

In this post, I discuss the circuitry that controls the ALU, generating the appropriate control signals for a particular operation. The process is more complicated than you might expect. First, a machine code instruction results in the execution of multiple microcode instructions. Using the ALU is a two-step process: one microcode instruction (micro-instruction) configures the ALU for the desired operation, while a second micro-instruction gets the results from the ALU. Moreover, based on both the microcode micro-instruction and the machine code instruction, the control circuitry sends control signals to the ALU, reconfiguring it for the desired operation. Thus, this circuitry provides the "glue" between the micro-instructions and the ALU.

The die photo below shows the 8086 processor under a microscope. I've labeled the key functional blocks. Architecturally, the chip is partitioned into a Bus Interface Unit (BIU) at the top and an Execution Unit (EU) below. The BIU handles bus and memory activity as well as instruction prefetching, while the Execution Unit (EU) executes the instructions. In the lower right corner, the microcode ROM holds the micro-instructions. The ALU is in the lower left corner, with bits 7-0 above and bits 15-8 below, sandwiching the status flag circuitry. The ALU control circuitry, highlighted in red at the bottom of the chip, is the focus of this article.

The die of the 8086. Click this image (or any other) for a larger version.

Microcode

The 8086 processor implements most machine instructions in microcode, with a micro-instruction for each step of the machine instruction. (I discuss the 8086's microcode in detail here.) The 8086 uses an interesting architecture for microcode: each micro-instruction performs two unrelated operations. The first operation moves data between a source and a destination. The second operation can range from a jump or subroutine call to a memory read/write or an ALU operation. An ALU operation has a five-bit field to specify a particular operation and a two-bit field to specify which temporary register provides the input. As you'll see below, these two fields play an important role in the ALU circuitry.

In many cases, the 8086's micro-instruction doesn't specify the ALU operation, leaving the details to be substituted from the machine instruction opcode. For instance, the ADD, SUB, ADC, SBB, AND, OR, XOR, and CMP machine instructions share the same microcode, while the hardware selects the ALU operation from the instruction opcode. Likewise, the increment and decrement instructions use the same microcode, as do the decimal adjust instructions DAA and DAS, and the ASCII adjust instructions AAA and AAS. Inside the micro-instruction, all these operations are performed with a "pseudo" ALU operation called XI (for some reason). If the microcode specifies an XI ALU operation, the hardware replaces it with the ALU operation specified in the instruction. Another important feature of the microcode is that you need to perform one ALU micro-instruction to configure the ALU's operation, but the result isn't available until a later micro-instruction, which moves the result to a destination. This has the consequence that the hardware must remember the ALU operation.

To make this concrete, here is the microcode that implements a typical arithmetic instruction such as ADD AL, BL or XOR [BX+DI], CX. This microcode consists of three micro-instructions. The left half of each micro-instruction specifies a data movement, first moving the two arguments to ALU temporary registers and then storing the ALU result (called Σ). The right half of each micro-instruction performs the second task. First, the ALU is configured to perform an XI operation using temporary register A. Recall that XI indicates the ALU operation is filled in from the machine instruction; this is how the same microcode handles eight different types of machine instructions. In the second micro-instruction, the next machine instruction is started unless a memory writeback is required (WB). The last micro-instruction is RNI (Run Next Instruction) to start a new machine instruction. It also indicates that the processor status flags (F) should be updated to indicate if the ALU result is zero, positive, overflow, and so forth.2

M → tmpa   XI   tmpa  Load first argument, configure ALU.
R → tmpb   WB,NXT     Load second argument, start Next instruction if no memory writeback
Σ → M      RNI  F     Store ALU result, Run Next Instruction, update status Flags

The ALU circuit

The ALU is the heart of a processor, performing arithmetic and logic operations. Microprocessors of the 1970s typically supported addition and subtraction; logical AND, OR, and XOR; and various bit shift operations. (Although the 8086 had multiply and divide instructions, these were implemented in microcode, not in the ALU.) Since an ALU is both large and critical to performance, chip architects try to optimize its design. As a result, different microprocessors have widely different ALU designs. For instance, the 6502 microprocessor has separate circuits for addition and each logic operation; a multiplexer selects the appropriate output. The Intel 8085, on the other hand, uses an optimized clump of gates that performs the desired operation based on control signals (details), while the Z80's 4-bit ALU uses a different clump of gates (details).

The 8086 takes a different approach, using two lookup tables (along with other gates) to generate the carry and output signals for each bit in the ALU. By setting the lookup tables appropriately, the ALU can be configured to perform the desired operation. (This is similar to how an FPGA implements arbitrary functions through lookup tables.) The schematic below shows the circuit for one bit of the ALU. I won't explain this circuit in detail since I explained it in an earlier article.3 The relevant part of this circuit is the six control signals at the left. The two multiplexers (trapezoidal symbols) implement the lookup tables by using the two input argument bits to select outputs from the control signals to control carry generation and carry propagation. Thus, by feeding appropriate control signals into the ALU, the 8086 can reconfigure the ALU to perform the desired operation. For instance, with one set of control signals, this circuit will add. Other sets of control signals will cause the circuit to subtract or compute a logical operation, such as AND or XOR. The 8086 has 16 copies of this circuit, so it operates on 16-bit values.

The circuit that implements one bit in the 8086's ALU.

The 8086 is a complicated processor, and its instructions have many special cases, so controlling the ALU is more complex than described above. For instance, the compare operation is the same as a subtraction, except the numerical result of a compare is discarded; just the status flags are updated. The add versus add-with-carry instructions require different values for the carry into bit 0, while subtraction requires the carry flag to be inverted since it is treated as a borrow. The 8086's ALU supports increment and decrement operations, but also increment and decrement by 2, which requires an increment signal into bit 1 instead of bit 0. The bit-shift operations all require special treatment. For instance, a rotate can use the carry bit or exclude the carry bit, while and arithmetic shift right requires the top bit to be duplicated. As a result, along with the six lookup table (LUT) control signals, the ALU also requires numerous control signals to adjust its behavior for specific instructions. In the next section, I'll explain how these control signals are generated.

ALU control circuitry on the die

The diagram below shows the components of the ALU control logic as they appear on the die. The information from the micro-instruction enters at the right and is stored in the latches. The PLAs (Programmable Logic Arrays) decode the instruction and generate the control signals. These signals flow to the left, where they control the ALU.

The ALU control logic as it appears on the die. I removed the metal layer to show the underlying polysilicon and silicon. The reddish lines are remnants of the metal.

As explained earlier, if the microcode specifies the XI operation, the operation field is replaced with a value based on the machine instruction opcode. This substitution is performed by the XI multiplexer before the value is stored in the operation latch. Because of the complexity of the 8086 instruction set, the XI operation is not as straightforward as you might expect. This multiplexer gets three instruction bits from a special register called the "X" register, another instruction bit from the instruction register, and the final bit from a decoding circuit called the Group Decode ROM.4

Recall that one micro-instruction specifies the ALU operation, and a later micro-instruction accesses the result. Thus, the ALU control circuitry must remember the specified operation so it can be used later. In particular, the control circuitry must keep track of the ALU operation to perform and the temporary register specified. The control circuitry uses three flip-flops to keep track of the specified temporary register, one flip-flop for each register. The micro-instruction contains a two-bit field that specifies the temporary register. The control circuitry decodes this field and activates the associated flip-flop. The outputs from these flip-flops go to the ALU and enable the associated temporary register. At the start of each machine instruction,5 the flip-flops are reset, so temporary register A is selected by default.

The control circuitry uses five flip-flops to store the five-bit operation field from the micro-instruction. At the start of each machine instruction, the flip-flops are reset so operation 0 (ADD) is specified by default. One important consequence is that an add operation can potentially be performed without a micro-instruction to configure the ALU, shortening the microcode by one micro-instruction and thus shortening the instruction time by one cycle.

The five-bit output from the operation flip-flops goes to the operation PLA (Programmable Logic Array)7, which decodes the operation into 27 control signals.6 Many of these signals go to the ALU, where they control the behavior of the ALU for special cases. About 15 of these signals go to the Lookup Table (LUT) PLA, which generates the six lookup table signals for the ALU. At the left side of the LUT PLA, special high-current driver circuits amplify the control signals before they are sent to the ALU. Details on these drivers are in the footnotes.8

Conclusions

Whenever I look at the circuitry of the 8086 processor, I see the differences between a RISC chip and a CISC chip. In a RISC (Reduced Instruction Set Computer) processor such as ARM, instruction decoding is straightforward, as is the processor circuitry. But in the 8086, a CISC (Complex Instruction Set Computer) processor, there are corner cases and complications everywhere. For instance, an 8086 machine instruction sometimes specifies the ALU operation in the first byte and sometimes in the second byte, and sometimes elsewhere, so the X register latch, the XI multiplexer, and the Group Decode ROM are needed. The 8086's ALU includes obscure operations including four types of BCD adjustments and seven types of shifts, making the ALU more complicated. Of course, the continuing success of x86 shows that this complexity also has benefits.

This article has been a deep dive into the details of the 8086's ALU, but I hope you have found it interesting. If it's too much detail for you, you might prefer my overview of the 8086 ALU.

For updates, follow me on Bluesky (@righto.com), Mastodon (@[email protected]), or RSS.

Credits: Thanks to Marcin Peczarski for discussion. My microcode analysis is based on Andrew Jenner's 8086 microcode disassembly.

Notes and references

The operations implemented by the ALU are:

00	ADD	Add
01	OR	Logical OR
02	ADC	Add with carry in
03	SBB	Subtract with borrow in
04	AND	Logical AND
05	SUBT	Subtract
06	XOR	Logical XOR
07	CMP	Comparison
08	ROL	Rotate left
09	ROR	Rotate right
0a	LRCY	Left rotate through carry
0b	RRCY	Right rotate through carry
0c	SHL	Shift left
0d	SHR	Shift right
0e	SETMO	Set to minus one (questionable)
0f	SAR	Arithmetic shift right
10	PASS	Pass argument unchanged
11	XI	Instruction specifies ALU op
14	DAA	Decimal adjust after addition
15	DAS	Decimal adjust after subtraction
16	AAA	ASCII adjust after addition
17	AAS	ASCII adjust after subtraction
18	INC	Increment
19	DEC	Decrement
1a	COM1	1's complement
1b	NEG	Negate
1c	INC2	Increment by 2
1d	DEC2	Decrement by 2

Also see Andrew Jenner's code. ↩

You might wonder how this microcode handles the 8086's complicated addressing modes such as [BX+DI]. The trick is that microcode subroutines implement the addressing modes. For details, see my article on 8086 addressing microcode. ↩
The 8086's ALU has a separate circuit to implement shift-right. The problem is that data in an ALU normally flows right-to-left as carries flow from lower bits to higher bits. Shifting data to the right goes against this direction, so it requires a special path. (Shifting to the left is straightforward; you can add a number to itself.)

The adjust operations (DAA, DAS, AAA, AAS) also use completely separate circuitry. These operations generate correction factors for BCD (binary-coded decimal) arithmetic based on the value and flags. The circuitry for these operations is located with the flags circuitry, separate from the rest of the ALU circuitry. ↩
In more detail, the 8086 stores bits 5-3 of the machine instruction in the "X" register. For an XI operation, the X register bits become bits 2-0 of the ALU operation specification, while bit 3 comes from bit 6 of the instruction, and bit 4 comes from the Group Decode ROM for certain instructions. The point of this is that the instruction set is designed so bits of the instruction correspond to bits of the ALU operation specifier, but the mapping is more complicated than you might expect. The eight basic arithmetic/logic operations (ADD, SUB, OR, etc) have a straightforward mapping that is visible from the 8086 opcode table, but the mapping for other instructions isn't as obvious. Moreover, sometimes the operation is specified in the first byte of the machine instruction, but sometimes it is specified in the second byte, which is why the X register needs to store the relevant bits. ↩
The flip-flops are reset by a signal in the 8086, called "Second Clock". When a new machine instruction is started, the "First Clock" signal is generated on the instruction's first byte and the "Second Clock" signal is generated on the instruction's second byte. (Note that these signals are not necessarily on consecutive clock cycles, because a memory fetch may be required if the instruction queue is empty.) Why are the flip-flops reset on Second Clock and not First Clock? The 8086 has a small degree of pipelining, so the previous micro-instruction may still be finishing up during First Clock of the next instruction. By Second Clock, it is safe to reset the ALU state. ↩
For reference, the 27 outputs from the PLA are triggered by the following ALU micro-operations:

Output 0: RRCY (right rotate through carry)
Output 1: ROR (Rotate Right)
Output 2: BCD Adjustments: DAA (Decimal Adjust after Addition), DAS (Decimal Adjust after Subtraction), AAA (ASCII Adjust after Subtraction), or AAS (ASCII Adjust after Subtraction)
Output 3: SAR (Shift Arithmetic Right)
Output 4: Left shift: ROL (Rotate Left), RCL (Rotate through Carry Left), SHL (Shift Left), or SETMO (Set Minus One)
Output 5: Right shift: ROR (Rotate Right), RCR (Rotate through Carry Right), SHR (Shift Right), or SAR (Shift Arithmetic Right)
Output 6: INC2 (increment by 2)
Output 7: ROL (Rotate Left)
Output 8: RCL (Rotate through Carry Left)
Output 9: ADC (add with carry)
Output 10: DEC2 (decrement by 2)
Output 11: INC (increment)
Output 12: NEG (negate)
Output 13: ALU operation 12 (unused?)
Output 14: SUB (Subtract), CMP (Compare), DAS (Decimal Adjust after Subtraction), AAS (ASCII Adjust after Subtraction)
Output 15: SBB (Subtract with Borrow)
Output 16: ROL (Rotate Left) or RCL (Rotate through Carry Left)
Output 17: ADD or ADC (Add with Carry)
Output 18: DEC or DEC2 (Decrement by 1 or 2)
Output 19: PASS (pass-through) or INC (Increment)
Output 20: COM1 (1's Complement) or NEG (Negate)
Output 21: XOR
Output 22: OR
Output 23: AND
Output 24: SHL (Shift Left)
Output 25: DAA or AAA (Decimal/ASCII Adjust after Addition)
Output 26: CMP (Compare) ↩
A Programmable Logic Array is a way of implementing logic gates in a structured grid. PLAs are often used in microprocessors because they provide a dense way of implementing logic. A PLA normally consists of two layers: an "OR" layer and an "AND" layer. Together, the layers produce "sum-of-products" outputs, consisting of multiple terms OR'd together. The ALU's PLA is a bit unusual because many outputs are taken directly from the OR layer, while only about 15 outputs from the first layer are fed into the second layer. ↩
The control signals pass through the driver circuit below. The operation of this circuit puzzled me for years, since the transistor with its gate at +5V seems to be stuck on. But I was looking at the book DRAM Circuit Design and spotted the same circuit, called the "Bootstrap Wordline Driver". The purpose of this circuit is to boost the output to a higher voltage than a regular NMOS circuit, providing better performance. The problem with NMOS circuitry is that NMOS transistors aren't very good at pulling a signal high: due to the properties of the transistor, the output voltage is less than the gate voltage, lower by the threshold voltage V_TH, half a volt or more.

The drive signals to the ALU gates are generated with this dynamic circuit.

The bootstrap circuit takes advantage of capacitance to get more voltage out of the circuit. Specifically, suppose the input is +5V, while the clock is high. Point A will be about 4.5V, losing half a volt due to the threshold. Now, suppose the clock goes low, so the inverted clock driving the upper transistor goes high. Due to capacitance in the second transistor, as the source and drain go high, the gate will be pulled above its previous voltage, maybe gaining a couple of volts. The high voltage on the gate produces a full-voltage output, avoiding the drop due to V_TH. But why the transistor with its gate at +5V? This transistor acts somewhat like a diode, preventing the boosted voltage from flowing backward through the input and dissipating.

The bootstrap circuit is used on the ALU's lookup table control signals for two reasons. First, these control signals drive pass transistors. A pass transistor suffers from a voltage drop due to the threshold voltage, so you want to start with a control signal with as high a voltage as possible. Second, each control signal is connected to 16 transistors (one for each bit). This is a large number of transistors to drive from one signal, since each transistor has gate capacitance. Increasing the voltage helps overcome the R-C (resistor-capacitor) delay, improving performance.

A close-up of the bootstrap drive circuits, in the left half of the LUT PLA.

The diagram above shows six bootstrap drivers on the die. At the left are the transistors that ground the signals when clock is high. The +5V transistors are scattered around the image; two of them are labeled. The six large transistors provide the output signal, controlled by clock'. Note that these transistors are much larger than the other transistors because they must produce the high-current output, while the other transistors have more of a supporting role.

(Bootstrap circuits go way back; Federico Faggin designed a bootstrap circuit for the Intel 8008 that he claimed "proved essential to the microprocessor realization.") ↩

Conditions in the Intel 8087 floating-point chip's microcode

In the 1980s, if you wanted your computer to do floating-point calculations faster, you could buy the Intel 8087 floating-point coprocessor chip. Plugging it into your IBM PC would make operations up to 100 times faster, a big boost for spreadsheets and other number-crunching applications. The 8087 uses complicated algorithms to compute trigonometric, logarithmic, and exponential functions. These algorithms are implemented inside the chip in microcode. I'm part of a group that is reverse-engineering this microcode. In this post, I examine the 49 types of conditional tests that the 8087's microcode uses inside its algorithms. Some conditions are simple, such as checking if a number is zero or negative, while others are specialized, such as determining what direction to round a number.

To explore the 8087's circuitry, I opened up an 8087 chip and took numerous photos of the silicon die with a microscope. Around the edges of the die, you can see the hair-thin bond wires that connect the chip to its 40 external pins. The complex patterns on the die are formed by its metal wiring, as well as the polysilicon and silicon underneath. The bottom half of the chip is the "datapath", the circuitry that performs calculations on 80-bit floating point values. At the left of the datapath, a constant ROM holds important constants such as π. At the right are the eight registers that the programmer uses to hold floating-point values; in an unusual design decision, these registers are arranged as a stack.

Die of the Intel 8087 floating point unit chip, with main functional blocks labeled. The die is 5mm×6mm. Click for a larger image.

The chip's instructions are defined by the large microcode ROM in the middle. To execute a floating-point instruction, the 8087 decodes the instruction and the microcode engine starts executing the appropriate micro-instructions from the microcode ROM. The microcode decode circuitry to the right of the ROM generates the appropriate control signals from each micro-instruction.1 The bus registers and control circuitry handle interactions with the main 8086 processor and the rest of the system.

The 8087's microcode

Executing an 8087 instruction such as arctan requires hundreds of internal steps to compute the result. These steps are implemented in microcode with micro-instructions specifying each step of the algorithm. (Keep in mind the difference between the assembly language instructions used by a programmer and the undocumented low-level micro-instructions used internally by the chip.) The microcode ROM holds 1648 micro-instructions, implementing the 8087's instruction set. Each micro-instruction is 16 bits long and performs a simple operation such as moving data inside the chip, adding two values, or shifting data. I'm working with the "Opcode Collective" to reverse engineer the micro-instructions and fully understand the microcode (link).

The microcode engine (below) controls the execution of micro-instructions, acting as the mini-CPU inside the 8087. Specifically, it generates an 11-bit micro-address, the address of a micro-instruction in the ROM. The microcode engine implements jumps, subroutine calls, and returns within the microcode. These jumps, subroutine calls, and returns are all conditional; the microcode engine will either perform the operation or skip it, depending on the value of a specified condition.

The microcode engine. In this image, the metal is removed, showing the underlying silicon and polysilicon.

I'll write more about the microcode engine later, but I'll give an overview here. At the top, the Instruction Decode PLA2 decodes an 8087 instruction to determine the starting address in microcode. Below that, the Jump PLA holds microcode addresses for jumps and subroutine calls. Below this, six 11-bit registers implement the microcode stack, allowing six levels of subroutine calls inside the microcode. (Note that this stack is completely different from the 8087's register stack that holds eight floating-point values.) The stack registers have associated read/write circuitry. The incrementer adds one to the micro-address to step through the code. The engine also implements relative jumps, using an adder to add an offset to the current location. At the bottom, the address latch and drivers boost the 11-bit address output and send it to the microcode ROM.

Selecting a condition

A micro-instruction can say "jump ahead 5 micro-instructions if a register is zero" and the microcode engine will either perform the jump or ignore it, based on the register value. In the circuitry, the condition causes the microcode engine to either perform the jump or block the jump. But how does the hardware select one condition out of the large set of conditions?

Six bits of the micro-instruction can specify one of 64 conditions. A circuit similar to the idealized diagram below selects the specified condition. The key component is a multiplexer, represented by a trapezoid below. A multiplexer is a simple circuit that selects one of its four inputs. By arranging multiplexers in a tree, one of the 64 conditions on the left is selected and becomes the output, passed to the microcode engine.

A tree of multiplexers selects one of the conditions. This diagram is simplified.

For example, if bits J and K of the microcode are 00, the rightmost multiplexer will select the first input. If bits LM are 01, the middle multiplexer will select the second input, and if bits NO are 10, the left multiplexer will select its third input. The result is that condition 06 will pass through the tree and become the output.3 By changing the bits that control the multiplexers, any of the inputs can be used. (We've arbitrarily given the 16 microcode bits the letter names A through P.)

Physically, the conditions come from locations scattered across the die. For instance, conditions involving the opcode come from the instruction decoding part of the chip, while conditions involving a register are evaluated next to the register. It would be inefficient to run 64 wires for all the conditions to the microcode engine. The tree-based approach reduces the wiring since the "leaf" multiplexers can be located near the associated condition circuitry. Thus, only one wire needs to travel a long distance rather than multiple wires. In other words, the condition selection circuitry is distributed across the chip instead of being implemented as a centralized module.

Because the conditions don't always fall into groups of four, the actual implementation is slightly different from the idealized diagram above. In particular, the top-level multiplexer has five inputs, rather than four.4 Other multiplexers don't use all four inputs. This provides a better match between the physical locations of the condition circuits and the multiplexers. In total, 49 of the possible 64 conditions are implemented in the 8087.

The circuit that selects one of the four conditions is called a multiplexer. It is constructed from pass transistors, transistors that are configured to either pass a signal through or block it. To operate the multiplexer, one of the select lines is energized, turning on the corresponding pass transistor. This allows the selected input to pass through the transistor to the output, while the other inputs are blocked.

A 4-1 multiplexer, constructed from four pass transistors.

The diagram below shows how a multiplexer appears on the die. The pinkish regions are doped silicon. The white lines are polysilicon wires. When polysilicon crosses over doped silicon, a transistor is formed. On the left is a four-way multiplexer, constructed from four pass transistors. It takes inputs (black) for four conditions, numbered 38, 39, 3a, and 3b. There are four control signals (red) corresponding to the four combinations of bits N and O. One of the inputs will pass through a transistor to the output, selected by the active control signal. The right half contains the logic (four NOR gates and two inverters) to generate the control signals from the microcode bits. (Metal lines run horizontally from the logic to the control signal contacts, but I dissolved the metal for this photo.) Each multiplexer in the 8087 has a completely different layout, manually optimized based on the location of the signals and surrounding circuitry. Although the circuit for a multiplexer is regular (four transistors in parallel), the physical layout looks somewhat chaotic.

Multiplexers as they appear on the die. The metal layer has been removed to show the polysilicon and silicon. The "tie-die" patterns are due to thin-film effects where the oxide layer wasn't completely removed.

The 8087 uses pass transistors for many circuits, not just multiplexers. Circuits with pass transistors are different from regular logic gates because the pass transistors provide no amplification. Instead, signals get weaker as they go through pass transistors. To solve this problem, inverters or buffers are inserted into the condition tree to boost signals; they are omitted from the diagram above.

The conditions

Of the 8087's 49 different conditions, some are widely used in the microcode, while others are designed for a specific purpose and are only used once. The full set of conditions is described in a footnote7 but I'll give some highlights here.

Fifteen conditions examine the bits of the current instruction's opcode. This allows one microcode routine to handle a group of similar instructions and then change behavior based on the specific instruction. For example, conditions test if the instruction is multiplication, if the instruction is an FILD/FIST (integer load or store), or if the bottom bit of the opcode is set.5

The 8087 has three temporary registers—tmpA, tmpB, and tmpC—that hold values during computation. Various conditions examine the values in the tmpA and tmpB registers.6 In particular, the 8087 uses an interesting way to store numbers internally: each 80-bit floating-point value also has two "tag" bits. These bits are mostly invisible to the programmer and can be thought of as metadata. The tag bits indicate if a register is empty, contains zero, contains a "normal" number, or contains a special value such as NaN (Not a Number) or infinity. The 8087 uses the tag bits to optimize operations. The tags also detect stack overflow (storing to a non-empty stack register) or stack underflow (reading from an empty stack register).

Other conditions are highly specialized. For instance, one condition looks at the rounding mode setting and the sign of the value to determine if the value should be rounded up or down. Other conditions deal with exceptions such as numbers that are too small (i.e. denormalized) or numbers that lose precision. Another condition tests if two values have the same sign or not. Yet another condition tests if two values have the same sign or not, but inverts the result if the current instruction is subtraction. The simplest condition is simply "true", allowing an unconditional branch.

For flexibility, conditions can be "flipped", either jumping if the condition is true or jumping if the condition is false. This is controlled by bit P of the microcode. In the circuitry, this is implemented by a gate that XORs the P bit with the condition. The result is that the state of the condition is flipped if bit P is set.

For a concrete example of how conditions are used, consider the microcode routine that implements FCHS and FABS, the instructions to change the sign and compute the absolute value, respectively. These operations are almost the same (toggling the sign bit versus clearing the sign bit), so the same microcode routine handles both instructions, with a jump instruction to handle the difference. The FABS and FCHS instructions were designed with identical opcodes, except that the bottom bit is set for FABS. Thus, the microcode routine uses a condition that tests the bottom bit, allowing the routine to branch and change its behavior for FABS vs FCHS.

Looking at the relevant micro-instruction, it has the hex value 0xc094, or in binary 110 000001 001010 0. The first three bits (ABC=110) specify the relative jump operation (100 would jump to a fixed target and 101 would perform a subroutine call.) Bits D through I (000010) indicate the amount of the jump (+`). Bits J through O (001010, hex 0a) specify the condition to test, in this case, the last bit of the instruction opcode. The final bit (P) would toggle the condition if set, (i.e. jump if false). Thus, for FABS, the jump instruction will jump ahead one micro-instruction. This has the effect of skipping the next micro-instruction, which sets the appropriate sign bit for FCHS.

Conclusions

The 8087 performs floating-point operations much faster than the 8086 by using special hardware, optimized for floating-point. The condition code circuitry is one example of this: the 8087 can test a complicated condition in a single operation. However, these complicated conditions make it much harder to understand the microcode. But by a combination of examining the circuitry and looking at the micocode, we're making progress. Thanks to the members of the "Opcode Collective" for their hard work, especially Smartest Blob and Gloriouscow.

For updates, follow me on Bluesky (@righto.com), Mastodon (@[email protected]), or RSS.

Notes and references

The section of the die that I've labeled "Microcode decode" performs some of the microcode decoding, but large parts of the decoding are scattered across the chip, close to the circuitry that needs the signals. This makes reverse-engineering the microcode much more difficult. I thought that understanding the microcode would be straightforward, just examining a block of decode circuitry. But this project turned out to be much more complicated and I need to reverse-engineer the entire chip. ↩
A PLA is a "Programmable Logic Array". It is a technique to implement logic functions with grids of transistors. A PLA can be used as a compressed ROM, holding data in a more compact representation. (Saving space was very important in chips of this era.) In the 8087, PLAs are used to hold tables of microcode addresses. ↩
Note that the multiplexer circuit selects the condition corresponding to the binary value of the bits. In the example, bits 000110 (0x06) select condition 06. ↩
The five top-level multiplexer inputs correspond to bit patterns 00, 011, 10, 110, and 111. That is, two inputs depend on bits J and K, while three inputs depend on bits J, K, and L. The bit pattern 010 is unused, corresponding to conditions 0x10 through 0x17, which aren't implemented. ↩
The 8087 acts as a co-processor with the 8086 processor. The 8086 instruction set is designed so instructions with a special "ESCAPE" sequence in the top 5 bits are processed by the co-processor, in this case the 8087. Thus, the 8087 receives a 16-bit instruction, but only the bottom 11 bits are usable. For a memory operation, the second byte of the instruction is an 8086-style ModR/M byte. For instructions that don't access memory, the second byte specifies more of the instruction and sometimes specifies the stack register to use for the instruction.

The relevance of this is that the 8087's microcode engine uses the 11 bits of the instruction to determine which microcode routine to execute. The microcode also uses various condition codes to change behavior depending on different bits of the instruction. ↩
There is a complication with the tmpA and tmpB registers: they can be swapped with the micro-instruction "ABC.EF". The motivation behind this is that if you have two arguments, you can use a micro-subroutine to load an argument into tmpA, swap the registers, and then use the same subroutine to load the second argument into tmpA. The result is that the two arguments end up in tmpB and tmpA without any special coding in the subroutine.

The implementation doesn't physically swap the registers, but renames them internally, which is much more efficient. A flip-flop is toggled every time the registers are swapped. If the flip-flop is set, a request goes to one register, while if the flip-flop is clear, a request goes to the other register. (Many processors use the same trick. For instance, the Intel 8080 has an instruction to exchange the DE and HL registers. The Z80 has an instruction to swap register banks. In both cases, a flip-flop renames the registers, so the data doesn't need to move.) ↩

The table below is the real meat of this post, the result of much circuit analysis. These details probably aren't interesting to most people, so I've relegated the table to a footnote. Descriptions in italics are provided by Smartest Blob based on examination of the microcode. Grayed-out lines are unused conditions.

The table has five sections, corresponding to the 5 inputs to the top-level condition multiplexer. These inputs come from different parts of the chip, so the sections correspond to different categories of conditions.

The first section consists of instruction parsing, with circuitry near the microcode engine. The description shows the 11-bit opcode pattern that triggers the condition, with 0 bits and 1 bits as specified, and X indicating a "don't care" bit that can be 0 or 1. Where simpler, I list the relevant instructions instead.

The next section indicates conditions on the exponent. I am still investigating these conditions, so the descriptions are incomplete. The third section is conditions on the temporary registers or conditions related to the control register. These circuits are to the right of the microcode ROM.

Conditions in the fourth section examine the floating-point bus, with circuitry near the bottom of the chip. Conditions 34 and 35 use a special 16-bit bidirectional shift register, at the far right of the chip. The top bit from the floating-point bus is shifted in. Maybe this shift register is used for CORDIC calculations? The conditions in the final block are miscellaneous, including the always-true condition 3e, which is used for unconditional jumps.

Cond.	Description
00	not XXX 11XXXXXX
01	1XX 11XXXXXX
02	0XX 11XXXXXX
03	X0X XXXXXXXX
04	not cond 07 or 1XX XXXXXXXX
05	not FLD/FSTP temp-real or BCD
06	110 xxxxxxxx or 111 xx0xxxxx
07	FLD/FSTP temp-real
08	FBLD/FBSTP
09
0a	XXX XXXXXXX1
0b	XXX XXXX1XXX
0c	FMUL
0d	FDIV FDIVR
0e	FADD FCOM FCOMP FCOMPP FDIV FDIVR FFREE FLD FMUL FST FSTP FSUB FSUBR FXCH
0f	FCOM FCOMP FCOMPP FTST
10
11
12
13
14
15
16
17
18	exponent condition
19	exponent condition
1a	exponent condition
1b	exponent condition
1c	exponent condition
1d	exponent condition
1e	eight exponent zero bits
1f	exponent condition
20	tmpA tag ZERO
21	tmpA tag SPECIAL
22	tmpA tag VALID
23	stack overflow
24	tmpB tag ZERO
25	tmpB tag SPECIAL
26	tmpB tag VALID
27	st(i) doesn't exist (A)?
28	tmpA sign
29	tmpB top bit
2a	tmpA zero
2b	tmpA top bit
2c	Control Reg bit 12: infinity control
2d	round up/down
2e	unmasked interrupt
2f	DE (denormalized) interrupt
30	top reg bit
31
32	reg bit 64
33	reg bit 63
34	Shifted top bits, all zero
35	Shifted top bits, one out
36
37
38	const latch zero
39	tmpA vs tmpB sign, flipped for subtraction
3a	precision exception
3b	tmpA vs tmpB sign
3c
3d
3e	unconditional
3f

This table is under development and undoubtedly has errors. ↩

The stack circuitry of the Intel 8087 floating point chip, reverse-engineered

Early microprocessors were very slow when operating with floating-point numbers. But in 1980, Intel introduced the 8087 floating-point coprocessor, performing floating-point operations up to 100 times faster. This was a huge benefit for IBM PC applications such as AutoCAD, spreadsheets, and flight simulators. The 8087 was so effective that today's computers still use a floating-point system based on the 8087.1

The 8087 was an extremely complex chip for its time, containing somewhere between 40,000 and 75,000 transistors, depending on the source.2 To explore how the 8087 works, I opened up a chip and took numerous photos of the silicon die with a microscope. Around the edges of the die, you can see the hair-thin bond wires that connect the chip to its 40 external pins. The complex patterns on the die are formed by its metal wiring, as well as the polysilicon and silicon underneath. The bottom half of the chip is the "datapath", the circuitry that performs calculations on 80-bit floating point values. At the left of the datapath, a constant ROM holds important constants such as π. At the right are the eight registers that form the stack, along with the stack control circuitry.

Die of the Intel 8087 floating point unit chip, with main functional blocks labeled. The die is 5mm×6mm. Click for a larger image.

The chip's instructions are defined by the large microcode ROM in the middle. This ROM is very unusual; it is semi-analog, storing two bits per transistor by using four transistor sizes. To execute a floating-point instruction, the 8087 decodes the instruction and the microcode engine starts executing the appropriate micro-instructions from the microcode ROM. The decode circuitry to the right of the ROM generates the appropriate control signals from each micro-instruction. The bus registers and control circuitry handle interactions with the main 8086 processor and the rest of the system. Finally, the bias generator uses a charge pump to create a negative voltage to bias the chip's substrate, the underlying silicon.

The stack registers and control circuitry (in red above) are the subject of this blog post. Unlike most processors, the 8087 organizes its registers in a stack, with instructions operating on the top of the stack. For instance, the square root instruction replaces the value on the top of the stack with its square root. You can also access a register relative to the top of the stack, for instance, adding the top value to the value two positions down from the top. The stack-based architecture was intended to improve the instruction set, simplify compiler design, and make function calls more efficient, although it didn't work as well as hoped.

The stack on the 8087. From The 8087 Primer, page 60.

The diagram above shows how the stack operates. The stack consists of eight registers, with the Stack Top (ST) indicating the current top of the stack. To push a floating-point value onto the stack, the Stack Top is decremented and then the value is stored in the new top register. A pop is performed by copying the value from the stack top and then incrementing the Stack Top. In comparison, most processors specify registers directly, so register 2 is always the same register.

The registers

The stack registers occupy a substantial area on the die of the 8087 because floating-point numbers take many bits. A floating-point number consists of a fractional part (sometimes called the mantissa or significand), along with the exponent part; the exponent allows floating-point numbers to cover a range from extremely small to extremely large. In the 8087, floating-point numbers are 80 bits: 64 bits of significand, 15 bits of exponent, and a sign bit. An 80-bit register was very large in the era of 8-bit or 16-bit computers; the eight registers in the 8087 would be equivalent to 40 registers in the 8086 processor.

The registers in the 8087 form an 8×80 grid of cells. The close-up shows an 8×8 block. I removed the metal layer with acid to reveal the underlying silicon circuitry.

The registers store each bit in a static RAM cell. Each cell has two inverters connected in a loop. This circuit forms a stable feedback loop, with one inverter on and one inverter off. Depending on which inverter is on, the circuit stores a 0 or a 1. To write a new value into the circuit, one of the lines is pulled low, flipping the loop into the desired state. The trick is that each inverter uses a very weak transistor to pull the output high, so its output is easily overpowered to change the state.

Two inverters in a loop can store a 0 or a 1.

These inverter pairs are arranged in an 8 × 80 grid that implements eight words of 80 bits. Each of the 80 rows has two bitlines that provide access to a bit. The bitlines provide both read and write access to a bit; the pair of bitlines allows either inverter to be pulled low to store the desired bit value. Eight vertical wordlines enable access to one word, one column of 80 bits. Each wordline turns on 160 pass transistors, connecting the bitlines to the inverters in the selected column. Thus, when a wordline is enabled, the bitlines can be used to read or write that word.

Although the chip looks two-dimensional, it actually consists of multiple layers. The bottom layer is silicon. The pinkish regions below are where the silicon has been "doped" to change its electrical properties, making it an active part of the circuit. The doped silicon forms a grid of horizontal and vertical wiring, with larger doped regions in the middle. On top of the silicon, polysilicon wiring provides two functions. First, it provides a layer of wiring to connect the circuit. But more importantly, when polysilicon crosses doped silicon, it forms a transistor. The polysilicon provides the gate, turning the transistor on and off. In this photo, the polysilicon is barely visible, so I've highlighted part of it in red. Finally, horizontal metal wires provide a third layer of interconnecting wiring. Normally, the metal hides the underlying circuitry, so I removed the metal with acid for this photo. I've drawn blue lines to represent the metal layer. Contacts provide connections between the various layers.

A close-up of a storage cell in the registers. The metal layer and most of the polysilicon have been removed to show the underlying silicon.

The layers combine to form the inverters and selection transistors of a memory cell, indicated with the dotted line below. There are six transistors (yellow), where polysilicon crosses doped silicon. Each inverter has a transistor that pulls the output low and a weak transistor to pull the output high. When the word line (vertical polysilicon) is active, it connects the selected inverters to the bit lines (horizontal metal) through the two selection transistors. This allows the bit to be read or written.

The function of the circuitry in a storage cell.

Each register has two tag bits associated with it, an unusual form of metadata to indicate if the register is empty, contains zero, contains a valid value, or contains a special value such as infinity. The tag bits are used to optimize performance internally and are mostly irrelevant to the programmer. As well as being accessed with a register, the tag bits can be accessed in parallel as a 16-bit "Tag Word". This allows the tags to be saved or loaded as part of the 8087's state, for instance, during interrupt handling.

The decoder

The decoder circuit, wedged into the middle of the register file, selects one of the registers. A register is specified internally with a 3-bit value. The decoder circuit energizes one of the eight register select lines based on this value.

The decoder circuitry is straightforward: it has eight 3-input NOR gates to match one of the eight bit patterns. The select line is then powered through a high-current driver that uses large transistors. (In the photo below, you can compare the large serpentine driver transistors to the small transistors in a bit cell.)

The decoder circuitry has eight similar blocks to drive the eight select lines.

The decoder has an interesting electrical optimization. As shown earlier, the register select lines are eight polysilicon lines running vertically, the length of the register file. Unfortunately, polysilicon has fairly high resistance, better than silicon but much worse than metal. The problem is that the resistance of a long polysilicon line will slow down the system. That is, the capacitance of transistor gates in combination with high resistance causes an RC (resistive-capacitive) delay in the signal.

The solution is that the register select lines also run in the metal layer, a second set of lines immediately to the right of the register file. These lines branch off from the register file about 1/3 of the way down, run to the bottom, and then connect back to the polysilicon select lines at the bottom. This reduces the maximum resistance through a select line, increasing the speed.

A diagram showing how 8 metal lines run parallel to the main select lines. The register file is much taller than shown; the middle has been removed to make the diagram fit.

The stack control circuitry

A stack needs more control circuitry than a regular register file, since the circuitry must keep track of the position of the top of the stack.3 The control circuitry increments and decrements the top of stack (TOS) pointer as values are pushed or popped (purple).4 Moreover, an 8087 instruction can access a register based on its offset, for instance the third register from the top. To support this, the control circuitry can temporarily add an offset to the top of stack position (green). A multiplexer (red) selects either the top of stack or the adder output, and feeds it to the decoder (blue), which selects one of the eight stack registers in the register file (yellow), as described earlier.

The register stack in the 8087. Adapted from Patent USRE33629E. I don't know what the GRX field is. I also don't know why this shows a subtractor and not an adder.

The physical implementation of the stack circuitry is shown below. The logic at the top selects the stack operation based on the 16-bit micro-instruction.5 Below that are the three latches that hold the top of stack value. (The large white squares look important, but they are simply "jumpers" from the ground line to the circuitry, passing under metal wires.)

The stack control circuitry. The blue regions on the right are oxide residue that remained when I dissolved the metal rail for the 5V power.

The three-bit adder is at the bottom, along with the multiplexer. You might expect the adder to use a simple "full adder" circuit. Instead, it is a faster carry-lookahead adder. I won't go into details here, but the summary is that at each bit position, an AND gate produces a Carry Generate signal while an XOR gate produces a Carry Propagate signal. Logic gates combine these signals to produce the output bits in parallel, avoiding the slowdown of the carry rippling through the bits.

The incrementer/decrementer uses a completely different approach. Each of the three bits uses a toggle flip-flop. A few logic gates determine if each bit should be toggled or should keep its previous value. For instance, when incrementing, the top bit is toggled if the lower bits are 11 (e.g. incrementing from 011 to 100). For decrementing, the top bit is toggled if the lower bits are 00 (e.g. 100 to 011). Simpler logic determines if the middle bit should be toggled. The bottom bit is easier, toggling every time whether incrementing or decrementing.

The schematic below shows the circuitry for one bit of the stack. Each bit is implemented with a moderately complicated flip-flop that can be cleared, loaded with a value, or toggled, based on control signals from the microcode. The flip-flop is constructed from two set-reset (SR) latches. Note that the flip-flop outputs are crossed when fed back to the input, providing the inversion for the toggle action. At the right, the multiplexer selects either the register value or the sum from the adder (not shown), generating the signals to the decoder.

Schematic of one bit of the stack.

Drawbacks of the stack approach

According to the designers of the 8087,7 the main motivation for using a stack rather than a flat register set was that instructions didn't have enough bits to address multiple register operands. In addition, a stack has "advantages over general registers for expression parsing and nested function calls." That is, a stack works well for a mathematical expression since sub-expressions can be evaluated on the top of the stack. And for function calls, you avoid the cost of saving registers to memory, since the subroutine can use the stack without disturbing the values underneath. At least that was the idea.

The main problem is "stack overflow". The 8087's stack has eight entries, so if you push a ninth value onto the stack, the stack will overflow. Specifically, the top-of-stack pointer will wrap around, obliterating the bottom value on the stack. The 8087 is designed to detect a stack overflow using the register tags: pushing a value to a non-empty register triggers an invalid operation exception.6

The designers expected that stack overflow would be rare and could be handled by the operating system (or library code). After detecting a stack overflow, the software should dump the existing stack to memory to provide the illusion of an infinite stack. Unfortunately, bad design decisions made it difficult "both technically and commercially" to handle stack overflow.

One of the 8087's designers (Kahan) attributes the 8087's stack problems to the time difference between California, where the designers lived, and Israel, where the 8087 was implemented. Due to a lack of communication, each team thought the other was implementing the overflow software. It wasn't until the 8087 was in production that they realized that "it might not be possible to handle 8087 stack underflow/overflow in a reasonable way. It's not impossible, just impossible to do it in a reasonable way."

As a result, the stack was largely a problem rather than a solution. Most 8087 software saved the full stack to memory before performing a function call, creating more memory traffic. Moreover, compilers turned out to work better with regular registers than a stack, so compiler writers awkwardly used the stack to emulate regular registers. The GCC compiler reportedly needs 3000 lines of extra code to support the x87 stack.

In the 1990s, Intel introduced a new floating-point system called SSE, followed by AVX in 2011. These systems use regular (non-stack) registers and provide parallel operations for higher performance, making the 8087's stack instructions largely obsolete.

The success of the 8087

At the start, Intel was unenthusiastic about producing the 8087, viewing it as unlikely to be a success. John Palmar, a principal architect of the chip, had little success convincing skeptical Intel management that the market for the 8087 was enormous. Eventually, he said, "I'll tell you what. I'll relinquish my salary, provided you'll write down your number of how many you expect to sell, then give me a dollar for every one you sell beyond that."7 Intel didn't agree to the deal—which would have made a fortune for Palmer—but they reluctantly agreed to produce the chip.

Intel's Santa Clara engineers shunned the 8087, considering it unlikely to work: the 8087 would be two to three times more complex than the 8086, with a die so large that a wafer might not have a single working die. Instead, Rafi Nave, at Intel's Israel site, took on the risky project: “Listen, everybody knows it's not going to work, so if it won't work, I would just fulfill their expectations or their assessment. If, by chance, it works, okay, then we'll gain tremendous respect and tremendous breakthrough on our abilities.”

A small team of seven engineers developed the 8087 in Israel. They designed the chip on Mylar sheets: a millimeter on Mylar represented a micron on the physical chip. The drawings were then digitized on a Calma system by clicking on each polygon to create the layout. When the chip was moved into production, the yield was very low but better than feared: two working dies per four-inch wafer.

The 8087 ended up being a large success, said to have been Intel's most profitable product line at times. The success of the 8087 (along with the 8088) cemented the reputation of Intel Israel, which eventually became Israel's largest tech employer. The benefits of floating-point hardware proved to be so great that Intel integrated the floating-point unit into later processors starting with the 80486 (1989). Nowadays, most modern computers, from cellphones to mainframes, provide floating point based on the 8087, so I consider the 8087 one of the most influential chips ever created.

For more, follow me on Bluesky (@righto.com), Mastodon (@[email protected]), or RSS. I wrote some articles about the 8087 a few years ago, including the die, the ROM, the bit shifter, and the constants, so you may have seen some of this material before.

Notes and references

Most computers now use the IEEE 754 floating-point standard, which is based on the 8087. This standard has been awarded a milestone in computation. ↩
Curiously, reliable sources differ on the number of transistors in the 8087 by almost a factor of 2. Intel says 40,000, as does designer William Kahan (link). But in A Numeric Data Processor, designers Rafi Nave and John Palmer wrote that the chip contains "the equivalent of over 65,000 devices" (whatever "equivalent" means). This number is echoed by a contemporary article in Electronics (1980) that says "over 65,000 H-MOS transistors on a 78,000-mil² die." Many other sources, such as Upgrading & Repairing PCs, specify 45,000 transistors. Designer Rafi Nave stated that the 8087 has 63,000 or 64,000 transistors if you count the ROM transistors directly, but if you count ROM transistors as equivalent to two transistors, then you get about 75,000 transistors. ↩
The 8087 has a 16-bit Status Word that contains the stack top pointer, exception flags, the four-bit condition code, and other values. Although the Status Word appears to be a 16-bit register, it is not implemented as a register. Instead, parts of the Status Word are stored in various places around the chip: the stack top pointer is in the stack circuitry, the exception flags are part of the interrupt circuitry, the condition code bits are next to the datapath, and so on. When the Status Word is read or written, these various circuits are connected to the 8087's internal data bus, making the Status Word appear to be a monolithic entity. Thus, the stack circuitry includes support for reading and writing it. ↩
Intel filed several patents on the 8087, including Numeric data processor, another Numeric data processor, Programmable bidirectional shifter, Fraction bus for use in a numeric data processor, and System bus arbitration, circuitry and methodology. ↩
I started looking at the stack in detail to reverse engineer the micro-instruction format and determine how the 8087's microcode works. I'm working with the "Opcode Collective" on Discord on this project, but progress is slow due to the complexity of the micro-instructions. ↩
The 8087 detects stack underflow in a similar manner. If you pop more values from the stack than are present, the tag will indicate that the register is empty and shouldn't be accessed. This triggers an invalid operation exception. ↩
The 8087 is described in detail in The 8086 Family User's Manual, Numerics Supplement. An overview of the stack is on page 60 of The 8087 Primer by Palmer and Morse. More details are in Kahan's On the Advantages of the 8087's Stack, an unpublished course note (maybe for CS 279?) with a date of Nov 2, 1990 or perhaps August 23, 1994. Kahan discusses why the 8087's design makes it hard to handle stack overflow in How important is numerical accuracy, Dr. Dobbs, Nov. 1997. Another information source is the Oral History of Rafi Nave ↩↩

Unusual circuits in the Intel 386's standard cell logic

I've been studying the standard cell circuitry in the Intel 386 processor recently. The 386, introduced in 1985, was Intel's most complex processor at the time, containing 285,000 transistors. Intel's existing design techniques couldn't handle this complexity and the chip began to fall behind schedule. To meet the schedule, the 386 team started using a technique called standard cell logic. Instead of laying out each transistor manually, the layout process was performed by a computer.

The idea behind standard cell logic is to create standardized circuits (standard cells) for each type of logic element, such as an inverter, NAND gate, or latch. You feed your circuit description into software that selects the necessary cells, positions these cells into columns, and then routes the wiring between the cells. This "automatic place and route" process creates the chip layout much faster than manual layout. However, switching to standard cells was a risky decision since if the software couldn't create a dense enough layout, the chip couldn't be manufactured. But in the end, the 386 finished ahead of schedule, an almost unheard-of accomplishment.1

The 386's standard cell circuitry contains a few circuits that I didn't expect. In this blog post, I'll take a quick look at some of these circuits: surprisingly large multiplexers, a transistor that doesn't fit into the standard cell layout, and inverters that turned out not to be inverters. (If you want more background on standard cells in the 386, see my earlier post, "Reverse engineering standard cell logic in the Intel 386 processor".)

The photo below shows the 386 die with the automatic-place-and-route regions highlighted; I'm focusing on the red region in the lower right. These blocks of logic have cells arranged in rows, giving them a characteristic striped appearance. The dark stripes are the transistors that make up the logic gates, while the lighter regions between the stripes are the "routing channels" that hold the wiring that connects the cells. In comparison, functional blocks such as the datapath on the left and the microcode ROM in the lower right were designed manually to optimize density and performance, giving them a more solid appearance.

The 386 die with the standard-cell regions highlighted.

As for other features on the chip, the black circles around the border are bond wire connections that go to the chip's external pins. The chip has two metal layers, a small number by modern standards, but a jump from the single metal layer of earlier processors such as the 286. (Providing two layers of metal made automated routing practical: one layer can hold horizontal wires while the other layer can hold vertical wires.) The metal appears white in larger areas, but purplish where circuitry underneath roughens its surface. The underlying silicon and the polysilicon wiring are obscured by the metal layers.

The giant multiplexers

The standard cell circuitry that I'm examining (red box above) is part of the control logic that selects registers while executing an instruction. You might think that it is easy to select which registers take part in an instruction, but due to the complexity of the x86 architecture, it is more difficult. One problem is that a 32-bit register such as EAX can also be treated as the 16-bit register AX, or two 8-bit registers AH and AL. A second problem is that some instructions include a "direction" bit that switches the source and destination registers. Moreover, sometimes the register is specified by bits in the instruction, but in other cases, the register is specified by the microcode. Due to these factors, selecting the registers for an operation is a complicated process with many cases, using control bits from the instruction, from the microcode, and from other sources.

Three registers need to be selected for an operation—two source registers and a destination register—and there are about 17 cases that need to be handled. Registers are specified with 7-bit control signals that select one of the 30 registers and control which part of the register is accessed. With three control signals, each 7 bits wide, and about 17 cases for each, you can see that the register control logic is large and complicated. (I wrote more about the 386's registers here.)

I'm still reverse engineering the register control logic, so I won't go into details. Instead, I'll discuss how the register control circuit uses multiplexers, implemented with standard cells. A multiplexer is a circuit that combines multiple input signals into a single output by selecting one of the inputs.2 A multiplexer can be implemented with logic gates, for instance, by ANDing each input with the corresponding control line, and then ORing the results together. However, the 386 uses a different approach—CMOS switches—that avoids a large AND/OR gate.

Schematic of a CMOS switch.

The schematic above shows how a CMOS switch is constructed from two MOS transistors. When the two transistors are on, the output is connected to the input, but when the two transistors are off, the output is isolated. An NMOS transistor is turned on when its input is high, but a PMOS transistor is turned on when its input is low. Thus, the switch uses two control inputs, one inverted. The motivation for using two transistors is that an NMOS transistor is better at pulling the output low, while a PMOS transistor is better at pulling the output high, so combining them yields the best performance.3 Unlike a logic gate, the CMOS switch has no amplification, so a signal is weakened as it passes through the switch. As will be seen below, inverters can be used to amplify the signal.

The image below shows how CMOS switches appear under the microscope. This image is very hard to interpret because the two layers of metal on the 386 are packed together densely, but you can see that some wires run horizontally and others run vertically. The bottom layer of metal (called M1) runs vertically in the routing area, as well as providing internal wiring for a cell. The top layer of metal (M2) runs horizontally; unlike M1, the M2 wires can cross a cell. The large circles are vias that connect the M1 and M2 layers, while the small circles are connections between M1 and polysilicon or M1 and silicon. The central third of the image is a column of standard cells with two CMOS switches outlined in green. The cells are bordered by the vertical ground rail and +5V rail that power the cells. The routing areas are on either side of the cells, holding the wiring that connects the cells.

Two CMOS switches, highlighted in green. The lower switch is flipped vertically compared to the upper switch.

Removing the metal layers reveals the underlying silicon with a layer of polysilicon wiring on top. The doped silicon regions show up as dark outlines. I've drawn the polysilicon in green; it forms a transistor (brighter green) when it crosses doped silicon. The metal ground and power lines are shown in blue and red, respectively, with other metal wiring in purple. The black dots are vias between layers. Note how metal wiring (purple) and polysilicon wiring (green) are combined to route signals within the cell. Although this standard cell is complicated, the important thing is that it only needs to be designed once. The standard cells for different functions are all designed to have the same width, so the cells can be arranged in columns, snapped together like Lego bricks.

A diagram showing the silicon for a standard-cell switch. The polysilicon is shown in green. The bottom metal is shown in blue, red, and purple.

To summarize, this switch circuit allows the input to be connected to the output or disconnected, controlled by the select signal. This switch is more complicated than the earlier schematic because it includes two inverters to amplify the signal. The data input and the two select lines are connected to the polysilicon (green); the cell is designed so these connections can be made on either side. At the top, the input goes through a standard two-transistor inverter. The lower left has two transistors, combining the NMOS half of an inverter with the NMOS half of the switch. A similar circuit on the right combines the PMOS part of an inverter and switch. However, because PMOS transistors are weaker, this part of the circuit is duplicated.

A multiplexer is constructed by combining multiple switches, one for each input. Turning on one switch will select the corresponding input. For instance, a four-to-one multiplexer has four switches, so it can select one of the four inputs.

A four-way multiplexer constructed from CMOS switches and individual transistors.

The schematic above shows a hypothetical multiplexer with four inputs. One optimization is that if an input is always 0, the PMOS transistor can be omitted. Likewise, if an input is always 1, the NMOS transistor can be omitted. One set of select lines is activated at a time to select the corresponding input. The pink circuit selects 1, green selects input A, yellow selects input B, and blue selects 0. The multiplexers in the 386 are similar, but have more inputs.

The diagram below shows how much circuitry is devoted to multiplexers in this block of standard cells. The green, purple, and red cells correspond to the multiplexers driving the three register control outputs. The yellow cells are inverters that generate the inverted control signals for the CMOS switches. This diagram also shows how the automatic layout of cells results in a layout that appears random.

A block of standard-cell logic with multiplexers highlighted. The metal and polysilicon layers were removed for this photo, revealing the silicon transistors.

The misplaced transistor

The idea of standard-cell logic is that standardized cells are arranged in columns. The space between the cells is the "routing channel", holding the wiring that links the cells. The 386 circuitry follows this layout, except for one single transistor, sitting between two columns of cells.

The "misplaced" transistor, indicated by the arrow. The irregular green regions are oxide that was incompletely removed.

I wrote some software tools to help me analyze the standard cells. Unfortunately, my tools assumed that all the cells were in columns, so this one wayward transistor caused me considerable inconvenience.

The transistor turns out to be a PMOS transistor, pulling a signal high as part of a multiplexer. But why is this transistor out of place? My hypothesis is that the transistor is a bug fix. Regenerating the cell layout was very costly, taking many hours on an IBM mainframe computer. Presumably, someone found that they could just stick the necessary transistor into an unused spot in the routing channel, manually add the necessary wiring, and avoid the delay of regenerating all the cells.

The fake inverter

The simplest CMOS gate is the inverter, with an NMOS transistor to pull the output low and a PMOS transistor to pull the output high. The standard cell circuitry that I examined contains over a hundred inverters of various sizes. (Performance is improved by using inverters that aren't too small but also aren't larger than necessary for a particular circuit. Thus, the standard cell library includes inverters of multiple sizes.)

The image below shows a medium-sized standard-cell inverter under the microscope. For this image, I removed the two metal layers with acid to show the underlying polysilicon (bright green) and silicon (gray). The quality of this image is poor—it is difficult to remove the metal without destroying the polysilicon—but the diagram below should clarify the circuit. The inverter has two transistors: a PMOS transistor connected to +5 volts to pull the output high when the input is 0, and an NMOS transistor connected to ground to pull the output low when the input is 1. (The PMOS transistor needs to be larger because PMOS transistors don't function as well as NMOS transistors due to silicon physics.)

An inverter as seen on the die. The corresponding standard cell is shown below.

The polysilicon input line plays a key role: where it crosses the doped silicon, a transistor gate is formed. To make the standard cell more flexible, the input to the inverter can be connected on either the left or the right; in this case, the input is connected on the right and there is no connection on the left. The inverter's output can be taken from the polysilicon on the upper left or the right, but in this case, it is taken from the upper metal layer (not shown). The power, ground, and output lines are in the lower metal layer, which I have represented by the thin red, blue, and yellow lines. The black circles are connections between the metal layer and the underlying silicon.

This inverter appears dozens of times in the circuitry. However, I came across a few inverters that didn't make sense. The problem was that the inverter's output was connected to the output of a multiplexer. Since an inverter is either on or off, its value would clobber the output of the multiplexer.4 This didn't make any sense. I double- and triple-checked the wiring to make sure I hadn't messed up. After more investigation, I found another problem: the input to a "bad" inverter didn't make sense either. The input consisted of two signals shorted together, which doesn't work.

Finally, I realized what was going on. A "bad inverter" has the exact silicon layout of an inverter, but it wasn't an inverter: it was independent NMOS and PMOS transistors with separate inputs. Now it all made sense. With two inputs, the input signals were independent, not shorted together. And since the transistors were controlled separately, the NMOS transistor could pull the output low in some circumstances, the PMOS transistor could pull the output high in other circumstances, or both transistors could be off, allowing the multiplexer's output to be used undisturbed. In other words, the "inverter" was just two more cases for the multiplexer.

The "bad" inverter. (Image is flipped vertically for comparison with the previous inverter.)

If you compare the "bad inverter" cell below with the previous cell, they look almost the same, but there are subtle differences. First, the gates of the two transistors are connected in the real inverter, but disconnected by a small gap in the transistor pair. I've indicated this gap in the photo above; it is hard to tell if the gap is real or just an imaging artifact, so I didn't spot it. The second difference is that the "fake" inverter has two input connections, one to each transistor, while the inverter has a single input connection. Unfortunately, I assumed that the two connections were just a trick to route the signal across the inverter without requiring an extra wire. In total, this cell was used 32 times as a real inverter and 9 times as independent transistors.

Conclusions

Standard cell logic and automatic place and route have a long history before the 386, back to the early 1970s, so this isn't an Intel invention.5 Nonetheless, the 386 team deserves the credit for deciding to use this technology at a time when it was a risky decision. They needed to develop custom software for their placing and routing needs, so this wasn't a trivial undertaking. This choice paid off and they completed the 386 ahead of schedule. The 386 ended up being a huge success for Intel, moving the x86 architecture to 32 bits and defining the dominant computer architecture for the rest of the 20th century.

If you're interested in standard cell logic, I also wrote about standard cell logic in an IBM chip. I plan to write more about the 386, so follow me on Mastodon, Bluesky, or RSS for updates. Thanks to Pat Gelsinger and Roxanne Koester for providing helpful papers.

For more on the 386 and other chips, follow me on Mastodon (@[email protected]), Bluesky (@righto.com), or RSS. (I've given up on Twitter.) If you want to read more about the 386, I've written about the clock pin, prefetch queue, die versions, packaging, and I/O circuits.

Notes and references

The decision to use automatic place and route is described on page 13 of the Intel 386 Microprocessor Design and Development Oral History Panel, a very interesting document on the 386 with discussion from some of the people involved in its development. ↩
Multiplexers often take a binary control signal to select the desired input. For instance, an 8-to-1 multiplexer selects one of 8 inputs, so a 3-bit control signal can specify the desired input. The 386's multiplexers use a different approach with one control signal per input. One of the 8 control signals is activated to select the desired input. This approach is called a "one-hot encoding" since one control line is activated (hot) at a time. ↩
Some chips, such as the MOS Technology 6502 processor, are built with NMOS technology, without PMOS transistors. Multiplexers in the 6502 use a single NMOS transistor, rather than the two transistors in the CMOS switch. However, the performance of the switch is worse. ↩
One very common circuit in the 386 is a latch constructed from an inverter loop and a switch/multiplexer. The inverter's output and the switch's output are connected together. The trick, however, is that the inverter is constructed from special weak transistors. When the switch is disabled, the inverter's weak output is sufficient to drive the loop. But to write a value into the latch, the switch is enabled and its output overpowers the weak inverter.

The point of this is that there are circuits where an inverter and a multiplexer have their outputs connected. However, the inverter must be constructed with special weak transistors, which is not the situation that I'm discussing. ↩
I'll provide more history on standard cells in this footnote. RCA patented a bipolar standard cell in 1971, but this was a fixed arrangement of transistors and resistors, more of a gate array than a modern standard cell. Bell Labs researched standard cell layout techniques in the early 1970s, calling them Polycells, including a 1973 paper by Brian Kernighan. By 1979, A Guide to LSI Implementation discussed the standard cell approach and it was described as well-known in this patent application. Even so, Electronics called these design methods "futuristic" in 1980.

Standard cells became popular in the mid-1980s as faster computers and improved design software made it practical to produce semi-custom designs that used standard cells. Standard cells made it to the cover of Digital Design in August 1985, and the article inside described numerous vendors and products. Companies like Zymos and VLSI Technology (VTI) focused on standard cells. Traditional companies such as Texas Instruments, NCR, GE/RCA, Fairchild, Harris, ITT, and Thomson introduced lines of standard cell products in the mid-1980s. ↩

Notes on the Intel 8086 processor's arithmetic-logic unit

Microcode

The ALU circuit

ALU control circuitry on the die

Conclusions

Notes and references

Conditions in the Intel 8087 floating-point chip's microcode

The 8087's microcode

Selecting a condition

The conditions

Conclusions

Notes and references

The stack circuitry of the Intel 8087 floating point chip, reverse-engineered

The registers

The decoder

The stack control circuitry

Drawbacks of the stack approach

The success of the 8087

Notes and references

Unusual circuits in the Intel 386's standard cell logic

The giant multiplexers

The misplaced transistor

The fake inverter

Conclusions

Notes and references

Don't miss a post!