AMD FX-8150 Bulldozer Processor
Written by David Ramsey
Wednesday, 12 October 2011
Bulldozer CPU Architecture
Technically, the term "Bulldozer" refers to the two-core module that AMD uses as a building block for its new processors. Desktop CPUs built with Bulldozer modules are code-named "Zambezi", while server processors built with Bulldozer modules are code-named "Valencia" for CPUs designed for single- and dual-processor systems, and "Interlagos" for CPUs designed for two- and four-processor systems. Here's a block diagram of a Bulldozer module:
Each module comprises two integer execution units and one floating point execution unit. All three units share the instruction fetch and decoders, as well as the L2 cache, but each integer core has its own instruction scheduler and L1 data cache. Each module can handle two concurrent threads (one on each integer unit), so AMD considers a single Bulldozer module to be "dual core." But is this an accurate description? Intel supplies this labeled die image for Sandy Bridge processors:
Note that there are four areas labeled "Core". For comparison, here's a labeled die image for a quad-module (8 cores) AMD Bulldozer CPU:
Here, there are four areas labeled "Bulldozer module", and as we saw above, each module has two cores.
But what does a "core" consist of? In Sandy Bridge, each core contains three Arithmetic Logic Units (ALUs) and two Address Generation Units (AGUs). AMD's Phenom II CPU has three ALUs and three AGUs per core, while a Bulldozer "integer core" has two ALUs and two AGUs, for a total of four ALUs and four AGUs per module. (The floating point execution unit is a separate entity. Since floating point instructions comprise only a small percentage of most code, the single FPU in a Bulldozer module is shared.)
An Arithmetic Logic Unit does the actual work of executing an instruction, be it a conditional, bit rotate, add, or other integer operation. The Address Generation Unit handles address generation; explaining that in detail would involve getting deep into x86 addressing architecture, which is beyond the scope of this article. Suffice it to say that the AGU is needed for everything from figuring out the real address of a branch destination, to deciding where to put the results of a given calculation, to translating between virtual and physical memory addresses.
Keeping these units "fed" are the instruction schedulers, which decide how to dispatch instructions that have been fetched from memory and decoded. A Phenom II core can issue 3 ALU or AGU instructions per clock; Sandy Bridge can issue 3 ALU/1 AGU or 2 ALU/2 AGU instructions per clock (four total), while Bulldozer can issue 2 ALU/2 AGU per clock (also for a total of four). Note that while the Phenom II core has the most ALUs/AGUs (6 total), it can issue the fewest instructions per clock (three, as opposed to four for Sandy Bridge and Bulldozer).
There are other differences: each Sandy Bridge core has its own floating point unit, while a Bulldozer module has two integer cores and only a single floating point unit. So the eight-core FX-8150 has four floating point units, just like the four-core Sandy Bridge. Since the vast majority of program instructions will be handled by the integer ALUs, this shouldn't affect real-world performance, but our benchmarks will show if this is the case.
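To keep the numbers above straight, here is a minimal sketch that tabulates them. The class and function names (`CoreResources`, `summarize`, `fpu_count`) are hypothetical helpers made up for illustration; the figures themselves are only the ones quoted in this article, not independent measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreResources:
    """Per-core integer resources, using the figures quoted in the article."""
    name: str
    alus: int
    agus: int
    issue_per_clock: int

    @property
    def execution_units(self) -> int:
        # Total ALUs plus AGUs available to one core.
        return self.alus + self.agus


CORES = [
    CoreResources("Phenom II core", alus=3, agus=3, issue_per_clock=3),
    CoreResources("Sandy Bridge core", alus=3, agus=2, issue_per_clock=4),
    CoreResources("Bulldozer integer core", alus=2, agus=2, issue_per_clock=4),
]


def summarize(cores):
    """Map each core name to (execution units, instructions issued per clock)."""
    return {c.name: (c.execution_units, c.issue_per_clock) for c in cores}


def fpu_count(cores: int, cores_per_fpu: int) -> int:
    """Number of floating point units when `cores_per_fpu` cores share one FPU."""
    return cores // cores_per_fpu


if __name__ == "__main__":
    for name, (units, issue) in summarize(CORES).items():
        print(f"{name}: {units} ALUs+AGUs, {issue} issued per clock")
    # FX-8150: eight cores, two per module sharing one FPU.
    print("FX-8150 FPUs:", fpu_count(8, cores_per_fpu=2))
    # Four-core Sandy Bridge: one FPU per core.
    print("Sandy Bridge FPUs:", fpu_count(4, cores_per_fpu=1))
```

The point of the table is the mismatch it exposes: the core with the most execution units (Phenom II) has the narrowest issue width, and both four-FPU totals come out the same despite the different core counts.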
Complicating comparisons further are things like decode queues, instruction pipelines, branch prediction, instruction retirement, cache management, and more. Modern processors are hideously complex; long gone are the days when a programmer could figure out exactly how long a segment of code would take to execute by simply adding up the number of clock cycles required for each instruction and figuring the time based on the CPU clock frequency! (Yes, we really used to do that.) Bulldozer adds further optimizations: for example, if only one integer core is being used, it has access to all the resources on the module, such as the cache: there's nothing "reserved" for the idle core.
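For readers who never had the pleasure, the old cycle-counting method looks roughly like this. It is a sketch only: the cycle counts and the 1 MHz clock are hypothetical values chosen for the arithmetic, not figures for any real processor, and the method does not apply to modern out-of-order CPUs.

```python
def execution_time_seconds(cycle_counts, clock_hz):
    """Old-school estimate: total cycles divided by clock frequency.

    Only valid for simple in-order CPUs where every instruction takes a
    fixed, documented number of cycles.
    """
    total_cycles = sum(cycle_counts)
    return total_cycles / clock_hz


# Hypothetical code segment: four instructions costing 2, 3, 4 and 7 cycles,
# running on a hypothetical 1 MHz processor.
elapsed = execution_time_seconds([2, 3, 4, 7], clock_hz=1_000_000)
print(f"{elapsed * 1e6:.1f} microseconds")  # 16 cycles at 1 MHz
```

On a processor with pipelines, caches, and branch prediction, the per-instruction cost is no longer a constant, which is why this arithmetic stopped working.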
AMD has also finally caught up with some of Intel's new features, like AES-NI for ultra-fast encryption and decryption, Advanced Vector Extensions (AVX), and a 32nm fabrication process (which will hopefully help with overclocking). AMD also has some new instructions of its own: the FMA (floating point multiply-accumulate) and XOP instructions. The former allows fast multiply-and-add sequences, which turn out to be useful in video transcoding, among other things. The XOP instructions are an extension to SSE5 that AMD announced back in 2009 and consist of new integer vector instructions, such as integer vector multiply-accumulate and integer vector compare; the Bulldozer CPUs are the first implementation of XOP.
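If you want to check whether a given machine exposes these instruction sets, one approach on Linux is to look at the `flags` line of `/proc/cpuinfo`. The sketch below assumes the Linux flag names `aes`, `avx`, `fma4` (the four-operand form of FMA that Bulldozer implements), and `xop`; the helper names and the sample text are hypothetical, and other operating systems report features differently.

```python
# Assumed mapping from feature name to the flag Linux prints in /proc/cpuinfo.
FEATURE_FLAGS = {"AES-NI": "aes", "AVX": "avx", "FMA4": "fma4", "XOP": "xop"}


def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supported_features(cpuinfo_text):
    """Map each feature name to True/False depending on the reported flags."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: flag in flags for name, flag in FEATURE_FLAGS.items()}


# Hypothetical sample, shaped like a /proc/cpuinfo excerpt.
SAMPLE_CPUINFO = (
    "model name\t: hypothetical CPU\n"
    "flags\t\t: fpu sse sse2 aes avx fma4 xop\n"
)

if __name__ == "__main__":
    print(supported_features(SAMPLE_CPUINFO))
    # On a real Linux system, pass in the file contents instead:
    # with open("/proc/cpuinfo") as f:
    #     print(supported_features(f.read()))
```

Parsing text passed in as a string, rather than opening the file inside the function, keeps the helper usable on systems where `/proc/cpuinfo` does not exist.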
All of these features sound really cool when AMD's technical and marketing folks are giving the presentations. But unless you're a CPU architect (and I'm not), it's impossible to judge their real-world impact on paper. And while I'm convinced some of AMD's new internals are very advanced, Intel's made some enormous strides with Sandy Bridge, especially in the crucial "instructions per clock" metric. Simply put, this means that at the same frequency as previous generation Intel CPUs, Sandy Bridge is significantly faster. And of course many Intel processors support Hyper-Threading, which doubles the number of available cores (as far as software is concerned), so that a four-core processor appears as an eight-core processor. AMD claims that eight real cores should provide better performance than four real and four virtual cores; while this makes sense, AMD is positioning the FX-8150, the top of the Zambezi desktop processor lineup, against the non-Hyper-Threading Core i5-2500K rather than the Hyper-Threading capable Core i7-2600K.
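The "cores as far as software is concerned" arithmetic can be sketched as follows. The function name `logical_processors` is a hypothetical helper; the core and thread counts are the ones implied by the text (four cores with or without Hyper-Threading for the Intel parts, eight cores for the FX-8150).

```python
def logical_processors(physical_cores, threads_per_core=1):
    """Number of processors the operating system sees."""
    return physical_cores * threads_per_core


# Counts as described in the article.
CHIPS = {
    "Core i7-2600K": logical_processors(4, threads_per_core=2),  # Hyper-Threading
    "Core i5-2500K": logical_processors(4),                      # no Hyper-Threading
    "FX-8150": logical_processors(8),                            # four two-core modules
}

if __name__ == "__main__":
    for chip, count in CHIPS.items():
        print(f"{chip}: {count} logical processors")
```

On a live system, Python's `os.cpu_count()` reports this logical count, so it cannot by itself distinguish eight real cores from four Hyper-Threaded ones.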
Well, all this makes for interesting arguments in bars, but what really matters is the performance. So let's get on with the testing.