Zotac GeForce GTX-480 Fermi Video Card
Written by Olin Coles
Tuesday, 11 May 2010
NVIDIA GF100 GPU Fermi Architecture
NVIDIA's latest GPU is codenamed GF100, and is the first graphics processor based on the Fermi architecture. In this article, Benchmark Reviews explains the technical architecture behind NVIDIA's GF100 graphics processor and offers insight into upcoming Fermi-based GeForce video cards. For those who are not familiar, GF100 is NVIDIA's first graphics processor to support DirectX-11 hardware features such as tessellation and DirectCompute, while also adding heavy particle and turbulence effects. The GF100 GPU is also the successor to the GT200 graphics processor, which launched in the GeForce GTX 280 video card back in June 2008. NVIDIA has since redefined its focus, and GF100 demonstrates a dedication to next-generation gaming effects such as raytracing, order-independent transparency, and fluid simulations. Rest assured, the new GF100 GPU is more powerful than the GT200 could ever be, and early results indicate a Fermi-based video card delivers far more than twice the gaming performance of a GeForce GTX-280.
GF100 is not another incremental GPU step-up like we had going from G80 to GT200. While processor cores grew from 128 (G80) to 240 (GT200), they now reach 512 and earn the title of NVIDIA CUDA (Compute Unified Device Architecture) cores. The key here is not only the name, but that the name now implies an emphasis on something more than just graphics. Each Fermi CUDA processor core has a fully pipelined integer arithmetic logic unit (ALU) and floating point unit (FPU). GF100 implements the new IEEE 754-2008 floating-point standard, providing the fused multiply-add (FMA) instruction for both single and double precision arithmetic. FMA improves over a multiply-add (MAD) instruction by performing the multiplication and addition with a single final rounding step, so the product is not rounded before the addition and no precision is lost there. In graphics, FMA helps minimize rendering errors in closely overlapping triangles.
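To make the rounding difference concrete, here is a minimal sketch in Python. It is not GPU code: Python floats are IEEE 754 double precision, and the helper names `mad` and `fma` are illustrative only. The fused version is emulated with exact rational arithmetic so that only one rounding happens at the end, which is the behavior the FMA instruction is described as providing.

```python
from fractions import Fraction


def mad(a: float, b: float, c: float) -> float:
    """Unfused multiply-add: the product is rounded, then the sum is rounded."""
    return a * b + c


def fma(a: float, b: float, c: float) -> float:
    """Emulated fused multiply-add: exact product and sum, one final rounding."""
    exact = Fraction(a) * Fraction(b) + Fraction(c)
    return float(exact)  # the single rounding step happens here


# Inputs chosen so the exact product (1 - 2**-60) is not representable
# as a double and rounds to 1.0 before the addition in the unfused path.
a = 1.0 + 2.0**-30
b = 1.0 - 2.0**-30
c = -1.0

print(mad(a, b, c))  # 0.0: the small term was lost when the product was rounded
print(fma(a, b, c))  # -8.673617379884035e-19, i.e. exactly -2**-60
```

The inputs are contrived to expose the effect; in typical arithmetic the two results differ by at most a unit in the last place, but such differences can matter when comparing nearly equal values, as with closely overlapping triangles.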
Based on Fermi's third-generation Streaming Multiprocessor (SM) architecture, GF100 more than doubles the number of CUDA cores over the previous architecture (512 versus 240 in GT200). NVIDIA GeForce GF100 Fermi GPUs are based on a scalable array of Graphics Processing Clusters (GPCs), Streaming Multiprocessors (SMs), and memory controllers. The full GF100 implements four GPCs, sixteen SMs, and six memory controllers. Expect NVIDIA to launch GF100 products with different configurations of GPCs, SMs, and memory controllers to address different price points.
CPU commands are read by the GPU via the Host Interface. The GigaThread Engine fetches the specified data from system memory and copies it to the frame buffer. GF100 implements six 64-bit GDDR5 memory controllers (384-bit total) to facilitate high bandwidth access to the frame buffer. The GigaThread Engine then creates and dispatches thread blocks to various SMs. Individual SMs in turn schedule warps (groups of 32 threads) to CUDA cores and other execution units. The GigaThread Engine also redistributes work to the SMs when work expansion occurs in the graphics pipeline, such as after the tessellation and rasterization stages.
GF100 implements 512 CUDA cores, organized as 16 SMs of 32 cores each. Each SM is a highly parallel multiprocessor supporting up to 48 warps at any given time. Each CUDA core is a unified processor core that executes vertex, pixel, geometry, and compute kernels. A unified L2 cache architecture services load, store, and texture operations. GF100 has 48 ROP units for pixel blending, antialiasing, and atomic memory operations. The ROP units are organized in six groups of eight. Each group is serviced by a 64-bit memory controller. The memory controller, L2 cache, and ROP group are closely coupled: scaling one unit automatically scales the others.
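The figures quoted above follow from a handful of per-unit counts, and the arithmetic can be sketched as below. The class `FermiConfig` and its field names are hypothetical, introduced only for illustration; the default values come from the article, while the cut-down configuration at the end is an invented example of how a scaled part might be described, not an announced product.

```python
from dataclasses import dataclass

WARP_SIZE = 32  # threads per warp, as stated in the article


@dataclass(frozen=True)
class FermiConfig:
    """Hypothetical description of a GF100-style chip configuration."""
    sms: int = 16                    # streaming multiprocessors
    cores_per_sm: int = 32           # CUDA cores per SM
    max_warps_per_sm: int = 48       # warps an SM can hold at once
    memory_controllers: int = 6      # each paired with one ROP group
    controller_width_bits: int = 64  # width of one GDDR5 controller
    rops_per_group: int = 8          # ROP units per group

    @property
    def cuda_cores(self) -> int:
        return self.sms * self.cores_per_sm

    @property
    def memory_bus_bits(self) -> int:
        return self.memory_controllers * self.controller_width_bits

    @property
    def rop_units(self) -> int:
        # ROP groups scale with memory controllers (closely coupled units).
        return self.memory_controllers * self.rops_per_group

    @property
    def max_resident_threads(self) -> int:
        return self.sms * self.max_warps_per_sm * WARP_SIZE


full = FermiConfig()
print(full.cuda_cores, full.memory_bus_bits, full.rop_units)  # 512 384 48
print(full.max_resident_threads)                              # 24576

# Illustrative cut-down part (assumed values, not taken from the article).
cut = FermiConfig(sms=14, memory_controllers=5)
print(cut.cuda_cores, cut.memory_bus_bits, cut.rop_units)     # 448 320 40
```

Because the ROP count is derived from the memory controller count, removing a controller in this sketch also removes its ROP group, mirroring the coupling described above.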
NVIDIA GigaThread Thread Scheduler
One of the most important technologies of the Fermi architecture is its two-level, distributed thread scheduler. At the chip level, a global work distribution engine schedules thread blocks to various SMs, while at the SM level, each warp scheduler distributes warps of 32 threads to its execution units. The first-generation GigaThread engine introduced in G80 managed up to 12,288 threads in real time. The Fermi architecture improves on this foundation by providing not only greater thread throughput, but also dramatically faster context switching, concurrent kernel execution, and improved thread block scheduling.
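The two levels of scheduling can be pictured with a toy model. This is a sketch under stated assumptions rather than a description of the hardware: the round-robin placement of thread blocks is an assumption made for simplicity (the article does not say which policy the GigaThread engine uses), and the function names are hypothetical.

```python
WARP_SIZE = 32  # threads per warp, as stated in the article
NUM_SMS = 16    # SMs in a full GF100


def distribute_blocks(num_blocks: int, num_sms: int = NUM_SMS) -> dict:
    """Chip level: assign thread-block ids to SMs (round-robin is an assumption)."""
    assignment = {sm: [] for sm in range(num_sms)}
    for block in range(num_blocks):
        assignment[block % num_sms].append(block)
    return assignment


def split_into_warps(threads_per_block: int, warp_size: int = WARP_SIZE) -> list:
    """SM level: partition a block's thread ids into warps of up to 32 threads."""
    return [
        range(start, min(start + warp_size, threads_per_block))
        for start in range(0, threads_per_block, warp_size)
    ]


blocks = distribute_blocks(40)
warps = split_into_warps(100)

print(len(blocks[0]), len(blocks[15]))  # 3 2
print([len(w) for w in warps])          # [32, 32, 32, 4]
```

Note that a block of 100 threads still occupies four warps, with the last one only partly filled, which is why thread-block sizes that are multiples of 32 use the execution units most evenly.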
What's new in Fermi?
With any new technology, consumers want to know what's new in the product. The goal of this article is to share in-depth information surrounding the Fermi architecture, as well as the new functionality unlocked in GF100. For clarity, the 'GF' letters used in the GF100 GPU name are not an abbreviation for 'GeForce'; they actually denote that this GPU is a Graphics solution based on the Fermi architecture. The next generation of NVIDIA GeForce-series desktop video cards will use the GF100 to promote the following new features:
Benchmark Reviews also offers more detail in our full-length NVIDIA GF100 GPU Fermi Graphics Architecture guide.