A brief overview of IBM’s new 7 nm Telum mainframe CPU

September 3, 2021 at 3:01 am | Technology

Each Telum package consists of two 7nm, eight-core / sixteen-thread processors running at a base clock speed above 5GHz. A typical system will have sixteen of these chips in total, arranged in four-socket “drawers.”

From the perspective of a traditional x86 computing enthusiast—or professional—mainframes are strange, archaic beasts. They’re physically enormous, power-hungry, and expensive by comparison to more traditional data-center gear, generally offering less compute per rack at a higher cost.

This raises the question, “Why keep using mainframes, then?” Once you hand-wave away the cynical answers that boil down to “because that’s how we’ve always done it,” the practical answers largely come down to reliability and consistency. As AnandTech’s Ian Cutress points out in a speculative piece focused on the Telum’s redesigned cache, “downtime of these [IBM Z] systems is measured in milliseconds per year.” (If true, that’s at least seven nines of availability: seven nines allows roughly 3.15 seconds of downtime per year.)
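The “seven nines” arithmetic can be checked with a short Python sketch. The function names here are illustrative and not part of any IBM tooling, and a 365-day year is assumed.

```python
import math

# Milliseconds in a 365-day year (leap years ignored for simplicity).
MS_PER_YEAR = 365 * 24 * 60 * 60 * 1000


def nines(downtime_ms_per_year: float) -> int:
    """Count the whole 'nines' of availability for a given yearly downtime."""
    unavailability = downtime_ms_per_year / MS_PER_YEAR
    return math.floor(-math.log10(unavailability))


def downtime_budget_ms(n: int) -> float:
    """Yearly downtime, in milliseconds, allowed by n nines of availability."""
    return MS_PER_YEAR * 10.0 ** (-n)


if __name__ == "__main__":
    # Any downtime still expressed in milliseconds (under one second per year)
    # stays inside the seven-nines budget of about 3,154 ms.
    print(nines(999), downtime_budget_ms(7))
```

Under these assumptions, even 999 ms of downtime per year clears seven nines, which is why “at least” is the right qualifier.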

IBM’s own announcement of the Telum hints at just how different mainframe and commodity computing’s priorities are. It casually describes Telum’s memory interface as “capable of tolerating complete channel or DIMM failures, and designed to transparently recover data without impact to response time.”

When you pull a DIMM from a live, running x86 server, that server does not “transparently recover data”—it simply crashes.

IBM Z-series architecture

Telum is designed to be something of a one-chip-to-rule-them-all for mainframes, replacing a much more heterogeneous setup in earlier IBM mainframes.

The 14 nm IBM z15 CPU which Telum is replacing features five total processors—two pairs of 12-core Compute Processors and one System Controller. Each Compute Processor hosts 256MiB of L3 cache shared between its 12 cores, while the System Controller hosts a whopping 960MiB of L4 cache shared between the four Compute Processors.

Five of these z15 processors, each consisting of four Compute Processors and one System Controller, constitute a “drawer.” Four drawers come together in a single z15-powered mainframe.

Although the concept of multiple processors to a drawer and multiple drawers to a system remains, the architecture inside Telum itself is radically different—and considerably simplified.

Telum architecture

Telum is somewhat simpler at first glance than z15 was—it’s an eight-core processor built on Samsung’s 7nm process, with two processors combined on each package (similar to AMD’s chiplet approach for Ryzen). There is no separate System Controller processor—all of Telum’s processors are identical.

From here, four Telum CPU packages combine to make one four-socket “drawer,” and four of those drawers go into a single mainframe system. This provides 256 total cores on 32 CPUs (two processors per package, four packages per drawer, four drawers). Each core runs at a base clock rate over 5 GHz, which provides more predictable and consistent latency for real-time transactions than a lower base with a higher turbo rate would.
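The core and processor totals follow directly from the counts given in the article. A minimal Python sketch of that arithmetic is below; the `TelumSystem` class and its field names are made up for illustration, and the two-threads-per-core figure is taken from the “eight-core / sixteen-thread” description in the image caption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TelumSystem:
    """Hypothetical container for the topology counts quoted in the article."""
    cores_per_processor: int = 8
    threads_per_core: int = 2
    processors_per_package: int = 2
    packages_per_drawer: int = 4
    drawers_per_system: int = 4

    @property
    def packages(self) -> int:
        return self.packages_per_drawer * self.drawers_per_system

    @property
    def processors(self) -> int:
        return self.packages * self.processors_per_package

    @property
    def cores(self) -> int:
        return self.processors * self.cores_per_processor

    @property
    def threads(self) -> int:
        return self.cores * self.threads_per_core


if __name__ == "__main__":
    system = TelumSystem()
    print(system.packages, system.processors, system.cores, system.threads)
```

With the defaults above, the sketch gives 16 packages, 32 processors, 256 cores, and 512 hardware threads.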

Pockets full of cache

Doing away with the central System Controller meant redesigning Telum’s cache as well. The enormous 960MiB L4 cache is gone, and so is the per-die shared L3 cache. In Telum, each individual core has a private 32MiB L2 cache, and that’s it. There is no hardware L3 or L4 cache at all.

This is where things get deeply weird: while each Telum core’s 32MiB L2 cache is technically private, it’s really only virtually private. When a line from one core’s L2 cache is evicted, the processor looks for empty space in the other cores’ L2. If it finds some, the evicted L2 cache line from core x is tagged as an L3 cache line and stored in core y’s L2.

OK, so we have a virtual, shared up-to-256MiB L3 cache on each Telum processor, composed of the 32MiB “private” L2 cache on each of its eight cores. From here, things go one step further—that 256MiB of shared “virtual L3” on each processor can, in turn, be used as shared “virtual L4” among all processors in a system.
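The capacity figures are simple multiplication, sketched in Python below. The function names are illustrative. How many processors actually take part in the “virtual L4” sharing is not spelled out here, so the sketch takes that count as a parameter rather than assuming one.

```python
L2_MIB_PER_CORE = 32      # private L2 per core, from the article
CORES_PER_PROCESSOR = 8   # cores per Telum processor, from the article


def virtual_l3_mib(cores: int = CORES_PER_PROCESSOR,
                   l2_mib: int = L2_MIB_PER_CORE) -> int:
    """Upper bound on one processor's virtual L3: the sum of its cores' L2s."""
    return cores * l2_mib


def virtual_l4_mib(processors: int) -> int:
    """Upper bound on virtual L4 across the given number of processors."""
    return processors * virtual_l3_mib()


if __name__ == "__main__":
    print(virtual_l3_mib())    # one processor
    print(virtual_l4_mib(8))   # e.g. the eight processors in one drawer
```

These are ceilings rather than guaranteed sizes, because the same physical SRAM is serving as L2, virtual L3, and virtual L4 at once.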

Telum’s “virtual L4” works largely the same way its “virtual L3” did in the first place—evicted L3 cache lines from one processor look for a home on a different processor. If another processor in the same Telum system has spare room, the evicted L3 cache line gets retagged as L4 and lives in the virtual L3 on the other processor (which is made up of the “private” L2s of its eight cores) instead.
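The placement policy described above can be sketched as a toy model in Python. This is not IBM’s implementation: the `CoreL2` class, the `place_evicted_line` function, and the first-fit search order are assumptions made for illustration, and the model collapses the two steps by treating a line that finds no room on its own processor as an immediate candidate for the virtual L4.

```python
from dataclasses import dataclass, field


@dataclass
class CoreL2:
    """Toy L2: holds up to `capacity` lines, each tagged 'L2', 'L3' or 'L4'."""
    capacity: int
    lines: dict = field(default_factory=dict)  # address -> tag

    def has_room(self) -> bool:
        return len(self.lines) < self.capacity


def place_evicted_line(addr, home, processors):
    """Find a new home for a line evicted from the L2 at home=(proc, core).

    `processors` is a list of processors, each a list of CoreL2 objects.
    Returns (proc, core, tag), or None if the line falls out to memory.
    """
    home_proc, home_core = home
    processors[home_proc][home_core].lines.pop(addr, None)

    # Virtual L3: spare room in another core's L2 on the same processor.
    for core, l2 in enumerate(processors[home_proc]):
        if core != home_core and l2.has_room():
            l2.lines[addr] = "L3"
            return (home_proc, core, "L3")

    # Virtual L4: spare room in any core's L2 on a different processor.
    for proc, cores in enumerate(processors):
        if proc == home_proc:
            continue
        for core, l2 in enumerate(cores):
            if l2.has_room():
                l2.lines[addr] = "L4"
                return (proc, core, "L4")

    return None  # no spare room anywhere


if __name__ == "__main__":
    # Two processors, two cores each, one line of L2 per core.
    system = [[CoreL2(1) for _ in range(2)] for _ in range(2)]
    for address in (0xA, 0xB, 0xC, 0xD):
        print(hex(address), place_evicted_line(address, (0, 0), system))
```

In the tiny example system, the first eviction lands on the neighboring core as L3, the next two spill to the other processor as L4, and the last finds no room at all. Real hardware would also have to handle lookups, coherence, and replacement of lines already tagged L3 or L4, none of which this sketch attempts.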

AnandTech’s Ian Cutress goes into more detail on Telum’s cache mechanisms. He eventually sums them up by answering “How is this possible?” with a simple “magic.”

AI inference acceleration

https://www.youtube.com/watch?v=TRBgBbvYiQw

IBM’s Christian Jacobi briefly outlines Telum’s AI acceleration in this two-minute clip.

Telum also introduces a 6TFLOPS on-die inference accelerator. It’s intended to be used for—among other things—real-time fraud detection during financial transactions (as opposed to shortly after the transaction).

In the quest for maximum performance and minimal latency, IBM threads several needles. The new inference accelerator is placed on-die, which allows for lower-latency interconnects between the accelerator and CPU cores, but it’s not built into the cores themselves, à la Intel’s AVX-512 instruction set.

The problem with in-core inference acceleration like Intel’s is that it typically limits the AI processing power available to any single core. A Xeon core running an AVX-512 instruction only has the hardware inside its own core available to it, meaning larger inference jobs must be split among multiple Xeon cores to extract the full performance available.

Telum’s accelerator is on-die but off-core. This allows a single core to run inference workloads with the might of the entire on-die accelerator, not just the portion built into itself.
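The difference between the two designs can be put in rough numbers with a short Python sketch. Only the 6 TFLOPS and eight-core figures come from the article; the even per-core split used for the in-core case is a purely hypothetical simplification, not a measurement of any Xeon or Telum part.

```python
ACCELERATOR_TFLOPS = 6.0  # on-die accelerator throughput, from the article
CORES = 8                 # cores per Telum processor, from the article


def single_core_peak_tflops(total_tflops: float, cores: int,
                            shared_off_core: bool) -> float:
    """Peak inference throughput one core can reach by itself.

    shared_off_core=True models one accelerator any core can use in full.
    shared_off_core=False models the same total split evenly into per-core
    units (a hypothetical comparison point).
    """
    return total_tflops if shared_off_core else total_tflops / cores


if __name__ == "__main__":
    print(single_core_peak_tflops(ACCELERATOR_TFLOPS, CORES, True))
    print(single_core_peak_tflops(ACCELERATOR_TFLOPS, CORES, False))
```

Under that assumption, a single-threaded job sees 6 TFLOPS with the shared design and 0.75 TFLOPS with the split one, so the split design needs the job spread across all eight cores to reach the same peak.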

Listing image by IBM
