<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[News and Analysis]]></title><description><![CDATA[News and Analysis of the IT Industry.
Software and Hardware, including Semiconductor Manufacturing.]]></description><link>https://www.newsandanalysis.net</link><image><url>https://substackcdn.com/image/fetch/$s_!nX1L!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fc6846bf3-e2ee-41da-89f0-19eac966b15f_384x384.png</url><title>News and Analysis</title><link>https://www.newsandanalysis.net</link></image><generator>Substack</generator><lastBuildDate>Mon, 06 Apr 2026 18:29:01 GMT</lastBuildDate><atom:link href="https://www.newsandanalysis.net/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Francois Cattelain]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[francois@newsandanalysis.net]]></webMaster><itunes:owner><itunes:email><![CDATA[francois@newsandanalysis.net]]></itunes:email><itunes:name><![CDATA[François Cattelain]]></itunes:name></itunes:owner><itunes:author><![CDATA[François Cattelain]]></itunes:author><googleplay:owner><![CDATA[francois@newsandanalysis.net]]></googleplay:owner><googleplay:email><![CDATA[francois@newsandanalysis.net]]></googleplay:email><googleplay:author><![CDATA[François Cattelain]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[The x86 Empire Strikes Back in the Datacenter Market]]></title><description><![CDATA[Sierra Forest, Bergamo and Turin Dense vs Arm licensees]]></description><link>https://www.newsandanalysis.net/p/the-x86-empire-strikes-back-in-the</link><guid isPermaLink="false">https://www.newsandanalysis.net/p/the-x86-empire-strikes-back-in-the</guid><dc:creator><![CDATA[François Cattelain]]></dc:creator><pubDate>Tue, 06 Aug 2024 13:40:55 GMT</pubDate><enclosure url="https://substackcdn.com/image/fetch/f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa54724-3210-438a-82f2-ca5f6adcb511_1137x799.png" length="0" 
type="image/jpeg"/><content:encoded><![CDATA[<p><em>Note: To simplify this overview and following the <a href="https://www.ft.com/content/57b849bc-0022-44d3-a443-01524338422d">Great Decoupling</a>, we will limit ourselves to CPUs available in Western countries, and not cover hardware options from China.</em></p><p></p><p><strong>Executive Summary:</strong></p><blockquote><p>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; It will be very hard to convince the three biggest CSPs to come back to commercial offerings, as in-house designs offer them lower TCO and greater control over their supply chain. AWS is far ahead in this regard.</p><p>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Arm&#8217;s CSS IP is the perfect sweet spot between time to market and customizability, and will continue to be successful.</p><p>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Ampere Computing is the last commercial Arm datacenter CPU vendor left, but seems to have designed itself into a corner with its hubristic choice of custom Arm cores. Time to market is everything, and execution problems can be very unforgiving for such a small company. However, for now, the company has this market all to itself, and may yet be successful in the coming years.</p><p>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The two x86 incumbents identified the Arm threat in the datacenter long ago, and both now have competitive offerings to counter it.</p><p>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Intel is back. It has a competitive cloud-native datacenter CPU implemented on a competitive in-house process node. 
This is an excellent first step on the long road ahead for the turnaround engineered by CEO Pat Gelsinger.</p></blockquote><p></p><p>After our <a href="https://www.servethehome.com/an-arm-opportunity-with-cloud-service-providers/">article four years ago</a> about the opportunity for Arm to greatly increase its market share among the Cloud Service Providers (aka the CSPs), it is now time to have another look, this time at the broader situation in the datacenter. To do so, we will distinguish between the three biggest CSPs and the rest of the datacenter market, that is, everything outside the Cloud Service Providers, often called the enterprise market.</p><p>But first, we will have to quickly recap what has happened these last four years. In this period, the market has witnessed a true Cambrian explosion of successful Arm designs for servers. And before we examine the state of affairs in the enterprise market, we will first look at the three biggest CSPs: AWS (Amazon Web Services), Microsoft and Google.</p><h3><strong>The AWS Graviton Family: Slowly but surely encompassing all use-cases</strong></h3><p>Among the Arm success stories in the datacenter, AWS is arguably the poster child, as it was simply the <a href="https://www.servethehome.com/putting-aws-graviton-its-arm-cpu-performance-in-context/">first to market</a>. 
The following table recapitulates the entire Graviton family:</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!k9aW!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04ea47b8-01ba-4d01-80d3-c71dc3c93020_859x796.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!k9aW!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04ea47b8-01ba-4d01-80d3-c71dc3c93020_859x796.png 424w, https://substackcdn.com/image/fetch/$s_!k9aW!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04ea47b8-01ba-4d01-80d3-c71dc3c93020_859x796.png 848w, https://substackcdn.com/image/fetch/$s_!k9aW!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04ea47b8-01ba-4d01-80d3-c71dc3c93020_859x796.png 1272w, https://substackcdn.com/image/fetch/$s_!k9aW!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04ea47b8-01ba-4d01-80d3-c71dc3c93020_859x796.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!k9aW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04ea47b8-01ba-4d01-80d3-c71dc3c93020_859x796.png" width="859" height="796" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/04ea47b8-01ba-4d01-80d3-c71dc3c93020_859x796.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:796,&quot;width&quot;:859,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:76468,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!k9aW!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04ea47b8-01ba-4d01-80d3-c71dc3c93020_859x796.png 424w, https://substackcdn.com/image/fetch/$s_!k9aW!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04ea47b8-01ba-4d01-80d3-c71dc3c93020_859x796.png 848w, https://substackcdn.com/image/fetch/$s_!k9aW!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04ea47b8-01ba-4d01-80d3-c71dc3c93020_859x796.png 1272w, https://substackcdn.com/image/fetch/$s_!k9aW!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F04ea47b8-01ba-4d01-80d3-c71dc3c93020_859x796.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><blockquote><h5>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <em>Preview date is the date when low volume deployment starts; higher volume comes approximately 6 months later.</em></h5><h5>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <em>No SMT across the board, due to Arm&#8217;s design choices. 
More on that below.</em></h5><h5>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <em>TDPs are unknown but are presumably pretty low compared to equivalent offerings from Intel and AMD, as core clocks are relatively modest (with no turbo) and generally speaking CSPs are all about TCO, hence a lower TDP to lower electricity cost (see <a href="https://www.servethehome.com/an-arm-opportunity-with-cloud-service-providers/2/">here</a> the part about TCO for more information).</em></h5><h5>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <em>The Arm N1 supports 4x 128-bit SIMD engines when operating in NEON mode.</em></h5><h5>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <em>The Arm N1 supports all Armv8.2 instructions plus some 8.3, 8.4 and 8.5 instructions. See <a href="https://developer.arm.com/Processors/Neoverse%20N1">here</a> for more details.</em></h5><h5>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <em>The Arm V1 supports all Armv8.4 instructions except for one, plus most 8.5 instructions and some 8.6. See <a href="https://developer.arm.com/Processors/Neoverse%20V1">here</a>.</em></h5><h5>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <em>A DDR4 ECC RDIMM has a single 64-bit channel plus 8-bit parity; A DDR5 ECC RDIMM has two 32-bit channels, each with 8-bit parity; hence the discrepancy between DDR4 and DDR5 bus width. 
With DDR5, &#8220;channel&#8221; has become a misleading term.</em></h5><h5>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <em>The Graviton 3E, announced in 2022 and absent in this table, is &#8211; as far as is publicly known &#8211; simply a Graviton 3 with a much higher clock and a higher TDP.</em></h5><h5>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <em>All numbers are per socket for the Graviton4.&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Sources: <a href="https://en.wikichip.org/wiki/annapurna_labs/graviton/graviton">1</a>, <a href="https://en.wikichip.org/wiki/annapurna_labs/graviton/graviton2">2</a>, <a href="https://en.wikichip.org/wiki/annapurna_labs/graviton/graviton3">3</a>, <a href="https://en.wikichip.org/wiki/annapurna_labs/graviton/graviton4">4</a>, <a href="https://www.nextplatform.com/2024/07/09/aws-charges-a-hefty-premium-for-graviton-4-instances/">5</a>.</em></h5></blockquote><p></p><p>This table gives a pretty good overview of the industry as a whole these past years. TSMC has been the go-to foundry for most of the fabless players, and 16nm to 7nm to 5nm is a pretty traditional journey. Indeed, TSMC&#8217;s 10nm process node was barely used.</p><p>Even though AWS has certainly prioritized low TDP &#8211; and thus relatively low clocks &#8211; for each design (see notes of the table about TDP), core clocks still manage to slightly increase from one generation to the next. So, it seems that even if you prioritize low TDP, each new generation of process nodes still brings performance benefits (that is, higher frequency), along of course with the obligatory density improvements.</p><p>Starting with Graviton3, the chips aren&#8217;t monolithic anymore, which is quite an achievement in terms of ASIC design capabilities. 
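</p><p><em>As a quick aside, the RDIMM organization described in the table notes above can be sanity-checked in a few lines of Python. This is an illustrative sketch of the per-module arithmetic only; the widths come from the notes, not from any vendor datasheet:</em></p>

```python
# RDIMM bus widths as described in the table notes:
# DDR4 ECC RDIMM: one 64-bit data channel plus 8 ECC bits.
# DDR5 ECC RDIMM: two 32-bit sub-channels, each with its own 8 ECC bits.

def rdimm_bits(subchannels: int, data_bits: int, ecc_bits: int) -> tuple[int, int]:
    """Return (total data bits, total bits including ECC) for one RDIMM."""
    return subchannels * data_bits, subchannels * (data_bits + ecc_bits)

ddr4_data, ddr4_total = rdimm_bits(1, 64, 8)
ddr5_data, ddr5_total = rdimm_bits(2, 32, 8)

# Same 64 data bits per module, but DDR5 counts two "channels" where DDR4
# counted one -- hence the misleading channel terminology the notes warn about.
assert ddr4_data == ddr5_data == 64
assert (ddr4_total, ddr5_total) == (72, 80)
```

<p>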
Moreover, even though the V2 CPU cores in Graviton4 support the more advanced SVE2 specification, both the V1 in Graviton3 and the V2 in Graviton4 support dual 256-bit SVE/SVE2 engines per core, which is a pretty common occurrence in the industry. For example, AMD&#8217;s Zen4 AVX512 support is implemented via dual 256-bit engines. And finally, Graviton4 is a 2P capable platform with 96 cores, 228 MB of L2+L3 cache in total and 12 DDR5 memory &#8220;channels&#8221; per socket, which positions it more or less on par with the latest and greatest from Intel and AMD, but more on that below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!hGbM!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F771073a1-0940-40e7-a969-686a398a21fe_3515x1026.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!hGbM!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F771073a1-0940-40e7-a969-686a398a21fe_3515x1026.jpeg 424w, https://substackcdn.com/image/fetch/$s_!hGbM!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F771073a1-0940-40e7-a969-686a398a21fe_3515x1026.jpeg 848w, https://substackcdn.com/image/fetch/$s_!hGbM!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F771073a1-0940-40e7-a969-686a398a21fe_3515x1026.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!hGbM!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F771073a1-0940-40e7-a969-686a398a21fe_3515x1026.jpeg 1456w" sizes="100vw"><img 
src="https://substackcdn.com/image/fetch/$s_!hGbM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F771073a1-0940-40e7-a969-686a398a21fe_3515x1026.jpeg" width="1456" height="425" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/771073a1-0940-40e7-a969-686a398a21fe_3515x1026.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:425,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:367358,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!hGbM!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F771073a1-0940-40e7-a969-686a398a21fe_3515x1026.jpeg 424w, https://substackcdn.com/image/fetch/$s_!hGbM!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F771073a1-0940-40e7-a969-686a398a21fe_3515x1026.jpeg 848w, https://substackcdn.com/image/fetch/$s_!hGbM!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F771073a1-0940-40e7-a969-686a398a21fe_3515x1026.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!hGbM!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F771073a1-0940-40e7-a969-686a398a21fe_3515x1026.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" 
class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h5><em>The Graviton Family (<a href="https://x.com/ajassy/status/1730241103482249438">source</a>)</em></h5><h5><em>Starting with Graviton3, we notice the absence of an <a href="https://en.wikipedia.org/wiki/Heat_spreader">IHS</a>, surely to increase the thermal efficiency of the cooling solution.</em></h5><h5><em>Also, Graviton4&#8217;s PCIe chiplets are positioned on each side of the compute die, to better route Gen5 signals on the motherboard&#8217;s PCB (with some at the front of the chassis and some on the back); meanwhile, memory controller chiplets have to stay very close to the compute die to help with signal integrity, latency and power consumption.</em></h5><p></p><p>At AWS re:Invent 2023, Ali Saidi, senior principal engineer at the 
Annapurna Labs (the ASIC design company Amazon bought in 2015 to realize its custom silicon ambitions), gave a few more enlightening <a href="https://fuse.wikichip.org/news/7633/amazon-debuts-4th-gen-graviton/">explanations</a> about the Graviton family&#8217;s evolution. Basically, it&#8217;s all about encompassing ever more use cases with each new generation of products.</p><p><a href="https://en.wikichip.org/wiki/annapurna_labs/graviton/graviton">Graviton1</a> was a pretty modest ASIC designed first and foremost to test the acceptability of a <strong>non-x86 architecture</strong> in AWS&#8217;s cloud. And even though it came out before Arm itself released its <a href="https://www.servethehome.com/arm-neoverse-brand-launched-for-infrastructure-servers-to-edge/">Neoverse</a> line of CPU IP specifically tailored for the datacenter, it was a resounding success.</p><p><a href="https://en.wikichip.org/wiki/annapurna_labs/graviton/graviton2">Graviton2</a> was meant to be a general-purpose CPU greatly expanding the number of applicable workloads compared to the previous generation, with 4 times more CPU cores, and much beefier ones too. However, it still maintained its <strong>focus on integer performance</strong>, as the majority of workloads in the cloud are considered industry-wide to be integer-based.</p><p><a href="https://en.wikichip.org/wiki/annapurna_labs/graviton/graviton3">Graviton3</a>, for its part, has the same number of cores and the same memory bus width as its predecessor, but it brought in much <strong>better floating-point and SIMD-accelerated performance</strong>, thanks to its V1 cores. This allowed it to address yet another part of the market, like HPC and other FP heavy workloads. 
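</p><p><em>A quick sanity check of the generation-over-generation memory bandwidth multipliers, assuming the commonly reported per-socket configurations (8x DDR4-3200 for Graviton2, 8x DDR5-4800 for Graviton3, 12x DDR5-5600 for Graviton4); these channel counts and speeds are public figures, not AWS-confirmed specifications:</em></p>

```python
# Peak per-socket memory bandwidth in MB/s:
# channels * transfer rate (MT/s) * 8 bytes per 64-bit transfer.

def bw_mbs(channels: int, mt_s: int) -> int:
    return channels * mt_s * 8

g2 = bw_mbs(8, 3200)    # Graviton2: 8x DDR4-3200
g3 = bw_mbs(8, 4800)    # Graviton3: 8x DDR5-4800
g4 = bw_mbs(12, 5600)   # Graviton4: 12x DDR5-5600

assert g3 / g2 == 1.5        # +50% going from DDR4 to DDR5
assert g4 / g3 == 1.75       # +75% more bandwidth per socket for Graviton4
assert 2 * g4 / g3 == 3.5    # 3.5x for a full 2P Graviton4 vs a 1P Graviton3
```

<p>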
Also, going from DDR4 to DDR5 allowed AWS to increase memory bandwidth by 50% all the while maintaining the same memory data bus width.</p><p>Finally, <a href="https://en.wikichip.org/wiki/annapurna_labs/graviton/graviton4">Graviton4</a> greatly expands workload applicability again. It is a 2P capable platform, with 50% more CPU cores and 75% more memory bandwidth per socket compared to the previous generation. All in all, compared to its predecessor, a full 2P Graviton4 configuration allows for 3 times more CPU cores, 3.5 times more memory bandwidth, and 3 times more memory capacity (assuming Graviton3 and Graviton4 systems are deployed with RDIMMs of the same capacity, which seems like a pretty reasonable assumption given the time frame of their respective launches and the state of the DDR5 market at that time). So Graviton4 expands the possibilities even further for <strong>scale-up applications</strong>, for example databases requiring <em>a lot</em> of memory.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!x6su!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa236b9ca-e858-4b64-b898-0a236aacc138_1852x1048.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!x6su!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa236b9ca-e858-4b64-b898-0a236aacc138_1852x1048.png 424w, https://substackcdn.com/image/fetch/$s_!x6su!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa236b9ca-e858-4b64-b898-0a236aacc138_1852x1048.png 848w, 
https://substackcdn.com/image/fetch/$s_!x6su!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa236b9ca-e858-4b64-b898-0a236aacc138_1852x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!x6su!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa236b9ca-e858-4b64-b898-0a236aacc138_1852x1048.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!x6su!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa236b9ca-e858-4b64-b898-0a236aacc138_1852x1048.png" width="1456" height="824" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/a236b9ca-e858-4b64-b898-0a236aacc138_1852x1048.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:824,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:561097,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!x6su!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa236b9ca-e858-4b64-b898-0a236aacc138_1852x1048.png 424w, https://substackcdn.com/image/fetch/$s_!x6su!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa236b9ca-e858-4b64-b898-0a236aacc138_1852x1048.png 848w, 
https://substackcdn.com/image/fetch/$s_!x6su!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa236b9ca-e858-4b64-b898-0a236aacc138_1852x1048.png 1272w, https://substackcdn.com/image/fetch/$s_!x6su!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fa236b9ca-e858-4b64-b898-0a236aacc138_1852x1048.png 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><h5><em>Source AWS <a 
href="https://d1.awsstatic.com/events/Summits/reinvent2023/CMP313_AWS-Graviton-The-best-price-performance-for-your-AWS-workloads.pdf">re:Invent 2023 pdf</a></em></h5><p></p><p>Not only does Annapurna Labs seem to have executed pretty well on its roadmap, but it also has clearly paced itself by not trying to do everything at once in a single generation. In hindsight, this seems like a pretty good strategy for a Cloud Service Provider. Semiconductor design and manufacturing is hard, and it&#8217;s probably wise to progress slowly in this endeavor.</p><h3><strong>Microsoft&#8217;s Cobalt 100 and the very successful Arm CSS initiative</strong></h3><p>In November 2023, Microsoft announced its first Arm server CPU designed in-house, the <a href="https://www.servethehome.com/microsoft-azure-cobalt-100-128-core-arm-neoverse-n2-cpu-launched/">Cobalt 100</a>. Preview became <a href="https://techcommunity.microsoft.com/t5/azure-compute-blog/announcing-the-preview-of-new-azure-vms-based-on-the-azure/ba-p/4146353">available</a> 6 months later. Not much is known about it, apart from the fact that it is a 128-core design based on Arm&#8217;s off-the-shelf <a href="https://developer.arm.com/Processors/Neoverse%20N2">N2 core IP</a> and implemented on TSMC&#8217;s 5nm process node. The N2 core was first announced by Arm in April 2021; it is Armv9 compliant and thus supports SVE2, but only has two 128-bit SVE2 engines. So contrary to the V family of cores found in the Graviton3 and Graviton4, the main focus of the N family remains integer performance. 
And by all accounts, integer workloads seem to be the most prevalent among CSPs&#8217; customers, so Microsoft hasn&#8217;t done anything strange here.</p><p>According to <a href="https://www.servethehome.com/microsoft-azure-cobalt-100-128-core-arm-neoverse-n2-cpu-launched/">persistent industry chatter</a>, the Cobalt 100 may be based on Arm&#8217;s very successful Neoverse Compute Subsystem (CSS) initiative, which was examined in detail <a href="https://www.servethehome.com/arm-neoverse-css-makes-neoverse-n2-cores-drop-in-at-hot-chips-2023/">here</a>. Instead of simply offering the CPU core IP (and optionally its verified implementation in silicon for a popular process node), CSS allows Arm to offer its licensees almost the entire CPU design, already validated in silicon and implemented as <a href="https://en.wikipedia.org/wiki/Register-transfer_level">RTL</a> (to simplify, that is everything necessary for a foundry to build the finished product).</p><p>In more detail, along with the CPU core IP, the CSS provides almost all the uncore components. Namely:</p><blockquote><p>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; First and foremost, the CSS includes the Arm <a href="https://www.servethehome.com/arm-neoverse-n2-and-v1-at-arm-tech-day-2021/4/">CMN 700</a>, which is the Mesh Based Coherent Interconnect. It is the interconnect that binds all the sub-blocks of the processor together, and it allows for <a href="https://en.wikipedia.org/wiki/Memory_coherence">memory coherency</a>. It is extremely important in a big CPU, and can make or break a design.</p><p>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The System Control Processor (SCP) and the Manageability Control Processor (MCP). 
The SCP provides internal management of the entire processor, while the MCP allows for communication with the external platform-level management controller (the <a href="https://www.servethehome.com/explaining-the-baseboard-management-controller-or-bmc-in-servers/">BMC</a>).</p><p>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; The Memory Management Unit (MMU) and the Generic Interrupt Controller (GIC). The <a href="https://en.wikipedia.org/wiki/Memory_management_unit">MMU</a> handles memory address translation and is required in basically all modern processors. The <a href="https://developer.arm.com/Architectures/Generic%20Interrupt%20Controller">GIC</a> handles <a href="https://en.wikipedia.org/wiki/Interrupt_request">interrupts</a>, and an equivalent controller is also found in every single processor.</p><p>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Various optional interconnects, such as &#8220;Accelerator Attach&#8221; (to directly attach accelerators with custom interconnects, for performance purposes), &#8220;Multichip Interfaces&#8221; (to cobble together two dies in one package), and &#8220;CMN Gateway&#8221; for multi-socket interconnects.</p></blockquote><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!7W1a!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70892663-1a1a-4e8e-8472-20b531c8f7f7_2560x1440.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!7W1a!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70892663-1a1a-4e8e-8472-20b531c8f7f7_2560x1440.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!7W1a!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70892663-1a1a-4e8e-8472-20b531c8f7f7_2560x1440.jpeg 848w, https://substackcdn.com/image/fetch/$s_!7W1a!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70892663-1a1a-4e8e-8472-20b531c8f7f7_2560x1440.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!7W1a!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70892663-1a1a-4e8e-8472-20b531c8f7f7_2560x1440.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!7W1a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70892663-1a1a-4e8e-8472-20b531c8f7f7_2560x1440.jpeg" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/70892663-1a1a-4e8e-8472-20b531c8f7f7_2560x1440.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:289893,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!7W1a!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70892663-1a1a-4e8e-8472-20b531c8f7f7_2560x1440.jpeg 424w, 
https://substackcdn.com/image/fetch/$s_!7W1a!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70892663-1a1a-4e8e-8472-20b531c8f7f7_2560x1440.jpeg 848w, https://substackcdn.com/image/fetch/$s_!7W1a!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70892663-1a1a-4e8e-8472-20b531c8f7f7_2560x1440.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!7W1a!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F70892663-1a1a-4e8e-8472-20b531c8f7f7_2560x1440.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" 
y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Not included in the CSS are the PCIe/<a href="https://www.servethehome.com/compute-express-link-cxl-3-0-is-the-exciting-building-block-for-disaggregation/">CXL</a> IP blocks and the memory controller IP blocks. It just so happens that the entire industry (except maybe for the likes of Intel and AMD, and possibly Apple and Qualcomm, that probably design their own PCIe and memory IP blocks) has settled a few years ago on the IP blocks offered by renowned IP giants Synopsys and Cadence for these sub-blocks, and Arm has stopped offering new versions of its <a href="https://developer.arm.com/Processors/CoreLink%20DMC-620">memory controller IP</a> blocks since the <a href="https://www.servethehome.com/arm-neoverse-n2-and-v1-at-arm-tech-day-2021/4/">launch</a> of the CMN 700 mesh interconnect in 2021.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!u_cp!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe18a7655-e920-46f2-a6e0-c8c1f1c46540_2560x1440.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!u_cp!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe18a7655-e920-46f2-a6e0-c8c1f1c46540_2560x1440.jpeg 424w, https://substackcdn.com/image/fetch/$s_!u_cp!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe18a7655-e920-46f2-a6e0-c8c1f1c46540_2560x1440.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!u_cp!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe18a7655-e920-46f2-a6e0-c8c1f1c46540_2560x1440.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!u_cp!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe18a7655-e920-46f2-a6e0-c8c1f1c46540_2560x1440.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!u_cp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe18a7655-e920-46f2-a6e0-c8c1f1c46540_2560x1440.jpeg" width="1456" height="819" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/e18a7655-e920-46f2-a6e0-c8c1f1c46540_2560x1440.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:819,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:349907,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!u_cp!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe18a7655-e920-46f2-a6e0-c8c1f1c46540_2560x1440.jpeg 424w, https://substackcdn.com/image/fetch/$s_!u_cp!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe18a7655-e920-46f2-a6e0-c8c1f1c46540_2560x1440.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!u_cp!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe18a7655-e920-46f2-a6e0-c8c1f1c46540_2560x1440.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!u_cp!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Fe18a7655-e920-46f2-a6e0-c8c1f1c46540_2560x1440.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p><strong>Crucially all of these IP blocks are validated in silicon for a specific process node</strong> (in this case TSMC&#8217;s 
5nm), which makes life considerably easier for the Arm licensee.</p><p>Why bother with all these details, one may ask. Well, this shows that a modern server CPU is much more than just the implementation of the CPU core IP: there are many other critical components that need to be developed, tested, and validated in silicon. With the CSS, Arm offers its licensees a huge reduction in time to market, which is critically important in this industry.</p><p>By implementing the 200 mm&#178; (a relatively modest die size for such a powerful chip) 64-core variant of the N2 CSS, Microsoft may have been able to come up with a dual-die, single-package, 128-core monster in a record 13 months, which is simply astonishing. <strong>The Arm CSS pushes the boundaries of the &#8220;off the shelf&#8221; concept even further, allowing a paradigm-shifting reduction in time-to-market. </strong>Please note that Arm&#8217;s CSS are only available for the N2, N3 and V3 cores, so none of the Graviton CPUs could have benefited from them.</p><h3><strong>A quick look at Google&#8217;s Axion CPU and a table to recap it all</strong></h3><p>After AWS, which started its trailblazing journey back in November 2018, and Microsoft, which made its first announcement in November 2023, Google is the last of the big three Cloud Service Providers to announce its homegrown Arm datacenter CPU: the <a href="https://cloud.google.com/blog/products/compute/introducing-googles-new-arm-based-cpu?hl=en">Google Axion</a>. Almost nothing is publicly known about it, except that it is based on the Arm V2 CPU IP. One notable fact is that it incorporates the in-house <a href="https://cloud.google.com/titanium?hl=en">Titanium controllers</a> designed to off-load network and storage I/O processing and security operations.</p><p>This highlights one of the many advantages the biggest CSPs gain from adopting self-designed hardware: they can tailor their hardware to their specific needs. 
This mostly relates to specific network and storage optimizations and hardware acceleration, just as AWS has been doing for many years with its homegrown <a href="https://www.servethehome.com/aws-nitro-the-big-cloud-dpu-deployment-detailed/">Nitro DPU</a>, deployed since at least 2021.</p><p>This way, CSPs can co-optimize their in-house software stack with their indigenous hardware, thus <strong>creating a virtuous circle of lower TCO and greater control over their critical supply chains</strong>. In other words, once they get a taste of it, and unless some of them hit a brick wall of repeated execution failures, <strong>it will be very hard to convince the biggest CSPs and hyperscalers to return to commercial offerings.</strong> That would mean higher acquisition costs, higher operating costs (with no optimizations for the in-house software stack), and less control over the timing of hardware refresh cycles. Higher TCO and less overall control over the infrastructure buildout aren&#8217;t exactly great conversation starters in corporate boardrooms.</p><p>The following table recapitulates the specifications of the homegrown Arm datacenter CPUs of the three biggest Western Cloud Service Providers. Arm&#8217;s IP reigns supreme here, and that&#8217;s not by accident. Time to market, anyone? Please note that <strong>the greyed-out specifications for the Cobalt 100 assume that it is indeed based on a dual-die Arm CSS N2 implementation </strong>(see the CSS slide above for more details). 
Also, no Graviton1 in this table, as it is simply not relevant anymore.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!0-Dx!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2123361c-b291-4e36-b683-900279a28240_892x829.png"><img src="https://substackcdn.com/image/fetch/$s_!0-Dx!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F2123361c-b291-4e36-b683-900279a28240_892x829.png" width="892" height="829" alt="" loading="lazy"></a></figure></div><h3><strong>Ampere Computing, the only Western commercial Arm Datacenter CPU player left</strong></h3><p>Beyond the CSPs&#8217; in-house efforts, and notwithstanding Nvidia&#8217;s special case, which we will get to further below, Ampere Computing is indeed the last commercial Arm datacenter CPU maker left standing, after <a href="https://www.servethehome.com/impact-of-marvell-thunderx3-general-purpose-skus-canceled/">Marvell quit</a> the market by cancelling its ThunderX3 line in August 2020. It is of course very important to note that, Marvell being Marvell, it didn&#8217;t exactly abandon this market, but rather redirected its ambitions towards <a href="https://www.servethehome.com/marvell-ai-investor-day-2024-mrvl/">building ASICs</a> (including CPUs) for third parties. 
Together with <a href="https://www.broadcom.com/products/custom-silicon/asics">Broadcom</a> and lesser-known players like <a href="https://www.guc-asic.com/en/solution-asic.php">GUC</a>, Marvell is indeed one of the few companies with the requisite know-how to capitalize on the CSPs&#8217; and hyperscalers&#8217; thirst for in-house silicon by helping them design their homegrown CPUs and accelerators.</p><p>In other words, Marvell decided that it had a better chance of making money by building custom CPUs for third parties than by launching an entire line of commercially available SKUs. Which brings us back to Ampere Computing. The company was founded in 2018 by former Intel President Ren&#233;e James with funding from the Carlyle Group, Arm and Oracle. Oracle, of course, is often considered to be among the top five biggest Western CSPs, and has been one of the first big <a href="https://www.nextplatform.com/2021/01/19/ampere-steams-ahead-with-arm-server-chips/">Ampere customers</a>.</p><p>Ampere was the first to market with a commercially available 64+ core Arm datacenter CPU in December 2020, choosing the off-the-shelf Arm N1 CPU IP for its first generation of products, the 80-core <a href="https://www.phoronix.com/review/ampere-altra-q80">Ampere Altra</a>. This was quickly followed in September 2021 by a 128-core variant, the <a href="https://www.anandtech.com/show/16979/the-ampere-altra-max-review-pushing-it-to-128-cores-per-socket">Ampere Altra Max</a>, which was basically the same CPU with more cores and slightly less L3 (probably to keep the monolithic die size in check). Even though the Altra made for a good &#8220;cloud workloads all-rounder&#8221; CPU, the Altra Max was badly starved of L3 and memory bandwidth, and thus could only shine in a specific subset of CSP workloads. See this excellent <a href="https://www.anandtech.com/show/16979/the-ampere-altra-max-review-pushing-it-to-128-cores-per-socket/10">review</a> for more details on this matter. 
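<p>To put a number on that starvation, here is a back-of-the-envelope sketch. Both Altra and Altra Max use the same 8-channel DDR4-3200 memory subsystem; the math below is the standard theoretical peak (channels &#215; transfer rate &#215; bytes per transfer), ignoring real-world efficiency losses:</p>

```python
# Back-of-the-envelope: peak DRAM bandwidth shared per core.
# Altra (80 cores) and Altra Max (128 cores) share the same
# 8-channel DDR4-3200 memory subsystem (64 data bits per channel).

def peak_bandwidth_gbs(channels: int, mts: int, bus_bytes: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s: channels * MT/s * bytes per transfer."""
    return channels * mts * bus_bytes / 1000  # MT/s * bytes = MB/s -> GB/s

total = peak_bandwidth_gbs(channels=8, mts=3200)  # 204.8 GB/s for both chips
per_core_altra = total / 80       # 2.56 GB/s per core
per_core_altra_max = total / 128  # 1.60 GB/s per core

print(f"Total:            {total:.1f} GB/s")
print(f"Altra (80c):      {per_core_altra:.2f} GB/s per core")
print(f"Altra Max (128c): {per_core_altra_max:.2f} GB/s per core")
```

<p>Same memory subsystem, 60% more cores: each Altra Max core is left with roughly 1.6 GB/s of theoretical peak bandwidth versus 2.56 GB/s on the Altra, which is consistent with the review&#8217;s findings.</p>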
In any case, at that time, Ampere had come up with a solid, attractive and timely offering that boded well for the company&#8217;s future. But then came&#8230;</p><h3><strong>The fateful choice of custom Arm cores and the bane of execution problems</strong></h3><p>In May 2022, the company announced its next-generation product, the 5nm <a href="https://www.servethehome.com/ampere-announces-5nm-arm-server-cpu-ampereone/">AmpereOne</a>, sporting up to 192 cores. The bombshell, of course, was that these CPUs would sport <strong>custom Arm cores</strong> instead of using Arm&#8217;s off-the-shelf IP. We have examined the debate around &#8220;off-the-shelf vs custom cores&#8221; <a href="https://www.servethehome.com/an-arm-opportunity-with-cloud-service-providers/2/">before</a>, and four years later, not much has changed. The &#8220;off-the-shelf strategy&#8221; is <strong>less risky</strong> (especially for a company with limited resources), and allows for <strong>shorter time-to-market</strong> (TTM). As for the &#8220;custom cores strategy&#8221;, let&#8217;s quote the piece <a href="https://www.servethehome.com/an-arm-opportunity-with-cloud-service-providers/2/">published by STH</a> four years ago:</p><p><em>&#8220;The custom cores strategy can only pay off if one is able to execute well enough and fast enough, all the while offering an obvious price or performance advantage. It necessitates a lot more resources, but if the differentiation is a win with customers, the payoff can potentially be big.&#8221;</em></p><p>The problem, of course, is that for all intents and purposes, AmpereOne is pretty late, and has only started shipping this month (August 2024). This may be viewed as a controversial statement, as AmpereOne&#8217;s availability in the cloud has been announced many times these past 18 months. 
However, trustworthy sources like <a href="https://www.phoronix.com/review/intel-xeon-6700e-sierra-forest">Michael Larabel</a> from Phoronix and <a href="https://www.servethehome.com/ampere-ampereone-update-256-core-12-channel-arm-cpu-coming/">Patrick Kennedy</a> from Servethehome pretty much confirm that AmpereOne&#8217;s availability has been a real problem. <em>[<a href="https://www.servethehome.com/ampere-ampereone-192-core-performance-outlined-arm/">August 2024</a> is probably the month when AmpereOne really becomes available. Expect third-party benchmarks of AmpereOne *very* soon]</em></p><p>It is of course impossible to know if the delay incurred by AmpereOne is due to the choice of custom cores or to some other factor, such as a separate design problem. But one thing is certain: opting for Arm&#8217;s off-the-shelf CPU IP allows for lower risk and faster time to market, both things that AmpereOne has sorely missed. From September 2021 to August 2024, this three-year delay is the sign of a serious execution problem.</p><h3><strong>Yet another table, and Ampere&#8217;s future outlook</strong></h3><p>This table recapitulates all Western Arm datacenter CPUs available by 24H2. 
More about Nvidia&#8217;s Grace special case in the next chapter below.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!WRpi!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416abe12-f47f-4e6a-ae24-b8e6c2a65582_1170x829.png"><img src="https://substackcdn.com/image/fetch/$s_!WRpi!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F416abe12-f47f-4e6a-ae24-b8e6c2a65582_1170x829.png" width="1170" height="829" alt="" loading="lazy"></a></figure></div><p>One important note: the &#8220;shipping&#8221; line indicates start of preview or beginning of significant volume deployment. Yes, we are comparing apples and oranges here; CSPs&#8217; in-house CPUs and commercially available ones are different beasts altogether.</p><p>On the face of it, AmpereOne doesn&#8217;t look out of place, with a 192-core 2P platform shipping *now*. But a closer look reveals a damaging <strong>three-year delay</strong> between Altra Max and AmpereOne. Also, shipping an Armv8.6+ design in 24H2 is not a good look when Armv9.0 designs have been coming out since mid-2023. Of course, not supporting Armv9.0 instructions may not be such a big deal for many CSPs, as most of them are probably more interested in maintaining a single software support baseline (Armv8.0), but it is nonetheless the unmistakable sign of a detrimental delay. 
<strong>It also shows that the &#8220;custom cores strategy&#8221; can be truly unforgiving when combined with execution problems.</strong></p><p>These past few days (August 2024), Ampere has updated its roadmap, and it certainly makes sense. AmpereOne M is to AmpereOne what Altra Max was to Altra: a derivative design with only one significant update. This allows the company to stay relevant while keeping to a minimum the resources allocated to a new design. So, the AmpereOne M is an AmpereOne but with 12 DDR5 memory &#8220;channels&#8221; instead of 8. This will obviously require a new socket and new motherboards, but these will be made worth investing in by the future advent of AmpereOne MX, a 3nm 256-core variant using this 12-&#8220;channel&#8221; platform. This 256-core variant was <a href="https://www.servethehome.com/ampere-ampereone-update-256-core-12-channel-arm-cpu-coming/">announced</a> three months ago. All of this is of course facilitated by the chiplet-based design. The <a href="https://www.servethehome.com/ampere-ampereone-aurora-512-core-ai-cpu-announced-arm/">novelty</a> here is a future 512-core product including AI silicon for training and inference, designed to be air-cooled.</p><p>The idea of running inference on CPUs &#8211; at least for some players in the industry &#8211; is nothing new and has been <a href="https://www.nextplatform.com/2023/04/05/why-ai-inference-will-remain-largely-on-the-cpu/">discussed</a> <a href="https://www.nextplatform.com/2023/10/30/intel-is-counting-on-ai-inference-to-save-the-xeon-cpu/">at</a> <a href="https://www.nextplatform.com/2024/04/16/ampere-readies-256-core-cpu-beast-awaits-the-ai-inference-wave/">length</a>, for example by the excellent Timothy Prickett Morgan from <a href="https://www.nextplatform.com/">nextplatform.com</a>. 
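<p>Why do the extra memory channels matter so much for inference on CPUs? Token generation is largely memory-bandwidth-bound: each generated token requires streaming roughly all model weights from DRAM. A roofline-style sketch &#8211; note that the DDR5-5600 speed and the 7 GB model size below are illustrative assumptions, not Ampere&#8217;s published figures:</p>

```python
# Roofline-style upper bound for autoregressive token generation on a CPU:
# tokens/s <= peak memory bandwidth / bytes of weights streamed per token.

def peak_bandwidth_gbs(channels: int, mts: int, bus_bytes: int = 8) -> float:
    """Theoretical peak DRAM bandwidth: channels * MT/s * bytes per transfer."""
    return channels * mts * bus_bytes / 1000

def max_tokens_per_s(bandwidth_gbs: float, model_gb: float) -> float:
    return bandwidth_gbs / model_gb

bw_8ch = peak_bandwidth_gbs(8, 5600)    # ~358 GB/s (8-channel platform)
bw_12ch = peak_bandwidth_gbs(12, 5600)  # ~538 GB/s (12-channel platform)

model_gb = 7.0  # e.g. a 7B-parameter model quantized to 8 bits per weight
print(f"8 channels:  <= {max_tokens_per_s(bw_8ch, model_gb):.0f} tokens/s")
print(f"12 channels: <= {max_tokens_per_s(bw_12ch, model_gb):.0f} tokens/s")
```

<p>A 50% jump in channel count buys a 50% higher ceiling on bandwidth-bound inference throughput, before any software cleverness.</p>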
Air-cooling, for its part, is certainly a must if you want your product to be successful in the market, as not every datacenter is ready for water-cooling, as shown for example in <a href="https://www.semianalysis.com/p/nvidias-blackwell-reworked-shipment">this great piece</a> from Semianalysis about Nvidia&#8217;s Blackwell respin.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!VXwV!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba7726c-9fac-4756-ae40-6e2fe9fc1599_1920x1082.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!VXwV!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba7726c-9fac-4756-ae40-6e2fe9fc1599_1920x1082.jpeg 424w, https://substackcdn.com/image/fetch/$s_!VXwV!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba7726c-9fac-4756-ae40-6e2fe9fc1599_1920x1082.jpeg 848w, https://substackcdn.com/image/fetch/$s_!VXwV!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba7726c-9fac-4756-ae40-6e2fe9fc1599_1920x1082.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!VXwV!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba7726c-9fac-4756-ae40-6e2fe9fc1599_1920x1082.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!VXwV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba7726c-9fac-4756-ae40-6e2fe9fc1599_1920x1082.jpeg" width="1456" height="821" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1ba7726c-9fac-4756-ae40-6e2fe9fc1599_1920x1082.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:821,&quot;width&quot;:1456,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:251535,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!VXwV!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba7726c-9fac-4756-ae40-6e2fe9fc1599_1920x1082.jpeg 424w, https://substackcdn.com/image/fetch/$s_!VXwV!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba7726c-9fac-4756-ae40-6e2fe9fc1599_1920x1082.jpeg 848w, https://substackcdn.com/image/fetch/$s_!VXwV!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba7726c-9fac-4756-ae40-6e2fe9fc1599_1920x1082.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!VXwV!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1ba7726c-9fac-4756-ae40-6e2fe9fc1599_1920x1082.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>A roadmap is nice, but nobody can do compute on paper-launched products. Ampere Computing cannot afford another delay like what just happened with AmpereOne. Going for custom cores probably wasn&#8217;t a good idea. In Ampere&#8217;s defense, this decision was made a very long time ago &#8211; probably at least five years ago &#8211; long before the availability of Arm&#8217;s CSS IP allowed for incredibly short time-to-market (TTM). But even then, TTM was &#8211; and still is &#8211; everything in this market. <strong>Which customers care about the supposed advantages of cloud native custom Arm cores when the price to pay is a 12+ month delay, especially when Arm&#8217;s Neoverse lineup of CPU IP is considered good enough by every single hyperscaler out there?</strong></p><p>And at the end of the day, the Carlyle Group isn&#8217;t in it for the glory, and will at some point in the future look for an ROI. 
So, after the AmpereOne family (AmpereOne, AmpereOne M and AmpereOne MX), don&#8217;t be surprised if the company pares down its initial custom-core ambitions and we all discover that AmpereOne Aurora adopts Arm off-the-shelf CPU IP. This is all speculation of course, and the future is uncertain. For now, let&#8217;s examine...</p><h3><strong>Nvidia Grace: a special case not designed to compete with traditional datacenter CPUs</strong></h3><p>There is one last Western Arm datacenter CPU to mention here: Nvidia&#8217;s Grace. This is however a special case, and isn&#8217;t meant to compete with the likes of Graviton4 or AmpereOne. This CPU is meant to be a companion to the company&#8217;s GPUs, and its standout feature is its dedicated interconnect to do just that, along with the enormous chip-to-chip (C2C) bandwidth that goes with it: 900GB/s. Grace can be directly linked to one or two Hopper or Blackwell GPUs, as shown <a href="https://www.semianalysis.com/p/nvidias-blackwell-reworked-shipment">here</a>. Nvidia also offers what it calls the Grace Superchip, that is, two Grace CPUs linked together with said C2C interconnect.</p><p>Notwithstanding the fact that Nvidia probably doesn&#8217;t want to sacrifice its high margins to try and compete in the market for Arm datacenter CPUs, there have been a few deployments of Grace-only configurations, mostly by public research labs that also happen to have bought a lot of Nvidia&#8217;s GPUs. In other words, Grace-only deployments are pretty limited, and further restricted to buyers of Nvidia&#8217;s GPUs that probably managed to get a pretty good deal on price. This is especially true of public research labs, where Nvidia might be more than happy to lower its prices in pursuit of further cementing CUDA mind-share among influential software developers. 
Case in point: the <a href="https://www.nextplatform.com/2023/05/25/isambard-3-to-put-nvidias-grace-cpu-through-the-hpc-paces/">Isambard 3</a> supercomputer at the University of Bristol.</p><p>The memory subsystem of Nvidia&#8217;s Grace also indicates that it wasn&#8217;t meant to compete in the broader datacenter CPU market: it uses soldered-on LPDDR5X instead of the swappable DDR5 RDIMM modules that every datacenter customer would certainly prefer. This choice makes a lot of sense for Nvidia, as memory capacity is pretty much a given in Grace&#8217;s typical deployment scenarios (that is, as a necessary companion taking care of mostly I/O operations while the GPUs are responsible for the true compute heavy lifting), and LPDDR5X allows for more memory bandwidth and less power at the same bus width compared to traditional DDR5.</p><p>Finally, contrary to what it had done in the past with the ill-fated &#8220;<a href="https://www.anandtech.com/show/4099/nvidias-project-denver-nv-designed-high-performance-arm-core">project</a> <a href="https://www.anandtech.com/show/8701/the-google-nexus-9-review/2">Denver</a>&#8221;, Nvidia has opted for off-the-shelf Neoverse V2 CPU cores here, probably foregoing the damaging NIH syndrome in favor of faster time-to-market. The fact that it didn&#8217;t use Arm&#8217;s CMN IP for the mesh can certainly be explained by that IP simply being unable to accommodate the enormous 900GB/s bandwidth of its C2C interface, as stated above. Hence the custom Nvidia &#8220;Scalable Coherency Fabric&#8221; (SCF) instead of the ubiquitous Arm CMN 700.</p><h3><strong>The x86 Empire strikes back, first with AMD&#8217;s Bergamo</strong></h3><p>Now that we have taken an exhaustive look at the current Arm datacenter CPU landscape, it is time to finally investigate the response of the two x86 incumbents to this flurry of successful Arm designs. To be clear, this response doesn&#8217;t disappoint. 
AMD was the first to counterattack with the launch of Bergamo in July 2023. This isn&#8217;t entirely surprising, as until the middle of this year, Intel was still very busy getting out of the hole it had dug itself into with its 10nm disaster on the manufacturing front, and with the very painful &#8220;<a href="https://www.theverge.com/2022/10/4/23385652/pat-gelsinger-intel-chips-act-ohio-manufacturing-chip-shortage">pipe-flushing</a>&#8221; of the older <a href="https://www.semianalysis.com/p/intel-emerald-rapids-backtracks-on">SPR and EMR</a> CPUs on the design front.</p><p>Back to Bergamo, AMD&#8217;s original datacenter CPU chiplet architecture also allowed it to offer a truly cloud native datacenter CPU option relatively easily and quickly. Whereas every single chiplet-based CSP CPU is organized around a big central compute die surrounded by memory and PCIe I/O dies, AMD&#8217;s datacenter CPUs are designed the other way around: a huge I/O die surrounded by small compute dies (that are reused for desktop and high-end laptop products).</p><p>So, to address the newly emerging &#8220;cloud native&#8221; datacenter CPU market, all AMD had to do was to design <a href="https://www.semianalysis.com/p/zen-4c-amds-response-to-hyperscale">a new small compute die</a> for this precise purpose. It did so with the magic of higher transistor density allowed by lower frequency. Indeed, everything else being the same (process node used, IP being implemented), it is possible to achieve much higher transistor density if you are ready to forgo the very high clocks normally achieved by traditional x86 CPUs.</p><p>It all comes together when you consider what a &#8220;cloud native&#8221; datacenter CPU really is: a CPU with a higher compute density (more cores per socket) at the cost of slightly less performant cores; if the cores are identical, it simply means their maximum frequency will be lower. 
<strong>Compared to a traditional datacenter CPU, a cloud native CPU simply has higher multi-thread performance per socket, at the cost of lower single-thread performance per core. </strong>In other words, more cores per socket, but the cores are running at a lower frequency (assuming the cores are identical).</p><p>Believe it or not, and again, all other things being equal (process node used and IP being implemented), very high frequencies require that the transistors are &#8211; so to speak &#8211; given space to breathe, and that results in a lower overall transistor density. If you abandon high frequency, you can achieve higher density, even if implementing the same IP on the same process node. Back to AMD&#8217;s cloud native endeavor, whereas its &#8220;traditional&#8221; Zen4 compute die was designed to reach up to 5.7 GHz (on the desktop), its denser Zen4c cloud native compute die tops out at 3.1 GHz. Hence the fantastic density AMD was able to achieve with the Zen4c compute die:</p><p>16 Zen4c cores and 32 MB of L3 cache topping out at 3.1 GHz in 73 mm&#178; for the Zen4c compute die versus</p><p>&nbsp; 8 Zen4 cores and 32 MB of L3 cache topping out at 5.7 GHz in 66 mm&#178; for the Zen4 compute die.</p><h3><strong>Zen4c: a density tour de force, together with AVX-512 and SMT</strong></h3><p>We are simplifying things a bit here (single CCX in the Zen4 die vs dual CCX in the Zen4c die; no TSVs in the Zen4c die; and the Zen4 die has logic to support 96 MB of L3 cache for the 3D V-Cache SKUs), but this is akin to <strong>almost doubling the compute density per compute die</strong>. However, AMD could only place 8 Zen4c compute dies on its SP5-based package versus 12 Zen4 compute dies for its more traditional offering, codenamed Genoa. Hence Bergamo only tops out at 128 cores per socket (eight 73 mm&#178; compute dies with 16 cores each; 8*16) while Genoa tops out at 96 cores (twelve 66 mm&#178; compute dies with 8 cores each; 12*8). 
This discrepancy will be carried over to the Zen5 generation, with the &#8220;normal&#8221; Turin sporting 16 Zen5 compute dies with 8 cores each (128 cores in total), while <strong>Turin dense, the successor to Bergamo</strong>, will sport 12 Zen5c compute dies with 16 cores each (192 cores in total).</p><p>However, whereas the Zen4 and Zen4c compute dies were implemented on the same process node (TSMC 5nm), the situation will change with Zen5. While the &#8220;normal&#8221; Zen5 compute die will use TSMC&#8217;s 4nm node (a refined 5nm node), the Zen5c compute die will use TSMC&#8217;s brand new N3E (3nm) process node. This is mostly due to TSMC&#8217;s decision to change course on its 3nm family of nodes mid-journey after its first iteration (N3B) was deemed too costly, and AMD&#8217;s decision to hedge its bets by porting Zen5 to 4nm, all the while choosing N3E for the much denser Zen5c. But this is a story for another day.</p><p>Back to the matter at hand, the beauty of the Zen4c cores is that they are exactly the same as the Zen4 cores, simply running at a lower frequency. <strong>This allows AMD to save on precious engineering resources by not developing a distinct CPU architecture for the cloud native market, all the while offering bonus features like SMT and AVX-512. </strong>Truthfully, SMT (simultaneous multi-threading) and AVX-512 are not exactly the first things a CSP is looking for in a datacenter CPU. As we have seen in the previous tables, no Arm processor described above supports SMT, and for good reason: a CSP&#8217;s job is to run its customers&#8217; workloads, and SMT can introduce the noisy neighbor effect, or &#8211; worse still &#8211; potentially nefarious interactions between the workloads of two different customers running on the same core. AVX-512, for its part, is only useful for a very specific subset of workloads, and requires a software rewrite to be taken advantage of. 
But at the end of the day, both SMT and AVX-512 basically come for free in AMD&#8217;s cloud native offerings and can be disabled/discarded very easily with zero drawback. So, count this as a clear win for AMD, especially compared to&#8230;</p><h3><strong>Intel&#8217;s Sierra Forest, the Return of the King</strong></h3><p>There is *a lot* to say about Intel, about how it dug itself into a hole this last decade, and about how it is now trying to get out. See <a href="https://www.newsandanalysis.net/p/a-series-about-intel-part-one-the">here</a> for our first part on Intel&#8217;s odyssey to Hell and back. Expect the second part before the end of August 2024. In any case, <strong>Sierra Forest is the first part out of the company bearing the mark of a true renewal</strong> initiated by its new CEO, the <a href="https://x.com/dylan522p/status/1820503612591919375">Bible-quoting</a> Pat Gelsinger. And in fact, 2024 is truly the year when the new Intel runs out of excuses, as the &#8220;<a href="https://www.theverge.com/2022/10/4/23385652/pat-gelsinger-intel-chips-act-ohio-manufacturing-chip-shortage">pipe-flushing</a>&#8221; era of Meteor Lake, Sapphire Rapids and Emerald Rapids is now truly over, and the new CEO won&#8217;t be able to say something along the lines of: &#8220;Yes, these parts are underperforming, but it&#8217;s not my fault, as they were already well into the design phase when I came onboard!&#8221; Furthermore, it is important to distinguish between Intel&#8217;s efforts on the design front, competing against the likes of AMD and Nvidia, and its efforts on the manufacturing front, competing against TSMC. Yes, these are two very different kinds of ventures. Four years ago, Intel could already be seen as the last dinosaur, not having gone fabless (if you are ready to count Samsung as a special case due to its huge memory manufacturing operations). 
Nowadays, with TSMC utterly dominating the leading-edge foundry game in terms of technology (packaging included), volume and breadth of customers, Intel really looks like an out-of-place company in an entirely fabless world. So <a href="https://www.newsandanalysis.net/p/a-series-about-intel-part-one-the">don&#8217;t be surprised</a> if the Santa Clara company ends up spinning off its fabs in a few years, once it is able to credibly reclaim the technological crown on the foundry front with the advent of its 18A process node.</p><p>Back to Sierra Forest (SRF), it is an interesting beast, built in a truly unique way. But to better understand this, we will have to take a look at Intel&#8217;s entire 2024 server offerings. There will be basically two distinct lineups: the cloud native Sierra Forest (E-cores), and the traditional, higher-performance Granite Rapids (P-cores). Each of these two lineups will be implemented on two platforms common to both: a 12 DDR5 &#8220;channels&#8221; platform (Xeon 6900), and an 8 DDR5 &#8220;channels&#8221; platform (Xeon 6700). Yes, this is a common theme in the industry, as Ampere is doing the same with AmpereOne and AmpereOne M (see above), and so is AMD with its 12 &#8220;channels&#8221; SP5 socket for Genoa and Bergamo and its 6 &#8220;channels&#8221; SP6 socket for more reasonably sized products codenamed Siena (not detailed here for simplicity&#8217;s sake). 
Basically, all manufacturers of commercially available datacenter CPUs are trying to balance the need to <strong>pass through the memory wall</strong>, which requires more memory channels, with the need to <strong>keep total platform costs in check</strong> for customers with lower compute requirements, hence the availability of variants with only 8 or 6 DDR5 memory &#8220;channels&#8221;.</p><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Cti6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ac6b25e-3cbc-4496-bcd5-60f13bb50cac_1200x675.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Cti6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ac6b25e-3cbc-4496-bcd5-60f13bb50cac_1200x675.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Cti6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ac6b25e-3cbc-4496-bcd5-60f13bb50cac_1200x675.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Cti6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ac6b25e-3cbc-4496-bcd5-60f13bb50cac_1200x675.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Cti6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ac6b25e-3cbc-4496-bcd5-60f13bb50cac_1200x675.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Cti6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ac6b25e-3cbc-4496-bcd5-60f13bb50cac_1200x675.jpeg" width="1200" 
height="675" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/3ac6b25e-3cbc-4496-bcd5-60f13bb50cac_1200x675.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:675,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:108448,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Cti6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ac6b25e-3cbc-4496-bcd5-60f13bb50cac_1200x675.jpeg 424w, https://substackcdn.com/image/fetch/$s_!Cti6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ac6b25e-3cbc-4496-bcd5-60f13bb50cac_1200x675.jpeg 848w, https://substackcdn.com/image/fetch/$s_!Cti6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ac6b25e-3cbc-4496-bcd5-60f13bb50cac_1200x675.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!Cti6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F3ac6b25e-3cbc-4496-bcd5-60f13bb50cac_1200x675.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" 
xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!TXx6!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d60694b-c03a-488b-b6ab-5aae08992512_1200x675.jpeg" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!TXx6!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d60694b-c03a-488b-b6ab-5aae08992512_1200x675.jpeg 424w, https://substackcdn.com/image/fetch/$s_!TXx6!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d60694b-c03a-488b-b6ab-5aae08992512_1200x675.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!TXx6!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d60694b-c03a-488b-b6ab-5aae08992512_1200x675.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!TXx6!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d60694b-c03a-488b-b6ab-5aae08992512_1200x675.jpeg 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!TXx6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d60694b-c03a-488b-b6ab-5aae08992512_1200x675.jpeg" width="1200" height="675" data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/1d60694b-c03a-488b-b6ab-5aae08992512_1200x675.jpeg&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:675,&quot;width&quot;:1200,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:92328,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/jpeg&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!TXx6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d60694b-c03a-488b-b6ab-5aae08992512_1200x675.jpeg 424w, https://substackcdn.com/image/fetch/$s_!TXx6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d60694b-c03a-488b-b6ab-5aae08992512_1200x675.jpeg 848w, 
https://substackcdn.com/image/fetch/$s_!TXx6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d60694b-c03a-488b-b6ab-5aae08992512_1200x675.jpeg 1272w, https://substackcdn.com/image/fetch/$s_!TXx6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1d60694b-c03a-488b-b6ab-5aae08992512_1200x675.jpeg 1456w" sizes="100vw" loading="lazy"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>As we are comparing x86 and Arm datacenter CPUs here, we will limit ourselves to Intel&#8217;s cloud native lineup, based on what 
the company calls its Efficiency Cores (E-cores). But just to be clear, <strong>Sierra Forest (E-cores) is to Granite Rapids (P-cores) what Bergamo was to Genoa</strong>, and what Turin-dense will be to Turin: the new cloud native lineups from the x86 incumbents, offering <strong>more compute per socket at the cost of lower single-thread performance per core</strong>, because this is exactly what the CSPs of this world want. As of August 2024, Intel has only launched the 8 &#8220;channels&#8221; variant of Sierra Forest; that&#8217;s the Xeon 6700E in Intel parlance: &#8220;6700&#8221; for the 8 &#8220;channels&#8221; platform, and &#8220;E&#8221; for the cloud native SKUs.</p><p>Compared to the in-house designs of the CSPs (a big single compute die surrounded by small I/O dies) and to what AMD has done since Zen2 (a big single I/O die surrounded by small compute dies), Intel has opted for a third way for Sierra Forest (SRF) and Granite Rapids (GNR). These CPUs have huge compute dies that incorporate memory controllers, surrounded on each side by PCIe/CXL I/O dies that are shared between these two platforms. Hence the Xeon 6700E (SRF 8ch) has a single 144-core compute die, whereas the future Xeon 6900E (SRF 12ch) will incorporate two 144-core compute dies, topping out at 288 cores with 12 memory &#8220;channels&#8221;. The advantage here is that the <strong>memory controllers are &#8220;on die&#8221;</strong>, that is, on the same die as the compute units, which allows for lower latency and thus higher performance. In more detail, the 144-core SRF compute die incorporates an 8 DDR5 &#8220;channels&#8221; memory controller. The dual-die 288-core beast will thus control a total of 16 (2*8) DDR5 &#8220;channels&#8221;, only 12 of which will be used. The drawback, of course, is that these dies are pretty big, and will thus be relatively expensive to manufacture. 
However, and that&#8217;s what&#8217;s truly remarkable with Sierra Forest and Granite Rapids, these CPUs are <strong>manufactured on Intel&#8217;s new 3nm process node</strong>, called &#8220;Intel 3&#8221;. After the <a href="https://www.newsandanalysis.net/p/a-series-about-intel-part-one-the">terrible fiasco</a> of Intel&#8217;s 10nm process node (that was at some point renamed Intel 7, to better reflect its real density compared to TSMC&#8217;s offerings and to better hide the company&#8217;s defeat on the manufacturing front), <strong>Intel&#8217;s 3nm process node is the first competitive node coming out of the company since the glory days of the 14nm era! </strong>For completeness&#8217; sake, Intel 4 was just a de-risking step on the road to Intel 3, only used by Intel itself, and only a few libraries were ported to that node; Intel 3, however, will be used by third-party customers, and Intel will port all the libraries that it can to it.</p><p>That is why <strong>Sierra Forest marks the true return of Intel to the datacenter CPU market</strong>: for all intents and purposes, it is a <a href="https://www.phoronix.com/review/intel-xeon-6780e-6766e/10">very competitive</a> cloud native offering, with <strong>outstanding power efficiency</strong> (which is the one thing CSPs care about, as better power efficiency leads to a lower TCO), and is<strong> implemented on a competitive in-house process node</strong>, which allows Intel to avoid the TSMC tax and to preserve its margins on these products.</p><h3><strong>The Pros and Cons of a distinct microarchitecture</strong></h3><p>The other big difference between Intel&#8217;s Sierra Forest and AMD&#8217;s Bergamo (and the future Turin-dense) is that Intel uses an entirely different microarchitecture for its cloud native lineup. This microarchitecture is the direct descendant of the Intel Atom line of CPUs, which first debuted in <a href="https://www.anandtech.com/show/2493">2008</a>. 
Over the course of 16 years, it has gone through many iterations. After first being used exclusively in ultra-low-power, ultra-cheap notebooks, its focus was then broadened to CPUs designed for micro server applications, starting in 2013 with the very successful <a href="https://www.servethehome.com/intel-avoton-rangeley-power-consumption-real-world-c2750-samples-tested/">Avoton/Rangeley</a> lineup. And in 2021, Intel implemented for the first time the newest version of its low power Atom microarchitecture (now called E-cores) on its highest performing process node, pushing these low power cores to very high frequencies compared to previous generations (up to 3.9 GHz instead of 2.4 GHz previously). All in all, from Bonnell to Crestmont, there have been 8 iterations of the Atom line of CPU cores, and starting with Gracemont and continuing with Crestmont, these cores have nothing to be ashamed of anymore in terms of single-thread performance, all the while maintaining very high power efficiency and &#8220;transistor efficiency&#8221; (they are small). In other words, Intel has figured out the <a href="https://www.servethehome.com/an-arm-opportunity-with-cloud-service-providers/2/">PPA conundrum</a> with Crestmont, and that is why it makes for such a great cloud native CPU.</p><p>However, this has two main drawbacks for the Santa Clara company: the lack of SMT support, and the lack of AVX-512 support. 
And even if, as stated above, these features are not exactly &#8220;must haves&#8221; in the realm of cloud native CPUs, not having them can be viewed as a relatively important disadvantage for Intel when compared to AMD&#8217;s Zen4c and Zen5c offerings.</p><h3><strong>One last table to rule them all and closing thoughts</strong></h3><div class="captioned-image-container"><figure><a class="image-link image2 is-viewable-img" target="_blank" href="https://substackcdn.com/image/fetch/$s_!Zz9q!,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa54724-3210-438a-82f2-ca5f6adcb511_1137x799.png" data-component-name="Image2ToDOM"><div class="image2-inset"><picture><source type="image/webp" srcset="https://substackcdn.com/image/fetch/$s_!Zz9q!,w_424,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa54724-3210-438a-82f2-ca5f6adcb511_1137x799.png 424w, https://substackcdn.com/image/fetch/$s_!Zz9q!,w_848,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa54724-3210-438a-82f2-ca5f6adcb511_1137x799.png 848w, https://substackcdn.com/image/fetch/$s_!Zz9q!,w_1272,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa54724-3210-438a-82f2-ca5f6adcb511_1137x799.png 1272w, https://substackcdn.com/image/fetch/$s_!Zz9q!,w_1456,c_limit,f_webp,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa54724-3210-438a-82f2-ca5f6adcb511_1137x799.png 1456w" sizes="100vw"><img src="https://substackcdn.com/image/fetch/$s_!Zz9q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa54724-3210-438a-82f2-ca5f6adcb511_1137x799.png" width="1137" height="799" 
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/faa54724-3210-438a-82f2-ca5f6adcb511_1137x799.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:799,&quot;width&quot;:1137,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:104368,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:&quot;image/png&quot;,&quot;href&quot;:null,&quot;belowTheFold&quot;:true,&quot;topImage&quot;:false,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!Zz9q!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa54724-3210-438a-82f2-ca5f6adcb511_1137x799.png 424w, https://substackcdn.com/image/fetch/$s_!Zz9q!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa54724-3210-438a-82f2-ca5f6adcb511_1137x799.png 848w, https://substackcdn.com/image/fetch/$s_!Zz9q!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa54724-3210-438a-82f2-ca5f6adcb511_1137x799.png 1272w, https://substackcdn.com/image/fetch/$s_!Zz9q!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2Ffaa54724-3210-438a-82f2-ca5f6adcb511_1137x799.png 1456w" sizes="100vw" loading="lazy"></picture></div></a></figure></div><p>Hopefully, after this pretty long expos&#233;, the reader has a better view of the landscape of cloud native CPUs. <strong>Three main categories emerge:</strong></p><blockquote><p>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <strong>The three biggest CSPs</strong>, which can achieve <strong>lower TCO and better control over their supply chains</strong> with their in-house Arm CPU designs. Just like Apple, it will probably be very hard to convince them to come back to commercial offerings, as they have enough volume to justify the cost associated with in-house silicon development. <strong>AWS is way ahead</strong> of Microsoft and Google in this regard.</p><p>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <strong>Ampere Computing</strong>, the only company left offering commercial Arm datacenter CPUs. With its <strong>choice of custom Arm cores</strong>, it may have designed itself into a corner, especially in terms of <strong>time to market</strong>. 
<strong>AmpereOne is late</strong>, and once these chips become available, it is probably not entirely unreasonable to expect third-party benchmarks to be at least somewhat underwhelming. However, Ampere has this market basically all to itself, so its future may be safe if it can avoid further execution problems.</p><p>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <strong>The two x86 incumbents, Intel and AMD. They aren&#8217;t sitting idle either, </strong>and identified the threat of Arm datacenter CPUs long ago (these things take years to design). Since Zen2, AMD has become a pretty agile player in the datacenter CPU market thanks to its very innovative chiplet-based design, and Zen4c is definitely a design tour de force.</p></blockquote><p></p><blockquote><p>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Notwithstanding the &#8220;noise&#8221; around its very disappointing 2Q24 financial results,<strong> Intel is back, </strong>with a truly <strong>competitive cloud native CPU</strong> offering (for the first time since <a href="https://www.newsandanalysis.net/p/a-series-about-intel-part-one-the">2019</a>) implemented on a truly <strong>competitive in-house process node</strong>, Intel 3. 
This is a good sign for Gelsinger&#8217;s turnaround, but the road will be <a href="https://www.nextplatform.com/2024/08/02/the-resurrection-of-intel-will-take-more-than-three-days/">long and hard</a>.</p><p>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; Time to market (TTM) is of paramount importance here.</p><p>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; <strong>Arm&#8217;s CSS IP has reached the perfect sweet spot between TTM accelerant and customizability for customers</strong>, and will certainly continue being successful in the future.</p><p>&#183;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp; If Ampere Computing really doesn&#8217;t manage to right itself, much more powerful players like Qualcomm or Nvidia may decide to enter the fray.</p></blockquote>]]></content:encoded></item><item><title><![CDATA[A series about Intel. 
Part One: The Fall]]></title><description><![CDATA[About the series:]]></description><link>https://www.newsandanalysis.net/p/a-series-about-intel-part-one-the</link><guid isPermaLink="false">https://www.newsandanalysis.net/p/a-series-about-intel-part-one-the</guid><dc:creator><![CDATA[François Cattelain]]></dc:creator><pubDate>Thu, 11 Jul 2024 14:16:29 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/595cd197-38d1-4d63-882e-55d3a1d912d4_3000x3000.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>About the series:</strong></p><p>To better defend our hypothesis that <strong>Intel Foundry will probably be spun off once 18A ships in volume and becomes profitable</strong> (more on that in the follow-up articles), this series about the Santa Clara company will first examine the company&#8217;s execution in the last decade, including the infamous 10nm catastrophe and how it all started.</p><p></p><p><strong>Acknowledgement:</strong></p><p>This article and the following pieces wouldn&#8217;t have been possible without the numerous articles authored by:</p><p>_ Writers at <a href="https://www.anandtech.com/">anandtech.com</a> including Anand Lal Shimpi, Ian Cutress, Ryan Smith, Andrei Frumusanu and others.</p><p>_ The <a href="https://www.servethehome.com/an-arm-opportunity-with-cloud-service-providers/">very friendly</a> Patrick Kennedy from <a href="https://www.servethehome.com/">servethehome.com</a></p><p>_ The <a href="https://web.archive.org/web/20160229211100/http:/www.theinquirer.net:80/inquirer/news/1004573/charlie-demerjian-awful-gambling-intel-ceo-confirms">legendary</a> Charlie Demerjian from <a href="https://www.semiaccurate.com/">semiaccurate.com</a> (some content not behind the paywall)</p><p>_ Michael Larabel from the highly recommended <a href="https://www.phoronix.com/">phoronix.com</a></p><p>_ David Schor from the very well-informed <a 
href="https://fuse.wikichip.org/">fuse.wikichip.org</a></p><p>Big thanks to them for making valuable information available to the public.</p><p></p><h3><strong>How it all started: 14nm Broadwell-Y stepping E0 from late 2014</strong></h3><p>It is a very well-documented fact by now that the last few years have not been easy for Intel. Arguably, it all started around the year 2014, 10 years ago. Whatever happened from this point, it was mostly a management problem. Intel&#8217;s engineers didn&#8217;t suddenly become incompetent from one day to the next. Management however&#8230;</p><p>2014 was the year Intel was supposed to launch its 14nm process node. It did, but only in a contrived and dishonest way. And that&#8217;s how it all started. Broadwell CPUs, the first 14nm processors from Intel, were effectively launched late in 2014, but only in small volume, and only the Broadwell-Y variants were launched. These were very low power processors (4.5 W TDP), tailored for lightweight laptops. The number of SKUs launched was pretty small, too: <a href="https://www.anandtech.com/show/8475/intels-core-m-strategy-cpu-specifications-for-9mm-fanless-tablets">as few as three</a>. They all had the E0 stepping (a stepping is basically a revision of the design which mostly doesn&#8217;t impact functionality but can greatly improve yield and thus manufacturability). However, at the start of 2015, Intel discontinued these three processors and replaced them with updated SKUs sporting <a href="https://www.kitguru.net/components/cpu/anton-shilov/intel-to-discontinue-first-core-m-broadwell-chips-ahead-of-launch/">a new F0 stepping</a>, and complicit OEMs duly <a href="https://www.anandtech.com/show/9061/lenovo-yoga-3-pro-review">updated</a> their laptops too. Basically, Intel played a dirty trick on everyone, most of all on their shareholders.</p><p>Why would the company do this? It would seem that it was all about upper management pay incentives. 
Part of the upper echelons&#8217; compensation package would be tied to &#8220;performance&#8221;. Performance here being defined as the ability to launch a new generation of product by a fixed date (i.e. before year end xx). But, as the incentives were seemingly very poorly defined, it looks as though it didn&#8217;t really matter if the volume was extremely low and if the launch was mostly fake: all you had to do to get your end-of-year bonus was to &#8220;launch&#8221; the new generation before December 31<sup>st</sup> of said year. A few months later, the real products with a new stepping enabling better yield and higher volume would finally launch.</p><p>Basically, the company appears to have cheated, with the complicity of a few laptop OEMs, so that upper management would get their performance bonus. Lies about process node readiness and real-world yield were spread around everywhere. It all started innocuously enough: by 2015, when real 14nm volume began to materialize in the supply chain, Intel was still clearly far ahead of the <a href="https://www.anandtech.com/show/9665/apples-a9-soc-is-dual-sourced-from-samsung-tsmc">competition</a> (TSMC and Samsung). All was well.</p><p>But bad habits took root. On top of everything else, management likely got used to not listening to the engineering side of the business. Why would they? In the end, the company&#8217;s engineers always ended up getting it right anyway, even if after a not-so-damaging delay, since the company was so dominant technology-wise at the time. 
In other words, complacency and poor management appear to have taken root at that point in time.</p><h3><strong>How it got way worse: the shameful ghost of Cannon Lake at the end of 2018</strong></h3><p>What Intel&#8217;s management did with the 14nm process node, it apparently tried to repeat with the 10nm process node: lie to everybody (shareholders, financial analysts, the press) about real world volume manufacturability, launch a fake SKU before year end to get their performance-tied bonus, and then count on the manufacturing side of the business to eventually get it right (even if that was many months later than originally envisioned), because ultimately, they always do, right?</p><p>Wrong. At 10nm, the company&#8217;s entire manufacturing roadmap essentially collapsed, with terrible consequences for everyone involved. In more detail, Intel launched a mostly fake 10nm SKU, the <a href="https://www.anandtech.com/show/13405/intel-10nm-cannon-lake-and-core-i3-8121u-deep-dive-review">Core i3-8121U</a>, just before the end of year 2018, with the same <a href="https://www.anandtech.com/show/12749/first-10nm-cannon-lake-laptop-spotted-online-lenovo-ideapad-330-for-449">complicit Chinese OEM</a> as last time. Note that this time there was only one SKU, and the integrated GPU was disabled (it wasn&#8217;t working). What is more, four years had passed since Broadwell&#8217;s launch, and that was twice the delay prescribed by the famous <a href="https://www.anandtech.com/show/9447/intel-10nm-and-kaby-lake">tick tock</a> cadence. In other words, the situation was much worse than it was for the 14nm process node launch.</p><p>Cannon Lake never really existed as a product line. There were never any other Cannon Lake CPUs launched, and the product line was later killed. 
More precisely, it was &#8220;killed with fire&#8221;, as <a href="https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/6899">this commit</a> shows (found via <a href="https://www.phoronix.com/news/Intel-Kills-Gen10-Cannonlake">Phoronix</a>).</p><p><a href="https://twitter.com/CDemerjian">Charlie Demerjian</a> from semiaccurate.com provided <a href="https://semiaccurate.com/tag/10nm/page/4/">excellent</a> <a href="https://semiaccurate.com/tag/10nm/page/5/">coverage</a> of Intel's 10nm disaster at the time (see for example <a href="https://www.semiaccurate.com/2016/12/28/coffee-lake-says-dire-things-intels-10nm-problems/">here</a>, <a href="https://www.semiaccurate.com/2017/04/04/intels-hyperscaling-is/">here</a>, <a href="https://www.semiaccurate.com/2017/09/11/semiaccurate-digs-intels-10nm-process-problems/">here</a>, and <a href="https://www.semiaccurate.com/2018/05/29/is-intels-upcoming-10nm-launch-real-or-a-pr-stunt/">here</a>), even though what he wrote at the time probably seemed hard to believe for many people. How could Intel, the undisputed king of semiconductor manufacturing until then, fail so spectacularly? Part of the answer is probably that management got completely careless and complacent about manufacturing issues (see above).</p><p>Cannon Lake was the lowest point in Intel&#8217;s journey through 10nm hell: a single SKU, only partially working (the integrated GPU was disabled), in a line-up that was dead on arrival. It is hard to see any reason for it to exist, except for upper management to get their end-of-year bonus. The ultimate shameful product.</p><h3><strong>Muddling through with 10nm Ice Lake: low volume and dubious profitability backed up by 14nm products</strong></h3><p>With Ice Lake Intel finally managed to launch a full line-up of 10nm CPUs, first for laptops, and then for servers. There were no 10nm Ice Lake desktop CPUs from the company, but more on that below. 
To better understand how Intel managed the consequences of the industrial catastrophe that was 10nm, we will have to distinguish between these three different categories of CPUs: laptop, server, and desktop, in that order. There is a common theme between Ice Lake for laptops (Mobile Ice Lake) and Ice Lake for servers (Ice Lake SP, for Scalable Processor): both line-ups coexisted with a concurrent range of 14nm products (Comet Lake for Mobile Ice Lake, and Cascade Lake SP for Ice Lake SP). This allowed the company to save face by launching real 10nm products, all the while preserving both its profits and its ability to ship real volume to its customers by continuing to rely heavily on 14nm products for a majority of its volume.</p><h3><strong>10nm for laptops: allowing the competition to catch up</strong></h3><p>Approximately 8 months after Cannon Lake&#8217;s so-called launch, Intel finally managed to roll out the 10nm Mobile Ice Lake range of CPUs, <strong>in August 2019</strong>. At that point, it <a href="https://www.anandtech.com/show/15213/the-microsoft-surface-laptop-3-showdown-amd-picasso-vs-intel-ice-lake/7">bested</a> AMD&#8217;s offering of the time, codenamed Picasso. But that didn&#8217;t last long, as in May 2020 AMD launched Renoir, the successor to Picasso, and this line-up was <a href="https://www.notebookcheck.net/Renoir-laptops-Beating-Intel-at-a-third-of-the-price-half-the-weight-and-a-fraction-of-the-power.477274.0.html">clearly better</a> than Mobile Ice Lake. Of course, judging the merits of different laptop platforms is way more complicated than this, as you have to take into account single-thread CPU performance, multi-thread CPU performance, GPU performance, features like video codec capabilities and connectivity, and efficiency in many different scenarios. What is more, everything ultimately depends on how well thought out and refined the OEM implementation of said platform actually is. 
And that certainly was <a href="https://www.anandtech.com/show/9319/amd-launches-carrizo-the-laptop-leap-of-efficiency-and-architecture-updates">a problem</a> for AMD at the beginning of its comeback in the mobile market, as its laptop processors had traditionally been confined to low-end products with lousy specs and unsatisfactory fit and finish.</p><p>But there is no denying that Renoir allowed AMD to best Intel&#8217;s Ice Lake in the most important metrics, and heralded the company&#8217;s <a href="https://www.anandtech.com/show/15213/the-microsoft-surface-laptop-3-showdown-amd-picasso-vs-intel-ice-lake/7">comeback</a> as a serious and credible contender at the high end of the laptop market. And this was all because Intel&#8217;s entire roadmap had basically exploded at 10nm, allowing its main competitor to stage a long, hard and overdue comeback.</p><p>Back to the matter at hand, Ice Lake wasn&#8217;t even the only mobile offering from Intel at that time. Most of its laptop products during that period were in fact 14nm CPUs codenamed Comet Lake. The company proceeded with this confusing mash-up (see <a href="https://www.anandtech.com/show/15385/intels-confusing-messaging-is-comet-lake-better-than-ice-lake">this excellent article</a> from Ian Cutress for more details) almost certainly because it simply could not afford to have all its laptop CPUs be from the 10nm Ice Lake range, as these were probably not profitable enough, especially compared to good old 14nm ones. In other words, Intel managed to launch a full 10nm mobile line-up with Ice Lake, but it still didn&#8217;t manage to launch an entirely profitable one. 
Which is just another way of saying: 10nm still wasn&#8217;t ready to succeed 14nm at that time, even for mobile chips that traditionally favor newer nodes.</p><h3><strong>The situation for servers: losing face against AMD&#8217;s Zen 2 and Zen 3 products</strong></h3><p>As for servers, the 10nm disaster at Intel and ensuing chaos allowed its arch-rival AMD to stage an even more incredible comeback than in the laptop domain. Arguably, this fantastic resurgence relied heavily on the intrinsic qualities of AMD&#8217;s Zen architecture, including its mind-boggling <a href="https://www.anandtech.com/show/11170/the-amd-zen-and-ryzen-7-review-a-deep-dive-on-1800x-1700x-and-1700">52% IPC increase</a> over the previous generation (Zen first launched on the desktop in March 2017). Still, when AMD launched its second-generation Zen CPUs for servers, codenamed Rome, in <strong>August 2019</strong>, it delivered a <strong>Knockout</strong> to its main competitor, as Patrick Kennedy from servethehome.com so <a href="https://www.servethehome.com/amd-epyc-7002-series-rome-delivers-a-knockout/">brilliantly formulated</a> at the time. This was also due to its very innovative architecture, which arguably heralded the start of the chiplet era. Back then, Intel only had its 14nm Cascade Lake SP available to compete. It didn&#8217;t take long for the company to adjust to the new reality by slashing prices by up to 60%. However, in typical Intel fashion, the company didn&#8217;t really slash the prices of its existing products. Instead, it refreshed its Cascade Lake SP line-up with mostly identical parts that had different names and <a href="https://fuse.wikichip.org/news/3352/intel-refreshes-2nd-gen-xeon-scalable-slashes-prices/">much lower prices</a>. 
This was the company&#8217;s typical behavior at that time: avoiding bad press at all costs, doing everything in its power to not spook shareholders, even if it involved misleading &#8211; some would say borderline dishonest &#8211; behavior.</p><p>But Intel&#8217;s journey through 10nm hell didn&#8217;t end there for server CPUs. The company didn&#8217;t manage to launch its 10nm Ice Lake SP line-up before AMD struck a second time with <a href="https://www.servethehome.com/amd-epyc-7003-milan-the-fast-gets-faster/">Milan</a>, its third generation Zen product for servers, which launched in March 2021. What is worse, the first wave of Ice Lake SP CPUs &#8211; <a href="https://www.servethehome.com/intel-xeon-ice-lake-edition-marks-the-start-and-end-of-an-era/">launched</a> in <strong>April 2021</strong> &#8211; seems to have existed partially just for show (at least for those that could easily be swapped for a 14nm equivalent in terms of core count), as this <a href="https://www.servethehome.com/on-ice-lake-intel-xeon-volumes-and-market-penetration-q3-2021/">fascinating piece</a> (again from Patrick Kennedy at servethehome.com) shows. Dated October 2021, this article basically explains that by that date, it was still pretty difficult to buy an Ice Lake based server with 28 cores or fewer from many big OEMs. All you could easily find at the time (6 months after the official Ice Lake SP launch date) was still only a 14nm Cascade Lake SP based system.</p><p>So not only did Intel lose all technical credibility versus AMD in the server world with vastly inferior products starting in mid-2019, but when it finally launched its 10nm server CPUs in Q2 2021, these were not shipped in real volume nor were they apparently as profitable as their in-house 14nm counterparts (just like with Ice Lake for laptops). 
It is important to remember at this point that Intel&#8217;s 10nm process was <a href="https://www.anandtech.com/show/13405/intel-10nm-cannon-lake-and-core-i3-8121u-deep-dive-review/2">originally</a> supposed to launch in volume in 2016-2017. Add the obligatory two-year delay for a new process node to trickle down to server CPUs (which are bigger and harder to make), and this brings us to a 2018-2019 window for Intel to launch its 10nm server CPUs (if everything had gone well).</p><p>Since it is reasonable to consider that Ice Lake SP wasn&#8217;t a real 10nm server CPU line-up, in the sense of being really shippable in volume while maintaining profits, the real deal from Intel only came out in January 2023, in the form of the 10nm <a href="https://www.servethehome.com/4th-gen-intel-xeon-scalable-sapphire-rapids-leaps-forward/">Sapphire Rapids</a> (more on that in the next article in the series). From 2018-2019 to January 2023, that&#8217;s<strong> a four-year delay</strong> for a real, profitable 10nm server CPU range to be launched by Intel, compared to what was originally envisioned years before, had the company been able to maintain its two-to-three-year process node rollout cadence. This delay is the sign of a major industrial accident happening in a very badly mismanaged company. But we will come back to this point later, looking at the financial side of the story. For now, let&#8217;s transition to&#8230;</p><h3><strong>The 10nm disaster on the desktop: 6 years for a part to appear!</strong></h3><p>When a new process node is launched, it generally brings more in terms of power usage reduction than higher frequency capabilities. This has been true for all new process nodes from the big three (Intel, TSMC, Samsung) for more than a decade at least. Intel&#8217;s 10nm process node was no exception. Since 10nm was so hard for Intel to get right, making a competitive, high frequency desktop 10nm part was even harder. 
Indeed, to grossly oversimplify, compared to laptop or server parts, desktop parts are all about high frequency. Hence the fact that Intel&#8217;s 10nm desktop parts took so long to appear. We won&#8217;t <a href="https://www.anandtech.com/show/10610/intel-announces-7th-gen-kaby-lake-14nm-plus-six-notebook-skus-desktop-coming-in-january">get</a> <a href="https://www.anandtech.com/show/11869/intel-announces-8th-generation-coffee-lake-hex-core-desktop-processors">into</a> <a href="https://www.anandtech.com/show/14256/intel-9th-gen-core-processors-all-the-desktop-and-mobile-45w-cpus-announced">much</a> <a href="https://www.anandtech.com/show/14782/intel-launches-comet-lakeu-and-comet-lakey-10th-gen-core-for-low-power-laptops">detail</a> <a href="https://www.anandtech.com/show/16205/intels-11th-gen-core-rocket-lake-detailed-ice-lake-core-with-xe-graphics">here</a>, but it took until <strong><a href="https://www.anandtech.com/show/17047/the-intel-12th-gen-core-i912900k-review-hybrid-performance-brings-hybrid-complexity">Alder Lake&#8217;s launch in November 2021</a></strong> for a proper 10nm desktop part to appear. Knowing that the first Intel 14nm desktop SKUs launched in <a href="https://www.anandtech.com/show/9483/intel-skylake-review-6700k-6600k-ddr4-ddr3-ipc-6th-generation">August 2015</a> in the form of Skylake, that&#8217;s more than six long years for a 10nm desktop part to appear, which is the equivalent of several eternities in the industry.</p><p>Still, contrary to what happened for laptops and for servers, this delay did not entirely prevent Intel from staying competitive in this particular market, at least not in terms of pure performance. 
However, by staying on 14nm, all the while keeping basically the same micro-architecture (going from Skylake to Comet Lake), increasing core count from 4 to 10, and increasing maximum 1-core frequency by 1GHz, Intel had to forgo any pretension of staying competitive in terms of power efficiency, especially as rival AMD kept on rolling out ever more ambitious (and efficient) Zen-based desktop products.</p><p>More to the point, Intel&#8217;s previously described strategy of endlessly respinning the same micro-architecture finally hit a brick wall after Comet Lake, and it eventually had to do the once unthinkable in March 2021: <a href="https://www.anandtech.com/show/16495/intel-rocket-lake-14nm-review-11900k-11700k-11600k/2">port a micro-architecture</a> designed for 10nm to the 14nm node. This was a spectacular sign that even as late as Q1 2021, its 10nm process node still wasn&#8217;t mature enough to allow for the implementation of a very high frequency desktop part.</p><h3><strong>Consequence #1: lost competitiveness at the end of the tunnel</strong></h3><p>Just like good things, all bad things come to an end, including Intel&#8217;s journey through 10nm hell. After the relative embarrassment that was Ice Lake, the company finally managed to come up with a version of its 10nm technology that was profitable enough to replace the entirety of its previous 14nm line-ups. This almost certainly involved some kind of rework of the physical implementation of said 10nm process node, but there is no public information on the matter, and these technical details are frankly beyond the scope of this article. 
These new products launched in a staggered way, as is usual: <strong><a href="https://www.anandtech.com/show/16084/intel-tiger-lake-review-deep-dive-core-11th-gen/19">Tiger Lake</a></strong> (for laptops) launched in September 2020; <strong>Alder Lake</strong> (for desktops) in November 2021; and finally <strong>Sapphire Rapids</strong> (for servers) in January 2023.</p><p>We will further examine Alder Lake, Sapphire Rapids, Raptor Lake and Meteor Lake (both successors to Alder Lake) in the next article of the series, which explores Intel&#8217;s more recent past. However, let&#8217;s just say for now that after all these trials and tribulations, in Q1 2024, Intel still isn&#8217;t competitive versus AMD in servers (and hasn&#8217;t been since Q3 2019 and the launch of AMD&#8217;s Rome). As for laptops, the Santa Clara company is in very clear danger of losing mind share and market share to AMD after a <a href="https://www.notebookcheck.net/Intel-Meteor-Lake-Analysis-Core-Ultra-7-155H-only-convinces-with-GPU-performance.783320.0.html#toc-12">clearly disappointing Meteor Lake</a>. 
And finally, Intel could end up losing badly to <a href="https://www.anandtech.com/show/21251/amd-zen-5-based-cpus-for-client-and-server-applications-due-in-2024">Zen5 based products</a> on the desktop, knowing that it probably won&#8217;t be able to launch <a href="https://www.tomshardware.com/news/intel-displays-arrow-lake-wafer-with-20a-process-node-chips-arrive-in-2024">Arrow Lake</a> before AMD strikes first with Granite Ridge.</p><p>We will come back to all these shenanigans in the next articles of the series, but for now it can be clearly concluded that all the delays incurred by Intel during its 10nm catastrophe have cost it dearly in terms of competitiveness, with consequences still playing out to this day, and probably well into 2025, too.</p><h3><strong>Consequences #2 &amp; #3: 14nm shortages and Intel Foundry launch cancelled</strong></h3><p>The idea of Intel as a third-party foundry is <a href="https://community.arm.com/arm-community-blogs/b/architectures-and-processors-blog/posts/arm-ip-and-intel-custom-foundry-collaboration-a-new-era-for-premium-mobile-design">anything but new</a>. As <a href="https://www.intel.com/content/dam/www/public/us/en/documents/pdf/foundry/sunit-rikhi-keynote-semi-gartner-market%20symposium-presentation.pdf">this 2014 presentation</a> from Intel shows, it is, in fact, more than ten years old. The reasoning &#8211; which was already valid a decade ago &#8211; is as follows: semiconductor manufacturing facilities are incredibly expensive, and to keep them profitable you need to maintain a high utilization rate. This is pretty hard to do if said facilities serve only a single company, even one as big and diversified as Intel was ten years ago. Indeed, the business of chip making is no stranger to up and down cycles, and this has always been the case. 
When you are the sole owner and operator of crazily expensive semiconductor manufacturing facilities, your profitability is <a href="https://www.semiaccurate.com/2016/08/17/intel/">at the mercy of any downturn</a> your internal chip-making business may encounter. If, however, you diversify the manufacturing side of your business to serve third-party customers, then you gain a hedge against any deterioration your own chip-making may stumble upon.</p><p>When Intel&#8217;s 10nm roadmap crashed and burned, so did its foundry plans: the new 10nm node was very late, with low volume and disappointing performance. Not ideal to entice new customers. Even worse: as real and truly profitable 10nm capacity basically materialized with an approximately five-year delay (from a supposed launch in 2016 to a real-world launch around 2021), Intel had no choice but to continue to rely on good old 14nm for that period. The problem was that this was never the original plan, and so capacity for 14nm became extremely scarce. It didn't help, of course, that demand for datacenter CPUs was booming at the time, and that Intel still had most of that market all to itself during that period (that remained the case even after AMD&#8217;s Rome launch in Q3 2019, more on that below). Hence the <a href="https://www.anandtech.com/show/15162/dell-intel-cpu-shortages-worsened-in-q4-premium-commercial-pcs-impacted">crushing shortages</a> of 14nm capacity at the time. So, during that period the company had no real 10nm capacity whatsoever, and a terrible shortage of 14nm capacity. Intel thus had no choice but to cancel the launch of its foundry services.</p><p>This, of course, was never officially announced. News of Intel Custom Foundry (as it was called at the time) simply ceased to appear. 
However, many big names in the industry got <a href="https://semiaccurate.com/2017/09/06/intel-foundry-customer-bails/">badly burned</a> in the process, notably <a href="https://www.cnet.com/tech/tech-industry/intel-cisco-strike-chip-deal-intel-official-reportedly-says/">Cisco</a> and <a href="https://www.extremetech.com/computing/233886-intel-will-fab-arm-chips-for-lg-on-upcoming-10nm-foundry-node">LG</a>.</p><h3><strong>Consequence #4: The IP pipeline got stalled</strong></h3><p>Furthermore, the consequences for Intel didn&#8217;t stop there. When its manufacturing roadmap was rendered essentially invalid by the 10nm disaster, its architectural roadmap was also severely impacted. Indeed, the company hadn&#8217;t planned on its manufacturing wing essentially stalling for more than four years. And as all of the new IP that Intel had planned to launch starting in 2016 was supposed to be implemented in 10nm, the launch of all said IP was consequently badly delayed. The most spectacular example of this delay is Intel&#8217;s PCIe gen 4 IP, which came out very late, especially on servers where it was most strategically needed: there was a seven-quarter delay between AMD&#8217;s rollout of PCIe gen 4 with Rome and Intel&#8217;s deployment of the same IP in servers with Ice Lake SP. Again, seven quarters is almost the equivalent of a generation&#8217;s entire lifespan. This kind of delay has a big impact on competitiveness, and shows how the catastrophe at 10nm had compounding effects for Intel, with manufacturing difficulties leading to microarchitectural delays, worsening an already bad situation.</p><h3><strong>The inescapable parallel with the Boeing Company</strong></h3><p>Before we reach the conclusion of part one, there is one last very interesting angle to examine regarding Intel&#8217;s 10nm disaster, and that&#8217;s the inescapable parallel with the Boeing Company. 
Indeed, here are two (former) icons of American manufacturing excellence that have badly lost their way because of management gone astray. Obviously, the comparison only goes so far, but there are striking similarities.</p><p>The similarities first: &#8220;icons of American manufacturing excellence&#8221; isn&#8217;t overplaying it. Up until the 10nm era, Intel was the undisputed king of semiconductor manufacturing, holding an approximately two-year lead over its main competitors in that arena (Samsung and TSMC). The peak of this dominance was reached when Intel introduced FinFET at <a href="https://www.anandtech.com/show/4313/intel-announces-first-22nm-3d-trigate-transistors-shipping-in-2h-2011">22nm</a>, when everyone else was still implementing planar designs (with <a href="https://www.anandtech.com/show/8718/the-samsung-galaxy-note-4-exynos-review/2">disappointing</a> <a href="https://www.anandtech.com/show/9837/snapdragon-820-preview">results</a>). During its long history, Intel has always <a href="https://www.eetimes.com/otellini-named-new-president-and-coo-at-intel/">cultivated</a> <a href="https://www.oregonlive.com/business/2012/11/intel_ceo_paul_otellini_will_r.html">excellence</a> as a company-wide ethos. The same can basically be said of Boeing: when passengers put their lives into your hands, you don&#8217;t normally fool around. The company has long had a reputation for <a href="https://www.boeing.com/sustainability/values">engineering excellence</a>, and its historical role in the US industrial landscape earned it a special place in the USA&#8217;s self-image.</p><p>And now for the bad part. It would seem that, just like Intel, Boeing mostly <a href="https://www.nytimes.com/2019/06/01/business/boeing-737-max-crash.html">lost its way</a> because of management issues. 
In both cases, management apparently became more interested in short-term profits than in investing for the long run to maintain engineering excellence as an advantage over the competition. Obviously, the problem at hand is far more complicated than that, as these are two very big companies, and properly steering them in a complex and moving strategic environment is incredibly difficult. Besides, let&#8217;s not forget that hindsight is 20/20. However, in both cases, it seems pretty obvious that management lost its way by neglecting the manufacturing side of the business, and letting the culture of engineering excellence <a href="https://www.nytimes.com/2024/01/23/opinion/boeing-737max-alaska-airlines.html">slowly wither away</a>, all in the name of the single-minded pursuit of short-term profits.</p><p>That being said, there are also important differences between the two companies. When Intel&#8217;s manufacturing roadmap crashed and burned at 10nm, it still managed to generate record profits at the time, with annual net income of around <a href="https://www.macrotrends.net/stocks/charts/INTC/intel/net-income">$20B to $21B</a> in the four years from 2018 to 2021 (the previous record was $13B; more on that discrepancy between industrial catastrophe and record profits below). The same cannot be said of Boeing, which recorded a cumulative <a href="https://www.macrotrends.net/stocks/charts/BA/boeing/net-income">$21B negative net income</a> in the three years from 2020 to 2022 (even though Covid certainly played a part). And contrary to Intel&#8217;s board, Boeing&#8217;s board has only recently &#8211; and grudgingly &#8211; accepted the fact that management needed a radical change at the top, and said change won&#8217;t be implemented before the end of 2024. The start of the Gelsinger era at Intel, under its new CEO, will be examined in the next articles in this series. 
Two other big differences are that, contrary to Boeing&#8217;s customers, Intel&#8217;s customers don&#8217;t directly put their lives in the company&#8217;s hands, and that, somewhat tangentially, the failures at Boeing can also be explained by the <a href="https://www.nytimes.com/2019/03/26/us/politics/boeing-faa.html">failures</a> of its public regulator, the FAA.</p><h3><strong>Profits, market share and the inertia of it all</strong></h3><p>Before we conclude this part, we must talk about the elephant in the room: the record profits that Intel managed to generate (and the still <a href="https://www.theregister.com/2024/02/09/amd_grew_market_share_in/">very high</a> market share that it managed to maintain) right in the middle of what is being called an industrial catastrophe in these pages. Why make such a big deal of a so-called disaster at 10nm if the company managed to earn so much money right in the middle of it? There are several reasons for this.</p><h3><strong>First, semiconductor design and manufacturing is a high-inertia industry.</strong></h3><p>It takes years to finalize and validate a design, and many more quarters to finally ship a profitable product in volume. And that&#8217;s assuming everything goes according to plan. Any problem with the design incurs a two- to three-month delay, and things can get out of hand pretty fast. So, there is a pretty big delay between the time everything starts to go wrong and the time it begins to show in the financial results. All the analysts and pundits who stated that Intel couldn&#8217;t possibly be in such big trouble since it made so much money at the time (2018-2021) simply misunderstood the nature of this industry. 
This is indeed a very high-inertia industry, and it takes a lot of time for deep-rooted and ugly problems to finally show up in the financial results, especially for a company as huge, diverse and dominant as Intel was until recently.</p><h3><strong>Second, server CPU buyers are notoriously conservative.</strong></h3><p>After the Bulldozer catastrophe at AMD in 2011 (which we won&#8217;t rehash here), AMD essentially <a href="https://www.anandtech.com/show/21392/amd-hits-record-high-share-in-x86-cpus-in-q1-2024">abandoned</a> the server CPU market, allowing Intel&#8217;s market share for server CPUs to skyrocket to more than 95% in the following years and to stay there for a very long time. And even though starting from mid-2019 AMD had a clearly superior product, it took many years for its market share to slowly take off from zero and reach the 10 percent mark, and then the 20 percent mark. We should also take into account here that even with TSMC&#8217;s flexibility backing it up, AMD simply didn&#8217;t have the financial and technical capabilities to increase its production a hundredfold from one quarter to the next. But back to the matter at hand: even with a clearly superior product, it took many years for AMD to convince enterprise server CPU buyers to finally switch from Intel, because buying Intel had simply become the norm. 
One interesting thing to note here is that cloud server CPU buyers seem less conservative than their enterprise peers, and were broadly the first to initiate the switch from Intel to AMD.</p><h3><strong>Third, 14nm saved the day for Intel.</strong></h3><p>Since it managed to maintain a high (but declining) market share throughout all those years of trouble (2018-2021), Intel was able to extract maximum profits from its trusty and plentiful 14nm process node, which by then had probably become relatively cheap (having been amortized many times over since its launch and ramp), and was still at least somewhat competitive in the 2018-2019 years, especially for laptops and desktops. Indeed, TSMC&#8217;s 7nm process node, which heralded the end of Intel&#8217;s undisputed manufacturing dominance, only started to appear in this period: Apple&#8217;s A12, launched in September 2018, and AMD&#8217;s 3000 series Ryzen desktop products and second gen Epyc server products, launched in the summer of 2019, were all based on TSMC&#8217;s 7nm process node. So, for a very long time Intel managed to rely on its old but trusty 14nm process node to achieve real volume and profits, especially as demand for server CPUs skyrocketed at that time (see the part about consequences #2 and #3 above).</p><h3><strong>Fourth, it took a long time for Intel&#8217;s arch-rival AMD to finally catch up</strong></h3><p>After the disastrous launch of its Bulldozer derived line of architectures in 2011, AMD went through a near-death experience in the following years, with its server CPU market share <a href="https://www.anandtech.com/show/21392/amd-hits-record-high-share-in-x86-cpus-in-q1-2024">cratering to zero</a>, its laptop CPUs being confined to cheap and crappy models, and its desktop CPUs being simply not competitive at the high end. 
To the surprise of absolutely no one in the industry, it took many years for AMD to recover and to <a href="https://moorinsightsstrategy.com/amd-epyc-is-poised-for-big-gains-but-not-for-the-reasons-you-think/">convince</a> enterprise server CPU buyers and laptop OEMs that its new Zen based platforms were worth their while. (This is of course also a testament to the incredible success of Intel in the 2006-2018 period, when buying Intel was just the obvious thing to do for so many in the market.) In any case, after its near-death experience, AMD had become a very tiny company with very limited resources, and it had no choice but to start very small and scale up very gradually from there. Hence the Zen 1 generation comprised basically only <strong>2 dies</strong>: one 8-core die for high-end desktops and servers (scaling up to 32 cores per package in servers) and one APU die for laptops and low-end desktops. The later Zen 4 generation, however, comprises <strong>7 dies</strong> in total: 2 APU dies (Phoenix and Phoenix small), 2 compute dies (one 8-core Zen 4 die and one 16-core Zen 4c die), 2 IOD dies (one for desktops and one for servers), and one 3D cache die for products with stacked L3 cache. This increase in the number of different designs being churned out per generation is the unmistakable sign of a recovering company, slowly but surely reclaiming the capability to compete in all segments of the market. 
In any case, Intel benefited greatly from the sorry state of its arch-rival at the dawn of the Zen era, and that partly explains why Intel managed to make so much money all the while experiencing what can only be described as a very serious industrial accident, if not an outright catastrophe.</p><h3><strong>Conclusion: The 10nm catastrophe knocked Intel off its pedestal and has endangered its future</strong></h3><p>Obviously, this is an overly long introduction to our later arguments that Intel will probably spin off its foundries once 18A ships in volume and becomes profitable (and if not at 18A, <a href="https://www.semianalysis.com/p/intels-14a-magic-bullet-directed">maybe at 14A?</a>). However, the idea here was to properly place everything in its context, and to explain how the 10nm crisis at Intel came to be and what the consequences were.</p><p>The 10nm disaster at Intel is one for the history books, and it will be very interesting indeed to see if in the coming years some proper investigative journalists and/or technical experts manage to get some insider testimony on how the road to Hell was effectively built at Intel under the leadership of Krzanich (CEO from 2013 to 2018). In any case, this story (&#8220;the fall&#8221;) is also that of a company that forgot its engineering excellence roots to chase short-term profits at the expense of long-term competitiveness, and that also says something about American 21<sup>st</sup>-century capitalism and its obsession with shareholder value and PR strategies at the expense of the important stuff, like building good products. 
In any case, this industrial catastrophe will have long-lasting repercussions for the company well into 2025, and has already had very big consequences for the leading-edge semiconductor manufacturing industry, cementing TSMC&#8217;s rise as the go-to leading edge foundry, with all the geopolitical trappings that that entails.</p><p>We are however far from finished, and in part two (&#8220;Getting out of the hole&#8221;), we will examine Intel&#8217;s IDM 2.0 and 5N4Y strategies and the start of the Gelsinger era. Also, quite apart from the manufacturing side of the business, Intel began to experience a lot of difficulty churning out competitive designs on time (11 steppings for Sapphire Rapids!), and made other dubious design choices (the over-designed and over-complicated Ponte Vecchio and Meteor Lake come to mind as examples). Stay tuned for the next part!</p>]]></content:encoded></item><item><title><![CDATA[On the AI boom and Nvidia’s current hardware dominance]]></title><description><![CDATA[It is very hard these days to miss the frenzy surrounding all things AI, and Nvidia is by far the company benefiting the most from the current 
boom.]]></description><link>https://www.newsandanalysis.net/p/on-the-ai-boom-and-nvidias-current</link><guid isPermaLink="false">https://www.newsandanalysis.net/p/on-the-ai-boom-and-nvidias-current</guid><dc:creator><![CDATA[François Cattelain]]></dc:creator><pubDate>Fri, 13 Oct 2023 07:19:05 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/32f66e24-3814-4606-aed7-5697051a2d91_1080x1080.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>It is very hard these days to miss the frenzy surrounding all things AI, and Nvidia is by far the company benefiting the most from the current boom. The firm&#8217;s margin in the second quarter of this year (Q2 FY2024 in financial speak) was <a href="https://www.anandtech.com/show/20024/nvidia-reports-q2-fy2024-earnings-13b-revnues-blows-past-records-on-absurd-data-center-demand">70 per cent</a>, and even taken in the historical context of the semiconductor industry, it&#8217;s a huge number. In fact, it&#8217;s an even bigger margin than what famed semiconductor juggernaut <a href="https://www.macrotrends.net/stocks/charts/INTC/intel/gross-margin">Intel managed</a> in the heyday of its near-monopoly on x86 datacenter CPUs (general purpose processors).</p><p>Nvidia&#8217;s very high margin this last quarter is the sign of a clearly overheated market for datacenter GPUs (AI accelerators, to grossly simplify), where demand far outstrips supply and where prices are consequently severely inflated. 
What is more, the delivery time for Nvidia&#8217;s latest and greatest, the H100 GPU, is hovering <a href="https://www.nextplatform.com/2023/09/01/dell-making-the-most-of-its-gpu-allocations-like-everyone-else/">around 40 weeks</a>.</p><p>After the overblown hype that we have witnessed these past years following the advent of <a href="https://arstechnica.com/gadgets/2018/12/dont-buy-a-5g-smartphone-at-least-not-for-a-while/">5G</a> and <a href="https://www.theverge.com/2022/10/26/23423998/argo-ai-shut-down-ford-vw-av-self-driving">autonomous vehicles</a>, and putting aside the seemingly utter madness that cryptocurrencies constitute, asking whether or not the current mania surrounding AI amounts to a bubble would seem like a rather legitimate question. This is of course not to say that 5G, autonomous vehicles and AI are useless or irrelevant technologies. They are not. And they will certainly change the world profoundly. But that doesn&#8217;t necessarily prevent their respective <a href="https://www.gartner.com/en/information-technology/glossary/hype-cycle">hype cycles</a> from leading to unrealistic expectations, potentially resulting in market disruptions (like an unforeseen decrease in demand) and maybe even financial losses for some over-ambitious start-ups.</p><h3><strong>Disruptions may be inevitable</strong></h3><p>That demand for AI compute may somewhat subside once the initial enthusiasm recedes is probably not outside the realm of possibility. First of all, Artificial Intelligence is hard. Reaching satisfactory results requires much more time, effort, and scarce financial and human capital than what was required to run a successful online business during the dot-com bubble at the start of the century. 
Some companies may underestimate these costs and difficulties, while some others still seem to be fooling around with <a href="https://www.theguardian.com/world/2023/aug/10/pak-n-save-savey-meal-bot-ai-app-malfunction-recipes">dubious results</a>.</p><p>Then there is the matter of the datasets (think of them as the raw material &#8211; made of data &#8211; from which AI models are built). No matter how much compute you throw at the problem, your end result will only be as good as your starting dataset. Depending on the application, some companies may underestimate the time and effort required to build one that is <a href="https://venturebeat.com/ai/why-data-remains-the-greatest-challenge-for-machine-learning-projects/">good enough</a>.</p><p>Compounding these factors are the huge costs currently associated with AI compute. Many businesses are currently confined to the cloud, either for the flexibility it provides, or because they balk at the huge capital expenditures presently necessary to build their own capacity. But the cloud can be <a href="https://www.theregister.com/2023/04/11/cloud_dc_costs/">famously expensive</a>, especially when using already scarce and overpriced AI hardware. Once they discover how hard it can be to achieve genuinely good results, and how much that compute really costs, some companies may simply scale back their AI ventures in the coming quarters.</p><p>Finally, China is yet another wild card in this broader market. The ever-tightening US sanctions on the country push many Chinese companies to buy as much Nvidia hardware as they can, as fast as they can, and sometimes <a href="https://www.tomshardware.com/news/price-of-nvidia-compute-gpu-can-hit-70000-in-china">no matter the price</a>. 
This situation results in overinflated prices, but it is obviously a temporary one.</p><h3><strong>Three broad categories of users</strong></h3><p>To better understand the matter at hand, we will simplify things by focusing only on businesses that buy datacenter GPUs by the tens of thousands, and by segmenting this market into three different and somewhat overlapping categories: first the Cloud Service Providers (Amazon&#8217;s AWS, Microsoft, Google, Oracle and so on), who provide compute capacity to their paying customers, then the Internet Giants (Google, Meta, Microsoft, etc.), who need this capacity for their own internal needs, and finally a few ambitious and extremely well-funded start-ups, namely <a href="https://openai.com/our-structure">OpenAI</a>, <a href="https://inflection.ai/about">Inflection</a> and <a href="https://www.anthropic.com/company">Anthropic</a>. For more information on these &#8220;GPU-rich&#8221; actors, <a href="https://twitter.com/dylan522p">Dylan Patel</a> from Semianalysis has an <a href="https://www.semianalysis.com/p/google-gemini-eats-the-world-gemini">excellent write-up</a> on the subject, as usual.</p><p>OpenAI, the creator of ChatGPT, doesn&#8217;t need an introduction at this point. Originally a non-profit, it has recently created a for-profit subsidiary. Inflection, for its part, is trying to build some kind of <a href="https://inflection.ai/why-create-personal-ai">AI-powered digital companion</a> that individuals would pay for. And finally, Anthropic specializes in research to help make the use of AI safer.</p><p>For the sake of completeness, one could argue that a fourth category exists, comprising all kinds of other actors (for instance in the petrochemical and pharmaceutical sectors) that need to buy tens of thousands of datacenter GPUs to help them build their end products. But in the interest of simplification, these businesses can be considered outliers. One such prominent outlier is the car manufacturer Tesla. 
In any case, the automaker is clearly an exception, as it is building its homegrown <a href="https://www.semianalysis.com/p/the-tesla-dojo-chip-is-impressive">Dojo</a> AI supercomputer to supplement the GPUs it buys from Nvidia.</p><p>Ultimately, the division into these three categories is a glaring oversimplification, but the idea at this point is to provide the reader with an accessible framework to better understand the current state of affairs.</p><p>The crux of the matter here is that the Cloud Service Providers (CSPs) have a huge number of different paying customers, each with their own distinct workloads and circumstances, and it would seem rather unlikely that the demand from all these disparate customers would suddenly decline at the same time. The idea is the same for the Internet Giants: they have an immense internal need for datacenter GPUs to process the vast trove of data they operate on, and it seems difficult to imagine a future where this need abruptly disappears.</p><p>Let&#8217;s now examine our third category of &#8220;<a href="https://www.semianalysis.com/p/google-gemini-eats-the-world-gemini">GPU-rich</a>&#8221; companies: the start-ups. These businesses mostly have a single purpose, and they are responsible for a remarkable share of the current demand for datacenter GPUs, especially relative to their size. These corporations present the most serious risk of a sudden contraction in the demand for Nvidia&#8217;s GPUs.</p><p>Predicting whether or not these start-ups will be successful is frankly beyond the scope of this article, but one should note that OpenAI is already making money from many different customers, whereas Inflection and Anthropic seem rather more like long shots. And taking into account the overinflated expectations the market has systematically generated over the past two decades regarding technological disruptions, it seems reasonable to harbor a healthy dose of skepticism. 
In any case, the idea here is to better understand where the risk of a future slackening of demand may reside, to help better quantify it.</p><h3><strong>The cushioning role of the CSPs and the Internet Giants</strong></h3><p>Back to the first two categories of buyers of datacenter GPUs: they have a structural capacity to absorb a hypothetical future demand-side shock, either due to the huge number and diversity of their customers (in the case of the Cloud Service Providers), or due to their sheer size (in the case of the Internet Giants). For some of these companies (Google, Microsoft), their dual roles would allow them to cope even more efficiently: any unused cloud GPU could then be repurposed for internal use, or vice versa.</p><p>In case of a serious market correction in the coming quarters, we may however witness some serious internal readjustments among the Internet Giants, and, for instance, a reshuffling of resource allocation among the different teams inside these very large companies, as the environment evolves and priorities shift.</p><p>As for the start-ups, even in the worst case, a bankruptcy would simply result in a sudden influx of second-hand datacenter GPUs flooding the market. At that point, the CSPs and Internet Giants, fulfilling again their cushioning role, would certainly be more than happy to snatch up such precious hardware at bargain prices.</p><p>However, a slackening in the demand for GPU compute would necessitate, on the part of these giant companies, a prolonged period of &#8220;digestion&#8221;. That would in turn severely impede Nvidia&#8217;s capacity to keep on selling such huge quantities of datacenter GPUs in the coming years, especially as a new generation of hardware is on the horizon. A successor to the current Nvidia H100 GPU is indeed expected in the 2024-2025 time frame. 
In other words, the party may not last forever for Nvidia, and demand for its next-generation accelerators may be lower in the future than what we are seeing right now, especially if broader sentiment significantly shifts in the market.</p><h3><strong>A broader look at the market for AI accelerators</strong></h3><p>What is more, even though Nvidia is by far the company that benefits the most from the current AI boom, it is not alone in the market for AI accelerators. AMD is hot on its heels with its brand-new and innovative <a href="https://www.nextplatform.com/2023/06/14/the-third-time-charm-of-amds-instinct-gpu/">MI300</a> family of products. More importantly, its accompanying software ecosystem (called <a href="https://www.amd.com/en/graphics/servers-solutions-rocm">ROCm</a>) is slowly but surely catching up with Nvidia&#8217;s famed <a href="https://developer.nvidia.com/cuda-toolkit">CUDA</a> software tools. As a matter of fact, Nvidia is famous for having more software engineers than hardware engineers, and this constitutes one of the keys to its current success.</p><p>Then there is Intel, hard at work trying to correct <a href="https://www.anandtech.com/show/15926/intel-7nm-delayed-by-6-months-company-to-take-pragmatic-approach-in-using-3rd-party-fabs">its past errors</a>. Even though these past mistakes have been mostly &#8211; but not exclusively &#8211; related to its semiconductor manufacturing activities, the company has recently reset and delayed its GPU roadmap for what can arguably be described as the third time in the last two decades. 
However, its <a href="https://www.anandtech.com/show/18756/intel-scraps-rialto-bridge-gpu-next-server-gpu-will-be-falcon-shores-in-2025">Falcon Shores</a> family of products is now expected before the year 2025 is over, and just like AMD, the company has patiently built its obligatory accompanying software ecosystem, called <a href="https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html">oneAPI</a>, over the past few years.</p><p>But that&#8217;s not all, as beyond these three incumbent players, there is a variety of hardware start-ups hungry for success, the most prominent of which is Cerebras with its very innovative <a href="https://www.anandtech.com/show/16626/cerebras-unveils-wafer-scale-engine-two-wse2-26-trillion-transistors-100-yield">Wafer Scale Engine</a>. In a remarkable achievement, Cerebras has recently won a <a href="https://www.nextplatform.com/2023/07/25/in-g42-cerebras-finds-the-deep-pockets-and-partnership-it-needs-to-grow/">$100M contract</a> from Abu Dhabi-based Group 42 for its second generation product. In the same vein, <a href="https://www.nextplatform.com/2023/09/20/sambanova-tackles-generative-ai-with-new-chip-and-new-approach/">SambaNova</a> and <a href="https://www.nextplatform.com/2023/08/02/unleashing-an-open-source-torrent-on-cpus-and-ai-engines/">Tenstorrent</a> are two other prominent competitors worth mentioning.</p><h3><strong>Competitors and customers enter the fray</strong></h3><p>However, beyond Nvidia&#8217;s competitors, it may very well be Nvidia&#8217;s own customers that will push the hardest for the company to reduce its prices, and thus bring its gross margin back to &#8220;reasonable&#8221; territory. 
They will do so either by adopting Nvidia&#8217;s competitors&#8217; products, or, more remarkably, by deciding to build their own.</p><p>Beyond the newfound structural importance of the CSPs and Internet Giants in the broader semiconductor market, the other paradigm shift in the industry these past years has been the ability of many of these companies to simply build their own products, instead of buying what the hardware vendors have to sell. For example, Apple with its A series of processors for iPhones and M series for laptops, AWS with its <a href="https://www.semianalysis.com/p/amazon-graviton-3-uses-chiplets-and">Graviton</a> CPUs, <a href="https://aws.amazon.com/machine-learning/trainium/">Trainium</a> AI accelerators and in-house SSD controllers, Google with its TPU AI accelerators, and Tesla with its homegrown Dojo AI supercomputer.</p><p>This has been made possible by the emergence this past decade of a broader ecosystem of many different companies offering distinct services to help with this endeavor: for instance, businesses like <a href="https://www.cadence.com/en_US/home/tools/ip.html">Cadence</a> and <a href="https://www.synopsys.com/designware-ip.html">Synopsys</a> offer the necessary intellectual property, <a href="https://www.marvell.com/products/custom-asic.html">Marvell</a>, <a href="https://www.broadcom.com/products/custom-silicon/asics">Broadcom</a> and <a href="https://www.guc-asic.com/en/">Global Unichip Corp</a> offer design services, and TSMC, Samsung (and <a href="https://www.anandtech.com/show/18811/intel-ifs-partners-up-with-arm-to-design-socs-on-intel-s-upcoming-18a-node">very soon Intel</a>) offer third-party manufacturing.</p><p>To summarize, a significant decrease in the demand for AI hardware in the coming quarters seems pretty unlikely at this point. What could happen, however, is that after the current manic phase comes a prolonged period of hardware digestion. 
That would lead to subdued demand for next-generation accelerators, as the Cloud Service Providers and the Internet Giants reduce their current levels of capital expenditures in this regard.</p><p>More importantly, the very structure of the semiconductor industry has an important role to play in this matter. The CSPs and the Internet Juggernauts, by their sheer size and diversity, may be able to cushion any future sudden decrease in demand. But they also weigh very heavily on the industry in one other fascinating way: instead of buying their future hardware, they may simply build it on their own, thanks to a new ecosystem of companies dedicated to making such a complex task achievable.</p><p>That would allow them to recover a bigger part of the value created by the AI revolution, instead of letting Nvidia get away with such a big slice of the cake. For now, Nvidia&#8217;s party is in full swing. But nothing lasts forever, and a 70 per cent gross margin seems simply unsustainable in the long run.</p>]]></content:encoded></item></channel></rss>