Hello again! I just got back from attending the Australasian Leadership Computing Symposium (ALCS). This was one of the first in-person conferences of the year for me, so it was great both to catch up with old friends (some of whom I'd previously only ever seen on Zoom calls) and to meet cool new people. Because ALCS is a national supercomputing conference, there was an unusually high concentration of weird computer people and, since I am also a weird computer girl, I spent a fair bit of time just shooting the breeze about fun computer esoterica. Sure, there were lots of conversations about Very Serious Topics in HPC, but those are way less fun and memorable than trying to out-nerd each other to see who can dredge up the wackiest, longest-obsolete architecture. One motif that kept popping up throughout these conversations was that “everything old is new again”: everything from hardware-level microarchitecture emulation to old-school vector machines seems to be seeing a resurgence in the HPC space in some form or another.
I obviously love weird computer stuff because it’s fun, but I think there’s a genuine benefit to looking back at weird stuff from the past and trying to figure out why something did or didn’t work. No computer company deliberately sets out to make a terrible CPU for nerdy transfemmes to semi-ironically obsess over 1; usually, the weird-seeming parts of a CPU are the result of either:
- Some historical context that everybody’s since forgotten about,
- Sincere but misguided attempts at balancing trade-offs and constraints, or,
- Good old-fashioned poor execution of ideas.
All three of these causes are worth exploring: engineering contexts and constraints tend to go in cycles, and even badly-implemented ideas can be worth another crack if the problem they’re trying to solve is still relevant. To be concrete, I’m going to use my favourite obsolete architecture, the Intel Xeon Phi co-processor, as an example throughout this piece, but the thought process should generalise well beyond it.
This is kind of a shitpost but also kind of serious? I’m mainly just writing this because I’m keyed up after the conference, but beyond that, hopefully this post will help convince you that computer archaeology is both a fun and useful hobby for the HPC practitioner (and if not, it makes for totally great conversation at nerdy cocktail parties!).
What do we mean by “weird”?
Terms like “wacky” and “weird” are primarily aesthetic judgements, but I think there’s generally an (admittedly vague) pattern behind the kinds of computers they get applied to.
Broadly speaking, a weird CPU is one that has features substantially outside the (then) current norm or the current norm’s evolutionary lineage. These days, that pretty much means “anything not Arm or x64”. Usually things get to be judged/adored by the standards of their time, like how a SPARC machine from 1999 is not necessarily weird, but a computer running SPARC in 2023 is definitely weird (and cool). Evolutionary dead-ends are great examples of this. Speaking of, let’s talk about the Xeon Phi.
Why did the Xeon Phi flop?
The title of this section may sound extremely negative, but I actually kind of love the Phi. The Intel Xeon Phi processors (codenamed Knights Corner and Knights Landing) were Intel’s attempt to gain a foothold in the HPC accelerator market without fully committing to just making a GPU. They were still x86 processors, but with “GPU-like” features such as lots of cheap hardware threads, wide 512-bit SIMD units, relatively limited instruction-level parallelism, and some high-bandwidth memory (HBM) prioritising throughput over latency. I mean, just look at this glorious thing!
The Xeon Phi was advertised as a way to get many of the benefits of using a GPU for HPC, but without having to port your codes to CUDA - just compile your regular CPU code with OpenMP and away you go. It’s an enticing idea, and one that I don’t think is entirely without merit.
Unfortunately, the Phi tended to be pretty disappointing in practice. While it was technically possible to run vanilla OpenMP codes on the Phi without any changes, the practical performance of most scientific codes tended to be really bad without major re-writes to the code’s structure. It turns out that in order to get maximum performance from a GPU-like architecture, you need to write code that looks an awful lot like CUDA in structure (even if not in syntax): the Phi required very careful management of the HBM and main memory, close attention to thread and vector instruction dispatch 2, and much more aggressive latency-hiding than a conventional CPU. The net result is that most scientific codes required so many changes that you might as well just port them to CUDA, which kind of undermined the initial premise. Intel discontinued the product line in 2018.
Like I said, I have a special fondness for the Phi because I actually did spend the effort to get AMBiT running well on it. NCI (one of Australia’s national supercomputing centres) had a whole bunch of Phi nodes just sitting idle, so I was able to have more-or-less exclusive access to run my calculations. It was a huge improvement in performance and time-to-results compared to the CPU nodes available to me at the time - I probably wouldn’t have gotten the results for at least two papers ready by my self-imposed deadline without it. But, crucially, that was back when I was a PhD student and had time to spend mucking around with this kind of thing; most academics could not spare the time, hence why the nodes were always idle.
I still have a bunch of Intel’s technical manuals and programming guides from back then and they’re some of my favourite books in my collection. Like I said, I think there were some interesting ideas in the Phi (even though the implementation was not so good) and looking back from the context of 2023, some of the design choices feel outright prescient.
Sapphire Rapids: Xeon Phi v2.0?
No, but kind of.
Let me explain.
Sapphire Rapids is the codename of Intel’s newest (as of writing) performance-focused server processor, and, while it isn’t a repeat of the Phi, it does rhyme. Compare Sapphire Rapids with Knights Landing:
- Depending on the model, the Phi had between 64 and 72 physical cores, with 4 threads per core. The Sapphire Rapids has between 32 and 56 cores, with 2 threads per core.
- Both models make heavy use of AVX-512 vector instructions.
- Both models have both main memory and a smaller amount of HBM, aiming for bandwidth rather than latency (compare to regular CPU cache memory).
- Both have kinda funky, non-homogeneous memory access performance (ring interconnect in the Phi vs chiplet architecture in Sapphire Rapids).
Obviously Sapphire Rapids has a bunch of extra features that the Phi didn’t (notably special-purpose hardware accelerators for e.g. cryptography or matrix operations), but there’s enough overlap here that I feel like this is not a coincidence. I can’t profess any special insight into Intel’s design process, but I think it’s safe to say that Sapphire Rapids is exploring a similar set of trade-offs in the design space to the Phi. A lot of the weird stuff Intel tried with the Phi was a way to efficiently utilise a huge number of cores given limited power budgets and memory speeds that lag behind compute power. AMD is also making increasingly GPU-like CPUs, but is exploring a slightly different part of the design space, focusing on latency via large caches instead.
The point of all this is not to say that the Xeon Phi was a perfect cinnamon bun of a processor, but rather to point out that the fundamental tradeoffs of manycore performance haven’t changed that much since 2018. The Memory Wall is still a thing (probably even more so now than in 2018), power constraints are still a thing, SIMD is a very efficient way to expose a lot of parallelism with comparatively little silicon, and so on.
Computers are getting weird again and it’s awesome
For most of my adult life, the fundamental design ethos of personal computers was pretty much static: it was x86(-64) CPUs, discrete GPUs (if you were lucky), DDR memory and some kind of persistent storage (either spinning rust or flash). There was a little bit more variance in the enterprise/scientific space (the Summit supercomputer running on POWER9 CPUs springs to mind), but the basic contours were the same. Intel (and later AMD) completely dominated the enterprise CPU market from the 90s onwards through a combination of economies of scale, process and design improvements, and network effects around the x86 ISA 3. There also wasn’t much need to make fundamental changes to chip design as Moore’s Law and Dennard scaling were still more or less a free lunch as far as computer architecture goes. Supercomputer design and integration kept pace with the improvements in low-level electronics engineering, so we could keep building out ever bigger parallel machines and still keep a handle on running them. General-purpose GPUs were a step-change for HPC, but adoption lagged behind CPUs for quite a while (at least in Australia) due to the need for substantial code re-writes to take advantage of them in legacy codes.
That seems to be changing nowadays and we might be seeing the first cracks in the old paradigm. Some things that have caught my interest recently:
- Intel and AMD making increasingly “GPU-like” CPUs, but making different sets of trade-offs to get there.
- AMD and NVIDIA making integrated CPU-GPU “superchips” (MI300A and Grace Hopper respectively), where both host and device share silicon (including a unified, high-bandwidth physical memory address space). In my opinion, we’re probably reaching a practical limit on how many CPU cores we can stuff into a single node 4, so why not use one of the sockets for a GPU instead?
- Custom Arm chips from the likes of Apple, Amazon, Microsoft and Fujitsu/Riken (for the Fugaku supercomputer), each making a different set of performance decisions and targeting different kinds of workloads (rather than trying to be one-size-fits-all).
- NEC making honest-to-goodness vector processors in a PCIe form factor. This is my favourite of the “Back to the Future” machines because I serendipitously learned NEC was making these right after having a great chat with my boss about her experiences programming vector architecture computers back during her PhD. I have no idea if this will have any staying power (my guess is probably not), but it is a wild design and I love it!
- Domain-specific hardware for fields like AI (Google has been doing this for ages with its tensor processing units). Anton continues to be a “fire breathing beast for molecular simulation”, too, showing that this approach potentially has legs for more traditional scientific domains.
- FPGAs getting easier to program, with better support from open-source toolchains. The jury’s still out on whether they’ve finally reached the break-even point in terms of developer productivity vs performance (at least within my specific field), but it’s looking promising?
- RISC-V for HPC, maybe? Hopefully? One day?
I think the underlying factor behind a lot of this is the end of Dennard Scaling and Moore’s Law. Power consumption and the associated problems with heat dissipation are really starting to bite for performance-sensitive machines, so we can’t just throw more transistors at the problem. Instead, system architects are going to have to be more strategic by trying to minimise unnecessary silicon for a given kind of workload, whether by doing the kind of hardware-software co-design we’re seeing at Apple, or by going for full-on domain-specific hardware.
If computers are starting to get weird again, it’ll probably take a while before things settle into a new paradigm. As an HPC application developer, it’s probably going to suck for a while since there won’t be a “default” platform to target. It’s already a hassle to support cross-platform GPU applications given the absurd amount of lock-in between vendors, even though the underlying GPU architectures are fundamentally quite similar. This is probably going to be the norm as we get more variance in the hardware space, so I predict it’ll be a lot of work to maintain cross-platform HPC codes.
But as a weird computer girl, this is awesome and I can’t wait to see what wacky stuff I get to play with next!
The Phi had limited branch-prediction and somewhat higher initial pipeline latency, so you had to oversubscribe the physical cores with multiple hardware threads to hide latency in branchy code. ↩
Not a hard limit in terms of what’s physically possible, but it seems to me that when you consider the practical performance gains for most applications and compare to the extra costs imposed by running on a single node (i.e. memory and cache coherency protocols, and heat dissipation), the juice is probably not worth the squeeze. ↩