Inside Amazon’s Chip Lab: Where Trainium Gets Built
Amazon’s custom AI chip program started in 2015 when the company acquired Israeli chip designer Annapurna Labs for $350 million. More than 10 years later, that team has produced Graviton CPUs, Inferentia inference chips, and the Trainium series that now powers Project Rainier, one of the world’s largest AI compute clusters. The process of building AI chips from design to production involves an intense phase called the “bring-up,” where engineers work around the clock to activate a new chip for the first time. Here is how Amazon builds and tests the hardware behind its AI ambitions.
What Happens During a Chip Bring-Up
- Engineers work 24/7 for three to four weeks to verify each new chip design
- The Trainium3 bring-up required emergency grinding of a heat sink with the wrong dimensions
- The chip is a 3-nanometer design manufactured by TSMC
- Liquid cooling replaced air cooling for energy efficiency
- Engineers solder tiny integrated circuit components under a microscope
The Bring-Up: A Silicon Lock-In Party
After 18 months of design work, the chip is activated for the first time. Lab director Kristopher King described it as “a big overnight party, like a lock-in.” The team stays in the lab, testing, fixing, and verifying until the chip works as designed.
The Trainium3 prototype was originally air-cooled. The production chip uses liquid cooling, which required different dimensions for the chip’s attachment to its heat sink. During the bring-up, those dimensions turned out to be wrong, so the chip could not be activated.
“The team immediately got a grinder and just started grinding off the metal,” King said. To avoid disrupting the pizza party atmosphere, they snuck off and did the grinding in a conference room.
Micro-Welding Under a Microscope
The lab includes a welding station where hardware engineers solder tiny integrated circuit components through a microscope. Isaac Guevara, a hardware lab engineer and master welder, demonstrated the process during a lab tour. The work is so precise that senior leader Mark Carroll openly admitted he could not do it, drawing laughter from the engineering team.
The lab also houses both custom-made and commercial tools for testing and analyzing chip components. Signal engineers use specialized equipment to test each tiny component on the chip individually, isolating potential issues before mass production begins.
From Lab to Data Center
The team designs not just the chip but the entire server that hosts it. “Sleds” are trays that hold Trainium AI chips, Graviton CPUs, and supporting boards. Stack them on a rack with custom Neuron networking switches and you get the systems that power Anthropic’s Claude.
Amazon also operates a private data center for quality and testing. Located near the Austin lab, it does not run customer workloads. Security is tight, with strict protocols for entering the building. The cooling system is so loud that earplugs are mandatory, and the air has the acrid smell of heated metal.
What Comes After Trainium3
The team is already working on Trainium4. With OpenAI, Anthropic, and Apple as customers, the pressure to deliver is intense. Amazon CEO Andy Jassy called Trainium a multibillion-dollar business and one of the AWS technologies he is most excited about. Each new generation needs to improve on cost, performance, and power efficiency. The bring-up cycle will repeat, and the engineers will be back in the lab, grinding metal and eating pizza until the next chip comes to life.