by Raghav Garg, MTS, Inflection AI
Inflection AI recently partnered closely with Intel to port our LLM inference stack to the Gaudi accelerator (HPU). In adapting and optimizing our custom inference runtime, we worked through the many hard challenges of moving from an NVIDIA-based ecosystem to an alternate accelerator and of building a hardware-flexible software stack, all while maintaining high performance standards.
It's all about the Ops
Mapping AI architectures onto the underlying accelerator hardware is key to achieving efficient and scalable AI workloads. The number of PyTorch ops has ballooned to over 2,000 over the last 10 years, and SynapseAI, the software backend for Gaudi, supports a subset from which the remaining ops can be expressed. While much of Inflection’s model architecture is natively supported by the Habana software stack, operations like pythonic (NumPy-style) tensor slicing and more obscure ones like torch.triu_indices were not. Some of the slicing ops on large tensors resulted in hard-to-diagnose segfaults (a case where print statements >> profiling), while unsupported ops fell back implicitly to the CPU. Although this workaround allowed our model to run to completion, it came at a huge performance cost: transferring tensors to the CPU, executing there, and moving data back to the HPU introduced latency that dwarfed native HPU compute times. In both cases, our solution was to rewrite the unsupported operations in terms of others that ran performantly on the HPU. After doing so, we saw roughly a 15x speedup (and, of course, no segfaults). At this point in the project we were still not at the performance we saw from running the same PyTorch inference runtime on an H100, but we now saw a path to get there.
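As an illustration of this kind of rewrite, the sketch below reimplements torch.triu_indices from broadly supported primitives (arange, a broadcast comparison, and nonzero). The triu_indices_fallback helper is hypothetical rather than our production code, but it shows the pattern: express the missing op through ops the backend already runs natively.

```python
import torch

def triu_indices_fallback(n: int, m: int, offset: int = 0, device=None):
    """Sketch of rewriting torch.triu_indices with broadly supported ops.

    Builds the upper-triangular mask from a broadcast comparison of arange
    tensors, then recovers the (row, col) index pairs in the same layout
    that torch.triu_indices returns.
    """
    rows = torch.arange(n, device=device).unsqueeze(1)   # shape (n, 1)
    cols = torch.arange(m, device=device).unsqueeze(0)   # shape (1, m)
    mask = cols - rows >= offset                          # upper-triangular mask, shape (n, m)
    idx = mask.nonzero(as_tuple=False)                    # (k, 2) row/col pairs, row-major order
    return idx.t()                                        # (2, k), matching triu_indices

# Quick check against the reference implementation on CPU
ref = torch.triu_indices(5, 7, offset=1)
alt = triu_indices_fallback(5, 7, offset=1)
assert torch.equal(ref, alt)
```

Only the final nonzero() has a data-dependent output shape; everything before it runs on the accelerator with static shapes.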
Execution Frameworks: The Eager brown fox jumps over the Lazy dog
The lazy vs. eager dichotomy is found in many areas of computing, and many AI software and hardware stacks try, with varying degrees of success, to support both. When we first brought the model up on Gaudi, we started with Eager execution mode (the default PyTorch behavior) for simplicity. In Eager mode, each operation is executed one-by-one on the HPU as it is encountered. This op-by-op execution worked out of the box but, as mentioned above, resulted in higher latency than on NVIDIA. Like many newer AI chips, Gaudi hardware is optimized for graph execution, and launching ops one at a time incurs Python overhead and reduces opportunities for fusion.

However, switching to Lazy execution, Gaudi’s default optimized mode, came with its own challenge: dynamic ops. Dynamic operations, such as data-dependent branching or variable input tensor shapes, break the lazy graph of operations, introducing host overhead from re-accumulating operations and generating new graphs. As a result, the naive Lazy mode implementation was twice as slow as Eager mode. After iteratively identifying and removing every dynamic operation causing graph breaks in our forward pass, we then integrated HPU graphs. Analogous to CUDA graphs, HPU graphs enable recording and replaying computation graphs directly on the HPU without involving the host. Since they operate on fixed input tensor shapes, we used a bucketing strategy with padding to map variable input shapes onto the shapes of our cached HPU graphs (see the sketch below). Through these optimizations, we achieved a 4x speedup, rivaling our model’s performance on NVIDIA with hardware-specific kernels.
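To make the bucketing concrete, here is a minimal sketch of the padding side of that strategy. The bucket sizes and the pad_to_bucket helper are illustrative, not our production values; the idea is simply that every request is padded up to one of a small, fixed set of lengths so that a pre-recorded graph for that shape can be replayed.

```python
import torch
import torch.nn.functional as F

# Hypothetical bucket sizes; real values would be tuned to the serving traffic.
BUCKET_SIZES = (128, 256, 512, 1024, 2048)

def pad_to_bucket(input_ids: torch.Tensor, pad_token_id: int) -> torch.Tensor:
    """Pad a batch of token ids up to the nearest bucket length so every
    request hits one of a small, fixed set of shapes. A graph recorded for
    that shape (an HPU graph here, a CUDA graph on NVIDIA) can then be
    replayed without recompilation."""
    seq_len = input_ids.shape[-1]
    bucket = next((b for b in BUCKET_SIZES if b >= seq_len), None)
    if bucket is None:
        raise ValueError(f"sequence length {seq_len} exceeds the largest bucket")
    return F.pad(input_ids, (0, bucket - seq_len), value=pad_token_id)

# Example: a 300-token prompt lands in the 512 bucket.
ids = torch.randint(0, 32000, (1, 300))
assert pad_to_bucket(ids, pad_token_id=0).shape[-1] == 512
```

The trade-off is wasted compute on padding versus avoided recompilation; in practice the bucket boundaries are tuned against the observed distribution of sequence lengths.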
Profilers, Profilers, Profilers
Observing AI workloads on the accelerator is critical to identifying compute and memory bottlenecks during execution. Somewhat analogously to the H100, which can overlap TMA-based GEMMs on the Tensor Cores with exponential computations handled by the special multi-function unit, on Gaudi we concurrently use the MME for GEMM operations and a separate Tensor Processing Core (TPC) for element-wise operations. Keeping both of these units fed is necessary to approach the theoretical maximum throughput of the accelerator. To further boost performance, we worked with the Intel Habana team to split our attention mechanism into two parts: while the MME handles the heavy matrix multiplications for the current layer’s attention computation, the TPC simultaneously begins work on the next layer. Breaking complex, monolithic kernels into smaller operations that provide data parallelism to feed all hardware units proved to be a delicate but very useful design.
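For reference, the snippet below is roughly how a workload can be inspected with the standard torch.profiler. The habana_frameworks import and the optional HPU activity are assumptions based on the Gaudi PyTorch bridge and its profiling documentation, so treat this as a sketch rather than a verbatim recipe.

```python
import torch
import habana_frameworks.torch.core as htcore  # Gaudi PyTorch bridge (assumed available)

# Assumption: the Gaudi bridge exposes an HPU activity to torch.profiler;
# fall back to CPU-only profiling if it does not.
activities = [torch.profiler.ProfilerActivity.CPU]
if hasattr(torch.profiler.ProfilerActivity, "HPU"):
    activities.append(torch.profiler.ProfilerActivity.HPU)

model = torch.nn.Linear(4096, 4096).to("hpu").eval()
x = torch.randn(8, 4096, device="hpu")

with torch.profiler.profile(activities=activities, record_shapes=True) as prof:
    with torch.no_grad():
        y = model(x)
    htcore.mark_step()  # flush the lazy-mode graph so the ops actually execute

# Sorting by CPU time makes CPU-resident ops and host-side gaps easy to spot.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=20))
```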
Looking Forward
The success of adapting our large language model inference stack to Intel’s Gaudi accelerator underscores a broader principle: hardware design patterns tend to recur across different accelerators, so the insights gained from optimizing one platform guide and streamline our efforts on every new architecture. Through careful rewrites of unsupported operations, management of dynamic shapes, and thorough profiling, we developed methods that scale with model sizes ranging from 1B to over 100B parameters. Although our immediate focus was Gaudi, many of the solutions described above, such as minimizing CPU fallbacks and splitting attention workloads, apply to a variety of accelerators, and of course we are excited to build on future architectures as well. We remain dedicated to improving performance and flexibility, and we welcome those who share our vision of using new, more efficient hardware to advance large-scale inference. Together, we can expand the boundaries of what’s possible across an ever-evolving landscape of AI accelerators. We’d love to hear from others navigating similar paths: share your experiences, challenges, and wins with alternative accelerators in the comments!