At QCon San Francisco 2024, Meta’s Ye (Charlotte) Qi spoke about the challenges of running large language models (LLMs) at scale. Her presentation examined what it takes to operate these massive models in real-world systems, from their sheer size and complex hardware requirements to the demands of production environments.
Qi compared the current AI boom to an “AI Gold Rush”: innovation is moving fast, but significant obstacles remain. She explained that deploying LLMs effectively goes beyond simply fitting them onto existing hardware. The goal is to maximize performance while keeping costs manageable, which requires close collaboration between infrastructure and model development teams.
Fitting LLMs to the Hardware
One of the main challenges with LLMs is their high demand for resources. Many models are simply too large for a single GPU to handle. To solve this, Meta uses techniques like splitting the model across multiple GPUs with tensor and pipeline parallelism. Qi stressed the importance of understanding hardware limitations, noting that mismatches between model design and available resources can severely impact performance.
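The talk did not include code, but the core idea of tensor parallelism can be illustrated with a minimal NumPy sketch: a layer’s weight matrix is split column-wise across devices, each device computes a partial output, and the shards are gathered back together. The array sizes, the two-way split, and the use of NumPy here are illustrative assumptions, not Meta’s implementation.

```python
import numpy as np

# Toy column-parallel linear layer: the weight matrix is split along its
# output dimension across two "devices", each computes a partial result,
# and the shards are concatenated (the all-gather step in a real system).
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 1024))           # a batch of 4 token embeddings
W = rng.standard_normal((1024, 4096))        # full weight, too large for one "device" in this toy

W_shards = np.split(W, 2, axis=1)            # one shard per device
partial_outputs = [x @ shard for shard in W_shards]   # computed in parallel in practice
y = np.concatenate(partial_outputs, axis=1)  # gather the shards into the full output

assert np.allclose(y, x @ W)                 # matches the unsharded computation
```

Pipeline parallelism complements this by placing whole groups of layers on different GPUs and streaming activations between them, which is why understanding how a model maps onto the available hardware matters so much.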
Her advice was clear: be strategic. She recommended choosing a runtime specialized for inference serving and understanding your AI problem in depth before deciding which optimizations to apply.
Maximizing Speed and Responsiveness
Speed and responsiveness are crucial for applications that depend on real-time results. Qi highlighted methods like continuous batching to keep systems running smoothly, and quantization, which reduces model precision to make better use of hardware. These adjustments, she noted, can significantly boost performance, sometimes doubling or even quadrupling it.
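As an illustration of the quantization idea (not code from the talk), the sketch below shows post-training symmetric int8 quantization: weights are stored as 8-bit integers with a single scale factor and dequantized when used, cutting memory by 4x at the cost of a small approximation error. Production systems typically use per-channel scales and fused low-precision kernels rather than this per-tensor toy.

```python
import numpy as np

# Toy symmetric int8 quantization of a weight tensor: store 8-bit integers
# plus one float scale instead of 32-bit floats, then dequantize at use time.
rng = np.random.default_rng(0)
w = rng.standard_normal((1024, 1024)).astype(np.float32)

scale = np.abs(w).max() / 127.0                       # map the largest weight to +/-127
w_int8 = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
w_dequant = w_int8.astype(np.float32) * scale         # approximate reconstruction

print("memory: %.1fx smaller" % (w.nbytes / w_int8.nbytes))   # 4.0x
print("max abs error:", np.abs(w - w_dequant).max())
```

Continuous batching works on the scheduling side instead: rather than waiting for a full batch to finish, new requests are merged into the running batch as earlier ones complete, keeping the GPUs busy between arrivals.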
From Prototypes to the Real World
Taking an LLM from a lab environment to full production is a major challenge. Real-world conditions bring unpredictable workloads and strict requirements for speed and reliability. Scaling is more than just adding GPUs; it’s about carefully balancing cost, performance, and reliability.
Meta addresses these issues with techniques such as disaggregated deployments, caching systems that prioritize frequently accessed data, and request scheduling to keep operations efficient. Qi pointed out that consistent hashing, which routes related requests to the same server even as hosts are added or removed, has been especially helpful in improving cache performance.
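The talk did not show an implementation, but the routing idea behind consistent hashing can be sketched with a small hash ring: server names and request keys are hashed onto the same ring, and each key is served by the next server clockwise, so repeat requests for the same conversation land on the same cache. The class, server names, and key format below are illustrative assumptions.

```python
import bisect
import hashlib

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: adding or removing a server only remaps the
    keys nearest to it, so most cached entries stay on the same host."""

    def __init__(self, servers, replicas=100):
        # Place each server at several points on the ring to even out load.
        self._points = sorted(
            (_hash(f"{server}#{i}"), server)
            for server in servers
            for i in range(replicas)
        )
        self._hashes = [h for h, _ in self._points]

    def route(self, key: str) -> str:
        # First ring point clockwise from the key's hash, wrapping around.
        idx = bisect.bisect(self._hashes, _hash(key)) % len(self._points)
        return self._points[idx][1]

ring = HashRing(["cache-a", "cache-b", "cache-c"])
print(ring.route("user-42:conversation-7"))   # the same key always lands on the same host
```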
The Importance of Automation
Given the complexity of these systems, automation plays a crucial role in managing them. Meta uses tools to monitor performance, optimize resource usage, and streamline scaling decisions. Qi mentioned that Meta’s custom deployment solutions allow the company to adjust services in real time, keeping costs under control while responding to changing demands.
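Meta’s internal tooling was not described in detail. As a generic illustration of the kind of decision such automation makes, here is a proportional-scaling heuristic of the sort horizontal autoscalers commonly use; the target utilization and replica bounds are made-up parameters, not values from the talk.

```python
import math

def desired_replicas(current: int, utilization: float,
                     target: float = 0.6, min_r: int = 2, max_r: int = 64) -> int:
    """Toy proportional autoscaling rule: size the fleet so that average
    utilization moves toward the target, within fixed bounds."""
    if utilization <= 0:
        return min_r
    scaled = math.ceil(current * utilization / target)
    return max(min_r, min(max_r, scaled))

print(desired_replicas(current=8, utilization=0.9))   # -> 12 replicas
```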
Looking at the Bigger Picture
For Qi, scaling AI systems is not just a technical challenge but a mindset. She encouraged companies to step back, consider the bigger picture, and ask what truly matters for long-term value, so that effort goes into refining systems and improving real-world impact.
Her main takeaway was clear: succeeding with LLMs requires more than just technical expertise in model and infrastructure development—though these are critical. It’s also about strategy, teamwork, and delivering real-world results.