Artificial intelligence (AI) is becoming increasingly prevalent in our daily lives, with AI-powered products and services in high demand. This rise in popularity, especially for large language models like ChatGPT and image generation models like Stable Diffusion, has also brought greater attention to the computational and environmental costs associated with AI, particularly in the area of deep learning.
The cost of deep learning is primarily influenced by three factors: the size and structure of the model, the processor it runs on, and how the data is represented. Over the years, AI models have grown larger, with their computational requirements doubling every 6-10 months. Processor performance has improved, but not quickly enough to keep pace with the growing compute demands of AI models. As a result, researchers are exploring ways to optimize data representation to reduce these costs. The choice of data type has significant implications for power consumption, accuracy, and throughput. However, there is no single best data type for AI, because the needs of the training and inference phases of deep learning differ.
Finding the Right Balance: Bit by Bit
One of the key methods for improving AI efficiency is data quantization. Quantization reduces the number of bits used to represent a model's weights (and often its activations), which makes the model smaller, speeds up computation, and lowers power consumption.
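As a rough illustration of the idea, the sketch below (plain NumPy, with illustrative function names) maps FP32 weights onto 8-bit integers using a single per-tensor scale and then converts them back, showing how closely 8 bits can approximate the original values:

```python
import numpy as np

def quantize_int8(weights: np.ndarray):
    """Symmetric per-tensor quantization: map FP32 weights to 8-bit integer codes."""
    scale = np.abs(weights).max() / 127.0            # largest weight maps to the INT8 limit
    codes = np.clip(np.round(weights / scale), -127, 127).astype(np.int8)
    return codes, scale

def dequantize_int8(codes: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP32 values from the INT8 codes."""
    return codes.astype(np.float32) * scale

weights = np.random.randn(4, 4).astype(np.float32)   # stand-in for a trained weight tensor
codes, scale = quantize_int8(weights)
error = np.abs(weights - dequantize_int8(codes, scale)).max()
print(f"worst-case reconstruction error: {error:.4f} (step size {scale:.4f})")
```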
AI models are typically trained using 32-bit floating point (FP32) data. However, it has been found that not all 32 bits are necessary to maintain accuracy. Early successes with the 16-bit floating point (FP16) data type prompted efforts to find the minimum number of bits required to preserve accuracy. Google introduced the 16-bit brain floating point format (BF16), and models being prepared for inference are often quantized to 8-bit floating point (FP8) or integer (INT8) data types.
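The 16-bit formats illustrate the tradeoff: FP16 and BF16 both use 16 bits, but FP16 spends more of them on the mantissa (precision), while BF16 keeps FP32's 8-bit exponent (range). A minimal sketch, assuming NumPy plus the ml_dtypes package for a bfloat16 NumPy dtype:

```python
import numpy as np
import ml_dtypes  # assumed installed; provides a bfloat16 dtype for NumPy

# FP16 = 1 sign + 5 exponent + 10 mantissa bits
# BF16 = 1 sign + 8 exponent +  7 mantissa bits (same exponent width as FP32)

x = np.float32(70000.0)               # larger than FP16's maximum normal value (65504)
print(np.float16(x))                  # inf    -> FP16 overflows
print(x.astype(ml_dtypes.bfloat16))   # 70144  -> BF16 keeps the magnitude, coarsely

y = np.float32(1.0009765625)          # 1 + 2**-10, exactly representable in FP16
print(np.float16(y))                  # 1.001  -> FP16 keeps the fine detail
print(y.astype(ml_dtypes.bfloat16))   # 1      -> BF16 rounds it away
```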
There are two main approaches to quantizing a neural network: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Both reduce numerical precision to improve efficiency, but they differ in timing and implementation, which affects accuracy.
Post-Training Quantization (PTQ) occurs after a model has been trained with higher-precision data (e.g., FP32 or FP16). The model's weights and activations are then converted to lower-precision formats like FP8 or INT8. While this method is simple to apply, it can lead to accuracy loss, especially at very low precision, because the model was never trained to compensate for the errors that quantization introduces.
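As a sketch of what PTQ can look like in practice (plain PyTorch, with illustrative helper names rather than any specific library API): a short calibration pass records the activation range of one layer, and the already-trained weights are mapped to INT8 afterwards, with no retraining involved:

```python
import torch
import torch.nn as nn

def calibrate_activation_scale(model: nn.Module, layer: nn.Module, data: torch.Tensor) -> float:
    """Run calibration data through the trained model and derive an INT8 scale
    from the observed activation range of one layer (illustrative)."""
    max_abs = 0.0
    def hook(_module, _inputs, output):
        nonlocal max_abs
        max_abs = max(max_abs, output.abs().max().item())
    handle = layer.register_forward_hook(hook)
    with torch.no_grad():
        model(data)
    handle.remove()
    return max_abs / 127.0

# A "trained" FP32 model (randomly initialized here, just for the sketch).
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 8))
calib_data = torch.randn(256, 16)

act_scale = calibrate_activation_scale(model, model[0], calib_data)
w = model[0].weight.data
w_scale = w.abs().max().item() / 127.0
w_int8 = torch.clamp(torch.round(w / w_scale), -127, 127).to(torch.int8)
print(f"activation scale: {act_scale:.4f}, weight scale: {w_scale:.4f}")
```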
Quantization-Aware Training (QAT) incorporates quantization into the training process, allowing the model to adapt to the reduced precision. During training, both the forward and backward passes simulate quantized operations, adjusting the model to handle the reduced precision better. QAT generally results in better accuracy than PTQ, but it requires modifications to the training process, making it more complex to implement.
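A simplified QAT sketch, assuming PyTorch and a straight-through estimator for the non-differentiable rounding step (class and helper names are illustrative): the forward pass applies "fake" INT8 quantization to the weights so that the loss already reflects quantization error, while gradients flow back as if no rounding had occurred:

```python
import torch
import torch.nn as nn

class FakeQuant(torch.autograd.Function):
    """Simulate INT8 quantization in the forward pass; pass gradients
    through unchanged in the backward pass (straight-through estimator)."""
    @staticmethod
    def forward(ctx, x, scale):
        return torch.clamp(torch.round(x / scale), -127, 127) * scale
    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None  # no gradient for the scale

class QATLinear(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.linear = nn.Linear(in_features, out_features)
    def forward(self, x):
        w = self.linear.weight
        scale = w.abs().max().detach() / 127.0
        w_q = FakeQuant.apply(w, scale)        # training "sees" the quantized weights
        return nn.functional.linear(x, w_q, self.linear.bias)

# Ordinary training loop; the model learns weights that tolerate quantization.
model = QATLinear(16, 4)
opt = torch.optim.SGD(model.parameters(), lr=1e-2)
x, y = torch.randn(64, 16), torch.randn(64, 4)
for _ in range(100):
    loss = nn.functional.mse_loss(model(x), y)
    opt.zero_grad()
    loss.backward()
    opt.step()
print("final training loss:", loss.item())
```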
The Ongoing 8-bit Debate
The AI industry has largely settled on two primary candidates for quantized data types: INT8 and FP8, with hardware vendors taking sides. In 2022, a paper from Graphcore and AMD proposed an IEEE standard for FP8, and Intel, Nvidia, and Arm later followed with a joint proposal of their own. Other companies, like Qualcomm and Untether AI, have also explored FP8 and compared it to INT8. However, the debate over which data type is best for AI is far from settled. There is no universal answer: the choice between INT8 and FP8 often depends on the specific hardware and model architecture, as well as performance and accuracy requirements.
Integer vs. Floating Point
The key difference between floating point and integer data types lies in how they represent numbers. Floating point types are used for real numbers, which include both integers and fractions. These numbers can be represented in scientific notation, with a mantissa and exponent.
Integer types, on the other hand, represent whole numbers with no fractional part. This difference in representation leads to significant differences in precision and range: floating point numbers cover a much wider dynamic range, while integers offer a narrower range with fixed, uniform precision.
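To make the mantissa-and-exponent description concrete, the short sketch below unpacks a normal FP32 value into its sign, exponent, and mantissa fields and reconstructs it from them (subnormals, infinities, and NaNs are ignored for simplicity):

```python
import struct

def decode_fp32(x: float):
    """Split a normal FP32 value into its sign, exponent, and mantissa fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]
    sign     = bits >> 31
    exponent = (bits >> 23) & 0xFF   # 8 exponent bits, bias 127
    mantissa = bits & 0x7FFFFF       # 23 mantissa bits (implicit leading 1)
    value = (-1) ** sign * (1 + mantissa / 2**23) * 2 ** (exponent - 127)
    return sign, exponent, mantissa, value

print(decode_fp32(6.5))  # (0, 129, 5242880, 6.5): 6.5 = +1.625 * 2**2
```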
For Training: During training, the primary goal is to update the model’s parameters through optimization, which requires a higher dynamic range to accurately propagate gradients and ensure the model converges. For this reason, floating point representations like FP32, FP16, and even FP8 are typically used during training to maintain sufficient dynamic range.
For Inference: Inference focuses on efficiently evaluating the trained model on new data, with priorities on minimizing computational complexity, memory usage, and energy consumption. In this phase, lower-precision representations like INT8 or FP8 are more suitable, as they reduce the computational burden while still maintaining enough accuracy for real-time performance.
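The contrast shows up directly in how small values behave. In the snippet below (PyTorch), a gradient-sized number underflows to zero in FP16 but survives in BF16, which shares FP32's 8-bit exponent; this is why training leans on formats with enough exponent range, while inference can tolerate much narrower types:

```python
import torch

small_grad = 1e-8  # gradients this small are common in deep networks

# FP16 has only 5 exponent bits, so the value underflows to zero.
print(torch.tensor(small_grad, dtype=torch.float16))   # tensor(0., dtype=torch.float16)

# BF16 has 8 exponent bits (like FP32), so the value survives, just coarsely rounded.
print(torch.tensor(small_grad, dtype=torch.bfloat16))  # ~1e-08
```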
Which Data Type for Inference?
The best data type for inference depends on the application and the hardware used. For real-time and mobile applications, smaller 8-bit data types are often preferred because they reduce memory usage, processing time, and power consumption, while still providing enough accuracy.
FP8 is gaining popularity, with major hardware vendors and cloud service providers integrating it into their deep learning solutions. FP8 comes in several variations, defined by how its seven non-sign bits are split between exponent and mantissa. The different splits trade dynamic range against precision. FP8 E3M4, with 3 exponent bits and 4 mantissa bits, has a small dynamic range but the greatest precision. FP8 E4M3 adds an exponent bit to extend the range, while FP8 E5M2 provides the widest dynamic range, making it well suited to training.
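Assuming a plain IEEE-754-style convention in which the all-ones exponent code is reserved for infinities and NaNs (widely deployed variants such as E4M3FN relax this and reach a maximum of 448), the largest normal value of each FP8 variant follows directly from its bit split:

```python
def max_normal(exp_bits: int, man_bits: int) -> float:
    """Largest normal value of a binary float format, assuming an IEEE-754-style
    layout where the all-ones exponent code is reserved for infinities/NaNs."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exponent = (2 ** exp_bits - 2) - bias   # largest usable exponent
    max_mantissa = 2 - 2 ** -man_bits           # 1.111...1 in binary
    return max_mantissa * 2 ** max_exponent

for name, e, m in [("E3M4", 3, 4), ("E4M3", 4, 3), ("E5M2", 5, 2)]:
    print(f"FP8 {name}: max normal value ~ {max_normal(e, m)}")
# FP8 E3M4: max normal value ~ 15.5
# FP8 E4M3: max normal value ~ 240.0  (448 for the E4M3FN variant)
# FP8 E5M2: max normal value ~ 57344.0
```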
INT8, by comparison, has no exponent bits at all, which limits its dynamic range but gives it uniform precision across that range. Whether FP8 or INT8 delivers better accuracy depends on the AI model, and power efficiency depends on the specific hardware. Research from Untether AI suggests that FP8 offers better accuracy and performance, while Qualcomm found that INT8 can be more efficient on its hardware. Ultimately, the choice between FP8 and INT8 comes down to the hardware's capabilities and the specific needs of the model.
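One practical difference is that INT8 spreads its 256 codes uniformly across its range, while FP8 concentrates them near zero, where most weights and activations lie. The sketch below (NumPy plus the ml_dtypes package for an E4M3 dtype; real deployments would typically also apply a per-tensor scale before the FP8 cast) compares the round-trip error of the two formats on a bell-shaped tensor; which one comes out ahead depends on the value distribution, echoing the model-dependence noted above:

```python
import numpy as np
import ml_dtypes  # assumed installed; provides float8_e4m3fn as a NumPy dtype

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000).astype(np.float32)  # bell-shaped, like typical weights

# INT8: one uniform step size over the whole range.
scale = np.abs(x).max() / 127.0
x_int8 = np.clip(np.round(x / scale), -127, 127) * scale

# FP8 E4M3: finer steps near zero, coarser steps for large magnitudes.
x_fp8 = x.astype(ml_dtypes.float8_e4m3fn).astype(np.float32)

print("INT8 mean abs error:", np.abs(x - x_int8).mean())
print("FP8  mean abs error:", np.abs(x - x_fp8).mean())
```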