Artificial intelligence (AI) has become an integral part of our everyday lives, with AI-powered services and products seeing a huge surge in demand. This has been especially true for large language models like ChatGPT and image generation tools like Stable Diffusion. However, this rise in popularity has also led to a closer examination of the computational and environmental costs, particularly in the area of deep learning.
The main factors contributing to the high costs of deep learning are the size and complexity of the models, the type of processor used, and the way data is represented. Over the past decade, the size of AI models has been growing rapidly, with compute requirements doubling every 6 to 10 months. While processor power has improved, it hasn’t kept pace with the growing compute demands of the latest AI models. This has prompted researchers to explore ways to optimize data representation, as choosing the right data type can significantly affect a model’s power consumption, accuracy, and throughput. However, the ideal data type for AI depends on whether you’re in the training phase or the inference phase of deep learning.
Finding the Balance: Bit by Bit
To make AI more efficient, one approach is to reduce the number of bits used to represent the data, a process known as quantization. By lowering the number of bits, you not only make the model smaller but also reduce computation time, which in turn reduces the power needed for processing. This is an important technique for anyone working on efficient AI systems.
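As a rough illustration, here is a minimal NumPy sketch of symmetric per-tensor quantization to INT8. The function names, the random weight tensor, and the per-tensor scale choice are illustrative assumptions, not any particular library’s API:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Map floats onto the symmetric INT8 grid [-127, 127] with one per-tensor scale."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

w = np.random.randn(4, 4).astype(np.float32)   # stand-in for a trained weight tensor
w_q, scale = quantize_int8(w)
print(np.abs(w - dequantize(w_q, scale)).max())  # reconstruction error, at most ~scale / 2
```

The reconstruction error is bounded by roughly half the scale, which is why tensors with large outliers are often quantized with per-channel rather than per-tensor scales in practice.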
AI models are typically trained using 32-bit floating point (FP32) data, but it turns out that not all 32 bits are necessary to maintain accuracy. Using 16-bit floating point (FP16) data types has shown promise, leading to efforts to find the minimum number of bits required for a model to remain accurate. Google developed the 16-bit brain float (BF16), and models deployed for inference are often quantized further to 8-bit floating point (FP8) or integer (INT8) data types. There are two main methods for quantizing a neural network: Post-Training Quantization (PTQ) and Quantization-Aware Training (QAT). Both aim to improve computational efficiency, memory usage, and energy consumption, but they differ in how they apply quantization and how it affects model accuracy.
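Before comparing the two, it helps to see the range versus precision trade-off between the 16-bit formats concretely. A quick PyTorch check (assuming PyTorch is available; the sample values are arbitrary) illustrates why BF16 exists:

```python
import torch

x = torch.tensor([0.1, 1e-5, 70000.0], dtype=torch.float32)

# FP16 keeps more mantissa bits but its largest normal value is 65504,
# so 70000 overflows to inf
print(x.to(torch.float16))

# BF16 keeps FP32's 8 exponent bits (same dynamic range) but only 7 mantissa bits,
# so 70000 survives while 0.1 is represented more coarsely
print(x.to(torch.bfloat16))
```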
Post-Training Quantization (PTQ) is applied after a model has been trained with higher-precision data (like FP32 or FP16). This process reduces the model’s precision by converting its weights and activations to lower-precision formats, like FP8 or INT8. While PTQ is relatively simple to implement, it can result in accuracy loss, particularly in low-precision formats, as the model wasn’t trained to handle these quantization errors.
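In PyTorch, for example, the simplest flavor of PTQ is dynamic quantization, which converts the weights of selected layer types to INT8 after training. The toy model below is just a stand-in for a real trained network:

```python
import torch

# A small "trained" FP32 model standing in for the real network
model_fp32 = torch.nn.Sequential(
    torch.nn.Linear(128, 256),
    torch.nn.ReLU(),
    torch.nn.Linear(256, 10),
).eval()

# Post-training quantization: Linear weights are converted to INT8 after training,
# with no retraining required
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {torch.nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
print(model_int8(x).shape)
```

Static PTQ, which also quantizes activations, adds a calibration pass over representative data but follows the same post-hoc idea.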
Quantization-Aware Training (QAT), on the other hand, incorporates quantization during the training process itself, allowing the model to adapt to lower precision. By simulating quantized operations during training, the model learns to handle the reduced precision more effectively. While QAT typically results in better accuracy than PTQ, it requires changes to the training process and can be more complex to implement.
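Frameworks ship QAT tooling (for example, PyTorch’s torch.ao.quantization prepare/convert flow), but the core idea fits in a short hand-rolled sketch: fake-quantize the weights in the forward pass and let gradients pass straight through to the full-precision copies. The class names and the per-tensor scale choice below are illustrative assumptions, not a library API:

```python
import torch

class FakeQuant(torch.autograd.Function):
    """Simulate INT8 rounding in the forward pass; pass gradients straight through."""
    @staticmethod
    def forward(ctx, x, scale):
        return torch.clamp(torch.round(x / scale), -127, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output, None   # straight-through estimator

class QATLinear(torch.nn.Linear):
    def forward(self, x):
        scale = self.weight.detach().abs().max() / 127.0
        w_q = FakeQuant.apply(self.weight, scale)   # train against quantized weights
        return torch.nn.functional.linear(x, w_q, self.bias)

layer = QATLinear(16, 4)
out = layer(torch.randn(8, 16))
out.sum().backward()   # gradients still flow to the underlying FP32 weights
```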
The 8-bit Debate
In the AI industry, two primary data types have emerged as contenders for quantization: INT8 and FP8. Different hardware vendors have shown strong preferences for one or the other. In mid-2022, Graphcore and AMD proposed an IEEE standard for FP8, and shortly after, Intel, Nvidia, and Arm joined with similar proposals. Other companies, including Qualcomm and Untether AI, have also explored the merits of FP8 versus INT8. While the debate is ongoing, the choice between these two data types largely depends on the specific AI model and the hardware used.
Integer vs. Floating Point
The distinction between floating point and integer data types lies in how they represent numbers. Floating point data types represent real numbers, fractions included, in a form akin to scientific notation: a sign, a mantissa, and an exponent.
Integer data types, on the other hand, represent whole numbers only, with no fractional part. For a given bit width, this means floating point numbers cover a wider dynamic range, while integers cover a narrower range with uniform spacing, giving them more precision within that range.
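A small Python example makes the difference concrete: a float stores a sign, an exponent, and a mantissa, while an 8-bit integer is just a signed count with no exponent at all (the sample value 6.75 is arbitrary):

```python
import math
import struct

import numpy as np

x = 6.75
mantissa, exponent = math.frexp(x)   # x == mantissa * 2**exponent
print(mantissa, exponent)            # 0.84375, 3

# The same value as raw FP32 bits: 1 sign bit, 8 exponent bits, 23 mantissa bits
bits = struct.unpack(">I", struct.pack(">f", x))[0]
print(f"{bits:032b}")                # 0 10000001 10110000000000000000000 (spaces added here)

# INT8 has no exponent: just 256 evenly spaced whole numbers
print(np.iinfo(np.int8).min, np.iinfo(np.int8).max)   # -128 127
```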
Integer vs. Floating Point for Training
During the training phase of deep learning, the focus is on optimizing the model’s parameters, and this requires a higher dynamic range to accurately propagate gradients and achieve convergence. As such, floating point representations like FP32, FP16, and even FP8 are preferred during training to maintain a sufficient range.
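This is why mixed-precision training pairs low-precision arithmetic with explicit range management. A minimal PyTorch sketch (it assumes a CUDA GPU is available; the model and data are placeholders) shows the pattern: compute in FP16 under autocast, and scale the loss so small gradients don’t underflow the FP16 range:

```python
import torch

model = torch.nn.Linear(512, 512).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()   # rescales the loss so tiny FP16 gradients don't underflow

for _ in range(10):
    x = torch.randn(32, 512, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():    # matmuls run in reduced precision, reductions stay in FP32
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
```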
Integer vs. Floating Point for Inference
The inference phase is about efficiently applying the trained model to new data. In this phase, the focus shifts to minimizing computational complexity, memory usage, and energy consumption. This is where lower-precision data types like INT8 and FP8 come into play. For real-time applications and mobile services, the smaller INT8 data type is often the best choice, as it reduces memory and compute time while still offering enough accuracy for effective results.
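For static INT8 inference, the missing ingredient is a calibration pass that picks a scale and zero point for the activations from representative data. A minimal asymmetric-quantization sketch (the ReLU-style activation range and helper names are illustrative assumptions):

```python
import numpy as np

def calibrate_uint8(activations: np.ndarray):
    """Derive an asymmetric scale/zero-point from an observed activation range."""
    lo, hi = float(activations.min()), float(activations.max())
    scale = (hi - lo) / 255.0
    zero_point = int(round(-lo / scale))
    return scale, zero_point

def quantize(x: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    return np.clip(np.round(x / scale) + zero_point, 0, 255).astype(np.uint8)

acts = np.random.rand(1000).astype(np.float32) * 6.0   # stand-in for ReLU-style activations
scale, zp = calibrate_uint8(acts)
print(quantize(acts, scale, zp)[:5], scale, zp)
```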
FP8 and INT8 for Inference
FP8 is becoming more widely adopted, and major hardware vendors and cloud providers are incorporating it into their deep learning platforms. There are several variants of FP8, named by their bit split: ExMy means x exponent bits and y mantissa bits, plus one sign bit, and each split trades precision against dynamic range. FP8 E3M4, for example, has a smaller dynamic range but higher precision, while FP8 E4M3 offers a greater dynamic range by sacrificing some precision. FP8 E5M2 has the highest dynamic range of the three, making it well suited to training, which requires a larger range.
INT8, by contrast, has a smaller dynamic range but more precision: 1 sign bit and 7 value bits, with no exponent at all. Whether FP8 or INT8 is better for a specific model depends on the hardware and the performance goals. Research from Untether AI suggests that FP8 outperforms INT8 in terms of accuracy, performance, and efficiency on their hardware. Qualcomm, on the other hand, has found that while FP8 may offer higher accuracy, it doesn’t justify the loss in efficiency compared to INT8 on their hardware.
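Whatever the hardware, the arithmetic behind these range trade-offs is easy to check. The sketch below estimates the largest representable magnitude of each 8-bit format under a generic IEEE-style layout (bias = 2^(E-1) - 1, top exponent reserved for infinities and NaN); note that the published FP8 specifications tweak the special-value encodings, so for instance the widely used E4M3 variant reaches 448 rather than the 240 this naive formula gives:

```python
def fp_max_normal(exp_bits: int, man_bits: int) -> float:
    """Largest normal value for a 1-sign/exp_bits/man_bits float under IEEE-style rules
    (all-ones exponent reserved). Real FP8 specs differ slightly in their special values."""
    bias = 2 ** (exp_bits - 1) - 1
    max_exp = (2 ** exp_bits - 2) - bias     # top exponent code is reserved
    max_mantissa = 2 - 2 ** (-man_bits)      # 1.111...1 in binary
    return max_mantissa * 2 ** max_exp

for name, e, m in [("E3M4", 3, 4), ("E4M3", 4, 3), ("E5M2", 5, 2)]:
    print(name, fp_max_normal(e, m))         # 15.5, 240.0, 57344.0

print("INT8", -128, 127)                     # no exponent: 256 evenly spaced integers
```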
Ultimately, the decision of which data type to use for quantization in inference depends on several factors: the model’s requirements, the hardware capabilities, and the trade-offs between accuracy and efficiency.