DNN Quantization Basics

This tutorial covers the basic concepts of quantization in deep learning and the steps for converting an existing network to a quantized version. Some quantization-related APIs will also be introduced.

What is Quantization

Quantization is a method that converts the data in DNNs from floating-point to integer representations, which reduces the memory bandwidth and computing resources required during inference.

Why Quantization

The most common data format in DNN training is Float32, which uses 32 bits to represent a real number. However, thanks to the strong robustness brought by millions or even billions of parameters, DNNs do not need such high precision to represent a single number during storage or inference. Quantization therefore brings many advantages while only slightly affecting the accuracy of DNNs.

How to Realize Quantization

Realizing quantization can be viewed from two aspects: how to quantize the whole network, and how to quantize a single tensor.

Ways of Quantizing a Network

There are four ways to handle this task:

  • Weight-Only Quantization (WOQ)

  • Post-Training Dynamic Quantization (PTDQ)

  • Post-Training Static Quantization (PTSQ)

  • Quantization-Aware Training (QAT)

Weight-Only Quantization (WOQ)

WOQ is the most basic method to quantize a network. It only changes the storage format of the weights: after the weights are loaded and dequantized, all later computations are still performed in floating point.
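A minimal sketch of this idea, assuming int8 weight storage with a simple symmetric scale (the helper woq_linear, the scale choice, and the shapes are purely illustrative):

import torch

def woq_linear(x_fp32, w_int8, w_scale, w_zero_point):
    # Weight-only quantization: weights are stored as int8 and dequantized
    # back to float32 right before the matmul, so compute stays in fp32.
    w_fp32 = (w_int8.to(torch.float32) - w_zero_point) * w_scale
    return x_fp32 @ w_fp32.t()

# Quantize a float weight matrix offline, then run WOQ inference
w = torch.randn(16, 8)
scale = w.abs().max() / 127
w_q = torch.clamp(torch.round(w / scale), -128, 127).to(torch.int8)
y = woq_linear(torch.randn(4, 8), w_q, scale, w_zero_point=0.0)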

Post-Training Dynamic Quantization (PTDQ)

This is the easiest way to accomplish quantization. In this method, weights are stored as quantized values, and the matrix multiplication between inputs and weights is also carried out in a quantized manner. However, activations are quantized dynamically during inference or not quantized at all, and the outputs remain in the original data format. The following diagram shows the whole procedure.

# Original model
# All tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                 /
linear_weight_fp32

# Dynamically quantized model
# Linear and LSTM weights are in int8
previous_layer_fp32 -- linear_int8_w_fp32_inp -- activation_fp32 -- next_layer_fp32
                     /
   linear_weight_int8

This method is suitable for networks that need massive memory and bandwidth to store and transfer weights while the computational burden is relatively small, such as LSTM and Transformer models.
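With PyTorch's eager-mode API, dynamic quantization is a one-call conversion; the toy model below is just a stand-in for a Linear-heavy workload:

import torch
import torch.nn as nn

# A small model standing in for an LSTM/Transformer-style workload
model_fp32 = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 10))
model_fp32.eval()

# Replace Linear modules with dynamically quantized versions: weights are
# stored in int8, activations are quantized on the fly at inference time
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32, {nn.Linear}, dtype=torch.qint8
)

out = model_int8(torch.randn(1, 128))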

Post-Training Static Quantization (PTSQ)

PTSQ is another way to quantize a pre-trained network and acts like a full-stack quantization: the weights, internal activations, and I/O data are all represented in the quantized format, as shown in the diagram below:

# Original model
# All tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                    /
    linear_weight_fp32

# Statically quantized model
# Weights and activations are in int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                    /
  linear_weight_int8

This method is typically used when both memory-bandwidth and compute savings are important. It is suitable for CNNs and GANs, and it is also the best fit for hardware implementations.
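A minimal eager-mode PyTorch sketch of the PTSQ workflow (the model, the calibration data, and the 'fbgemm' backend choice are illustrative): insert observers, run representative data through the model to collect activation statistics, then convert.

import torch
import torch.nn as nn

class M(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()      # fp32 -> int8 at the model input
        self.conv = nn.Conv2d(3, 16, 3)
        self.relu = nn.ReLU()
        self.dequant = torch.quantization.DeQuantStub()  # int8 -> fp32 at the model output

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        return self.dequant(x)

model_fp32 = M().eval()
model_fp32.qconfig = torch.quantization.get_default_qconfig('fbgemm')

# Insert observers, calibrate on sample inputs, then convert to a real int8 model
model_prepared = torch.quantization.prepare(model_fp32)
for _ in range(8):
    model_prepared(torch.randn(1, 3, 32, 32))
model_int8 = torch.quantization.convert(model_prepared)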

Quantization-Aware Training (QAT)

No matter how good the quantization technique is, it will still affect the accuracy of the network, since information is inevitably lost when precision is reduced. In areas where accuracy is vitally important, quantization-aware training comes into play. It relies on software-simulated (fake) quantization during fine-tuning to recover the accuracy of quantized models. The procedure is shown below.

# Original model
# All tensors and computations are in floating point
previous_layer_fp32 -- linear_fp32 -- activation_fp32 -- next_layer_fp32
                      /
    linear_weight_fp32

# Model with fake_quants for modeling quantization numerics during training
previous_layer_fp32 -- fq -- linear_fp32 -- activation_fp32 -- fq -- next_layer_fp32
                           /
   linear_weight_fp32 -- fq

# Quantized model
# Weights and activations are in int8
previous_layer_int8 -- linear_with_activation_int8 -- next_layer_int8
                     /
   linear_weight_int8
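A minimal PyTorch sketch of this workflow (the model, the dummy loss, and the hyperparameters are illustrative): prepare the model with fake-quant modules, fine-tune, then convert to int8.

import torch
import torch.nn as nn

class TinyNet(nn.Module):
    # Small illustrative model wrapped with quant/dequant stubs
    def __init__(self):
        super().__init__()
        self.quant = torch.quantization.QuantStub()
        self.fc = nn.Linear(32, 10)
        self.dequant = torch.quantization.DeQuantStub()

    def forward(self, x):
        return self.dequant(self.fc(self.quant(x)))

model = TinyNet().train()
model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')

# Insert fake-quantize ("fq") modules on weights and activations; the forward
# pass simulates int8 rounding while gradients still flow in fp32
model_prepared = torch.quantization.prepare_qat(model)

# Fine-tune as usual (dummy loss stands in for the real task loss)
optimizer = torch.optim.SGD(model_prepared.parameters(), lr=1e-3)
for _ in range(10):
    out = model_prepared(torch.randn(8, 32))
    loss = out.pow(2).mean()
    optimizer.zero_grad(); loss.backward(); optimizer.step()

# After fine-tuning, convert to a real int8 model
model_prepared.eval()
model_int8 = torch.quantization.convert(model_prepared)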

Ways of Quantizing a Tensor

How a tensor is quantized can be described from two perspectives: the data distribution and the quantization scale.

Choosing by Data Distribution

According to the distribution of the tensor to be quantized, there are two methods based on the choice of the zero point Z:

  • Symmetric Quantization - Z = 0, suitable for symmetric distributions such as weights.
  • Affine Quantization - Z ≠ 0, suitable for asymmetric distributions such as biases.
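For reference, the usual affine mapping is q = clamp(round(x / S) + Z, qmin, qmax), where S is the scale and Z is the zero point; symmetric quantization is the special case Z = 0. Below is a minimal NumPy sketch (the helper names and the int8 ranges are illustrative, not a specific library API):

import numpy as np

def quantize(x, S, Z, qmin=-128, qmax=127):
    # Affine quantization: q = clamp(round(x / S) + Z, qmin, qmax)
    q = np.round(x / S) + Z
    return np.clip(q, qmin, qmax).astype(np.int8)

def dequantize(q, S, Z):
    # Inverse mapping back to real values: x ≈ S * (q - Z)
    return S * (q.astype(np.float32) - Z)

x = np.random.randn(1000).astype(np.float32)

# Symmetric: Z = 0, scale covers the maximum absolute value
S_sym = np.abs(x).max() / 127
q_sym = quantize(x, S_sym, Z=0)

# Affine: Z != 0, scale and zero point cover the full [min, max] range
S_aff = (x.max() - x.min()) / 255
Z_aff = np.round(-128 - x.min() / S_aff)
q_aff = quantize(x, S_aff, Z_aff)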

Choosing the wrong method to quantize a tensor will lead to a huge loss of precision due to wasted dynamic range. Here is a set of figures visualizing the two quantization methods:

[Figure: Quantization of different data distributions]

Choosing by Quantization Scale

Using a single (S, Z) pair may lose precision when the data distributions differ widely between channels. Some tensors, such as weights, are likely to have this property. In such a situation, we can choose per-channel quantization instead of per-tensor quantization.

[Figure: Per-tensor vs. per-channel quantization scales]
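As an illustration, PyTorch exposes both granularities through torch.quantize_per_tensor and torch.quantize_per_channel; the sketch below (with an arbitrary 4x8 weight matrix and symmetric scales) shows how the per-channel scheme assigns one (S, Z) pair per output channel:

import torch

w = torch.randn(4, 8)  # e.g. a weight matrix with 4 output channels

# Per-tensor: one (S, Z) pair shared by the whole tensor
scale = w.abs().max() / 127
w_q_tensor = torch.quantize_per_tensor(w, scale=scale.item(), zero_point=0, dtype=torch.qint8)

# Per-channel: one (S, Z) pair per output channel (axis 0)
scales = w.abs().amax(dim=1) / 127
zero_points = torch.zeros(4, dtype=torch.int64)
w_q_channel = torch.quantize_per_channel(w, scales, zero_points, axis=0, dtype=torch.qint8)

# Per-channel keeps more precision when channel ranges differ widely
print((w - w_q_tensor.dequantize()).abs().max(),
      (w - w_q_channel.dequantize()).abs().max())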

Quantization Method Choosing Strategy

To offer guidance on choosing an appropriate way to quantize a network, the flow diagram below may help:

[Figure: Quantization method selection flow]

