Quantization Algorithm Provided by Google
This document provides an overview of the shortcomings of previous quantization methods and of how the quantization scheme proposed by Google works.
Problems of Previous Methods
When the idea of quantizing DNNs was first proposed, only the weight data were quantized. This reduces memory pressure but does not improve computing efficiency. Targeting the cost of massive floating-point multiplications, binary, ternary, and bit-shift networks were also proposed. However, these modifications generally only speed up processing on specific hardware with highly customized architectures. Some designs even go to the extreme of 1-bit representations, which can hurt accuracy considerably.
Scheme of Google’s Design
Mathematical Derivation
In Google's design, the training process of the DNN is not quantized; only inference is. The quantization is based purely on an affine mapping, with no lookup tables (LUTs), which perform poorly compared to pure arithmetic on SIMD hardware.
Their definition of quantization is:
r = S(q-Z)
where r and q represent the real value and the quantized value, respectively. S is a real-valued scale factor and Z is the zero point, a constant of the same type as q that corresponds to the real value 0. One thing to notice is that all the values in a tensor share a single pair of (S, Z).
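As a quick illustration (the numbers here are my own, not from the paper), suppose a tensor holds real values in [-1.0, 1.0] and is quantized to uint8 with 256 levels. Then S = 2/255 \approx 0.00784 and Z = 128, so:
r = S(q - Z):\quad q=0 \Rightarrow r \approx -1.004,\quad q=128 \Rightarrow r=0,\quad q=255 \Rightarrow r \approx 0.996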
Then we can represent a matrix with the data structure below:
#include <cstdint>
#include <vector>

template<typename QType>  // e.g. QType = uint8_t
struct QuantizedBuffer
{
  std::vector<QType> q;   // the quantized values
  float S;                // the scale
  QType Z;                // the zero-point
};
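As a concrete illustration of how such a buffer might be filled, here is a minimal sketch (not code from the paper; the helper name and the min/max range choice are my assumptions) that derives (S, Z) from a float tensor and quantizes it to uint8:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Hypothetical helper: pick (S, Z) from the observed range of a non-empty
// tensor, keeping the real value 0 exactly representable, then quantize
// every element into a QuantizedBuffer<uint8_t>.
QuantizedBuffer<uint8_t> QuantizeTensor(const std::vector<float>& r)
{
  float rmin = std::min(0.0f, *std::min_element(r.begin(), r.end()));
  float rmax = std::max(0.0f, *std::max_element(r.begin(), r.end()));
  QuantizedBuffer<uint8_t> out;
  out.S = (rmax > rmin) ? (rmax - rmin) / 255.0f : 1.0f;     // spread the range over 256 levels
  out.Z = static_cast<uint8_t>(std::lround(-rmin / out.S));  // real 0 maps exactly to Z
  for (float v : r) {
    long q = std::lround(v / out.S) + out.Z;                 // invert r = S(q - Z)
    out.q.push_back(static_cast<uint8_t>(std::clamp(q, 0L, 255L)));
  }
  return out;
}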
The matrix can be represented as:
r_{\alpha}^{(i, j)}=S_{\alpha}(q_{\alpha}^{(i, j)}-Z_{\alpha})
So that matrix multiplication becomes:
S_{3}(q_{3}^{(i, k)}-Z_{3})= \sum_{j=1}^{N}S_{1}(q_{1}^{(i, j)}-Z_{1})S_{2}(q_{2}^{(j, k)}-Z_{2})
We can get the quantized result q_3^{(i,k)} as:
q_{3}^{(i, k)}=Z_{3}+M \sum_{j=1}^{N}(q_{1}^{(i, j)}-Z_{1})(q_{2}^{(j, k)}-Z_{2})
Where:
M:= \frac{S_{1}S_{2}}{S_{3}}
Since q_1, q_2, q_3 and Z_1, Z_2, Z_3 are all integers, the only remaining floating-point quantity is M. Empirically, M always lies in the interval (0, 1), so we can continue to quantize M by:
M=2^{-n}M_{0}
where M_0 is in the interval [0.5, 1) and n is a non-negative integer. M_0 can then be stored as a fixed-point integer multiplier, so the rescaling by M becomes an integer multiplication followed by a bit shift.
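A sketch of this decomposition (the function name and the Q31 fixed-point format are my own assumptions, loosely modeled on how integer GEMM libraries handle the multiplier, not an exact copy of Google's code):

#include <cmath>
#include <cstdint>

// Decompose M in (0, 1) into M = 2^{-n} * M0 with M0 in [0.5, 1), storing M0
// as a signed 32-bit fixed-point value so that multiplying by M reduces to an
// integer multiplication followed by a right shift of n bits.
void QuantizeMultiplier(double M, int32_t* M0_fixed, int* n)
{
  int exponent;
  double M0 = std::frexp(M, &exponent);            // M = M0 * 2^exponent, M0 in [0.5, 1)
  *n = -exponent;                                  // hence M = 2^{-n} * M0 with n >= 0
  int64_t fixed = std::llround(M0 * (1LL << 31));  // represent M0 in Q31
  if (fixed == (1LL << 31)) { fixed /= 2; --*n; }  // handle M0 rounding up to exactly 1.0
  *M0_fixed = static_cast<int32_t>(fixed);
}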
We can further expand the equation and obtain:
q_{3}^{(i, k)}=Z_{3}+M\left(NZ_{1}Z_{2}-Z_{1}a_{2}^{(k)}-Z_{2}\overline{a}_{1}^{(i)}+\sum_{j=1}^{N}q_{1}^{(i, j)}q_{2}^{(j, k)}\right)
Where:
a_{2}^{(k)}:= \sum_{j=1}^{N}q_{2}^{(j, k)},\ \overline{a}_{1}^{(i)}:=\sum_{j=1}^{N}q_{1}^{(i, j)}
Thus, the overhead of handling the zero-points, i.e. computing a_2^{(k)} and \overline{a}_1^{(i)}, is only O(N^2), while the core sum \sum_{j} q_1^{(i,j)} q_2^{(j,k)} still costs the same O(N^3) integer multiply-adds as an ordinary matrix multiplication.
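To make this arithmetic concrete, here is a plain, unoptimized sketch of the integer-only multiplication; the function name and data layout are mine, and the final rescaling by M is written as a double multiplication for readability rather than the fixed-point (M_0, n) form a real kernel would use:

#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// Integer-only evaluation of
//   q3(i,k) = Z3 + M * ( N*Z1*Z2 - Z1*a2(k) - Z2*a1(i) + sum_j q1(i,j)*q2(j,k) )
// for row-major q1 (rows x N) and q2 (N x cols).
std::vector<uint8_t> QuantizedMatMul(const std::vector<uint8_t>& q1,
                                     const std::vector<uint8_t>& q2,
                                     int rows, int N, int cols,
                                     int32_t Z1, int32_t Z2, int32_t Z3, double M)
{
  // Precompute column sums of q2 and row sums of q1: the O(N^2) zero-point overhead.
  std::vector<int32_t> a2(cols, 0), a1(rows, 0);
  for (int j = 0; j < N; ++j)
    for (int k = 0; k < cols; ++k) a2[k] += q2[j * cols + k];
  for (int i = 0; i < rows; ++i)
    for (int j = 0; j < N; ++j) a1[i] += q1[i * N + j];

  std::vector<uint8_t> q3(rows * cols);
  for (int i = 0; i < rows; ++i) {
    for (int k = 0; k < cols; ++k) {
      int32_t acc = 0;                             // 32-bit accumulator
      for (int j = 0; j < N; ++j)                  // core uint8 x uint8 multiply-adds
        acc += static_cast<int32_t>(q1[i * N + j]) * static_cast<int32_t>(q2[j * cols + k]);
      acc += N * Z1 * Z2 - Z1 * a2[k] - Z2 * a1[i];  // zero-point correction terms
      int32_t out = Z3 + static_cast<int32_t>(std::lround(M * acc));
      q3[i * cols + k] = static_cast<uint8_t>(std::min(255, std::max(0, out)));  // saturate
    }
  }
  return q3;
}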
If symmetric quantization is used, meaning Z_1 = Z_2 = Z_3 = 0, the result further reduces to:
q_{3}^{(i, k)}=M \sum_{j=1}^{N} q_{1}^{(i, j)}q_{2}^{(j, k)}
Quantization Width Selection
Width of Biases
Since one bias value is added to many output activations, even a slight error in it propagates widely. For this reason, biases are represented as 32-bit integers to preserve precision. This matches the accumulation buffer, which must also be at least 32 bits wide to hold the summed multiplication results.
Width of Weights and Activations
Accordingly, weights and activations are mapped to 8-bit unsigned integers, and their products are accumulated into a 32-bit integer. The core operation thus has the form:
\text{int32} \;{+}{=}\; \text{uint8} \times \text{uint8}
Biases are quantized with the scale equal to the product of the activation and weight scales and a zero-point of 0:
S_{\text{bias}}=S_{1}S_{2},\quad Z_{\text{bias}}=0
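A minimal sketch of bias quantization under these parameters (the helper name is my own):

#include <cmath>
#include <cstdint>
#include <vector>

// Quantize a float bias vector to int32 with scale S_bias = S1 * S2
// (input-activation scale times weight scale) and zero-point 0.
std::vector<int32_t> QuantizeBias(const std::vector<float>& bias, float S1, float S2)
{
  std::vector<int32_t> q;
  for (float b : bias)
    q.push_back(static_cast<int32_t>(std::lround(b / (S1 * S2))));
  return q;
}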
Layer Fusion
Quantization also makes it easy to fuse activation layers such as ReLU and ReLU6 into the preceding layer. The only thing to do is to choose the output constant pair (S, Z) so that the saturation to the quantized range implements the activation's clipping thresholds.
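For example, for a layer followed by ReLU6, the output parameters can be chosen so that the representable real range is exactly [0, 6]; the uint8 saturation the kernel performs anyway then is the activation. The concrete constants below are my illustration, not values from the paper:

#include <algorithm>
#include <cstdint>

// Choosing the output quantization so that clamping to [0, 255] in the
// quantized domain equals clamping to [0.0, 6.0] in the real domain (ReLU6).
constexpr float kS3 = 6.0f / 255.0f;  // real range [0, 6] spread over 256 levels
constexpr int32_t kZ3 = 0;            // real 0.0 maps to the quantized value 0

inline uint8_t SaturateOutput(int32_t q)
{
  // This saturation happens at the end of the kernel anyway; with the
  // (S, Z) choice above it also applies ReLU6 for free.
  return static_cast<uint8_t>(std::min(255, std::max(0, q)));
}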
Quantization Simulation in Training
Cause of Accuracy Degradation
The traditional way of obtaining a quantized network is to train it naturally in floating point and then convert it to a quantized version. Large models with a massive number of parameters are robust enough to tolerate the resulting corruption, but smaller networks can lose a significant amount of accuracy.
Two further distribution-related effects also hurt model performance:
- Large differences between the value distributions of different channels make a single (S, Z) pair per tensor inaccurate.
- A few outlier values stretch the representable range, which shrinks the effective resolution of the quantization.
Training with Quantization
Training with simulated quantization follows these rules:
- Activations are quantized at the points where they would be quantized during inference.
- Weights are quantized before being fed into the layer (after batch-norm folding, if batch normalization is used).
- Biases are not quantized, since 32-bit values already provide high enough precision.
To simulate quantization of a tensor in software, the following function is used:
\begin{align*} \text{clamp}(r; a, b) &:= \min(\max(r, a), b)\\ s(a, b, n) &:= \frac{b-a}{n-1}\\ q(r; a, b, n) &:= \left\lfloor\frac{\text{clamp}(r; a, b)-a}{s(a, b, n)}\right\rceil s(a, b, n)+a \end{align*}
It can be described as the steps below:
- Clamp the tensor values to the boundary [a, b].
- Compute the step size s(a, b, n) from the range and the number of quantization levels n.
- Scale the clamped value, round it to the nearest level, then rescale and add back the offset a.
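A direct translation of this function into code might look like the following sketch (during training, gradients are typically passed straight through the rounding step, which is not shown here):

#include <algorithm>
#include <cmath>

// Simulated (fake) quantization: clamp r to [a, b], snap it to one of n
// evenly spaced levels, and return the result still as a float.
float FakeQuantize(float r, float a, float b, int n)
{
  float clamped = std::min(std::max(r, a), b);    // clamp(r; a, b)
  float s = (b - a) / static_cast<float>(n - 1);  // step size s(a, b, n)
  return std::round((clamped - a) / s) * s + a;   // round to the nearest level, shift back
}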
Quantization Range
Range for Weights
Weights are quantized symmetrically to int8 values in [-127, 127].
The value -128 is left unused because of an optimization described in Appendix B of the paper, but I cannot find Appendix B in the IEEE Xplore version.
Range for Activations
The range of activations is determined via an exponential moving average (EMA) of the observed ranges during training. However, EMA is not applied for roughly the first 50 thousand to 2 million steps, which lets the network converge to a stable state first.
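A sketch of how such EMA range tracking might be implemented (the struct and the decay value are my own illustration, not numbers from the paper):

// Smooth per-batch min/max observations of an activation tensor into the
// range that will later be used as (a, b) for quantization.
struct EmaRange
{
  float min = 0.0f, max = 0.0f;
  bool initialized = false;
  void Update(float batch_min, float batch_max, float decay = 0.99f)
  {
    if (!initialized) { min = batch_min; max = batch_max; initialized = true; return; }
    min = decay * min + (1.0f - decay) * batch_min;
    max = decay * max + (1.0f - decay) * batch_max;
  }
};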
Quantized Batch Normalization
If batch normalization is applied to a model, it is folded into the weights (and biases) as shown in the equation below:
w_{\text{fold}}:= \frac{\gamma w}{\sqrt{\sigma_{B}^{2}+\varepsilon}}
Where \gamma is the batch normalization’s scale parameter, \sigma_B^2 is the estimate of the variance of convolution results across the batch, and \varepsilon is just a small constant for numerical stability.
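As a sketch of this folding for a single output channel (the batch-norm shift \beta would be folded into the bias analogously but is omitted here; the default \varepsilon value is an assumption):

#include <cmath>
#include <vector>

// Fold batch normalization into convolution weights for one output channel:
// w_fold = gamma * w / sqrt(var + eps).
std::vector<float> FoldWeights(const std::vector<float>& w,
                               float gamma, float var, float eps = 1e-3f)
{
  const float factor = gamma / std::sqrt(var + eps);
  std::vector<float> w_fold;
  w_fold.reserve(w.size());
  for (float wi : w) w_fold.push_back(factor * wi);
  return w_fold;
}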
Training Steps
The training will be conducted in the following manner:
- Create a standard model.
- Insert simulated quantization operators.
- Train with simulated quantization and collect the quantization parameters.
- Convert the model to a quantized version.
- Run quantized inference.