  • Ei Compendex (EI) source journal
  • Chinese core journal
  • Source journal for China science and technology paper statistics
  • Source journal of the Chinese Science Citation Database (CSCD)


A survey of quantization methods for deep neural networks

YANG Chun, ZHANG Ruiyao, HUANG Long, TI Shutong, LIN Jinhui, DONG Zhiwei, CHEN Songlu, LIU Yan, YIN Xucheng

Citation: YANG Chun, ZHANG Ruiyao, HUANG Long, TI Shutong, LIN Jinhui, DONG Zhiwei, CHEN Songlu, LIU Yan, YIN Xucheng. A survey of quantization methods for deep neural networks[J]. Chinese Journal of Engineering, 2023, 45(10): 1613-1629. doi: 10.13374/j.issn2095-9389.2022.12.27.004


doi: 10.13374/j.issn2095-9389.2022.12.27.004
Funding: National New Generation Artificial Intelligence (2030) Major Project (2020AAA0109701); National Natural Science Foundation of China (62076024, 62006018); Fundamental Research Funds for the Central Universities (FRF-IDRY-21-018)
Details
    Corresponding author: E-mail: xuchengyin@ustb.edu.cn

  • CLC number: TP183

A survey of quantization methods for deep neural networks

  • Abstract: In recent years, using large pre-trained models to improve the generalization ability and performance of deep neural networks on downstream tasks such as computer vision and natural language processing has become a clear trend in deep-learning-based artificial intelligence. Although these deep neural network models perform excellently, their complex structures, huge parameter counts, and extremely high computational costs still make them hard to deploy on resource-constrained edge and end-side hardware platforms such as household appliances and smartphones, which greatly hinders the application of AI technology. Model compression and acceleration have therefore long been key to the large-scale commercial adoption of deep neural network models. Among the many compression and acceleration schemes, model quantization is one of the principal effective methods: by reducing the bit width of model parameters and of intermediate feature maps, it compresses and accelerates a deep neural network so that the quantized model can be deployed on resource-limited edge devices. However, because quantization discards a large amount of information, quantizing a model while preserving its task accuracy has become a hot research problem. In addition, driven by differences in hardware and application scenarios, model quantization has developed into a multi-branch research topic. This survey comprehensively reviews quantization techniques from these different angles, summarizes the advantages and disadvantages of each class of methods, identifies the problems that remain, and points out directions for future development.
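The bit-width reduction described above can be sketched in a few lines (an illustrative example, not code from the paper; the function names are ours): a float32 tensor is mapped to int8 through a scale (and, in the asymmetric case, a zero-point), and dequantizing it back shows the information loss quantization introduces.

```python
import numpy as np

def quantize(x, num_bits=8, symmetric=True):
    """Map a float tensor to low-bit signed integers via a scale (and zero-point)."""
    qmin, qmax = -2 ** (num_bits - 1), 2 ** (num_bits - 1) - 1   # int8: [-128, 127]
    if symmetric:
        scale = np.abs(x).max() / qmax                 # zero-point fixed at 0
        zero_point = 0
    else:
        scale = (x.max() - x.min()) / (qmax - qmin)    # shifted (asymmetric) range
        zero_point = int(round(qmin - x.min() / scale))
    q = np.clip(np.round(x / scale) + zero_point, qmin, qmax).astype(np.int8)
    return q, scale, zero_point

def dequantize(q, scale, zero_point):
    """Recover an approximation of the original float values."""
    return scale * (q.astype(np.float32) - zero_point)

rng = np.random.default_rng(0)
w = rng.normal(size=(64, 64)).astype(np.float32)   # stand-in for one weight matrix
q, s, z = quantize(w)                              # 4x smaller than float32 storage
max_err = np.abs(w - dequantize(q, s, z)).max()    # information loss, at most ~s/2
```

With `symmetric=False` the same helper produces the shifted zero-point mapping often used for activations with one-sided ranges.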


  • Figure  1.  Illustration of matrix multiplication (MatMul): (a) full-precision matrix multiplication; (b) quantized low-bit matrix multiplication
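The contrast this caption describes can be sketched as follows (a minimal example under our own naming, not the paper's implementation): the low-precision path multiplies int8 operands, accumulates in int32 as integer hardware does, and applies a single float rescale at the output.

```python
import numpy as np

def sym_quant(x, num_bits=8):
    """Symmetric per-tensor quantization: x ≈ scale * q with q stored as int8."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(x).max() / qmax
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8), scale

rng = np.random.default_rng(1)
a = rng.normal(size=(32, 48)).astype(np.float32)
b = rng.normal(size=(48, 16)).astype(np.float32)

y_fp = a @ b                                      # (a) full-precision MatMul

qa, sa = sym_quant(a)                             # (b) quantized low-bit MatMul
qb, sb = sym_quant(b)
acc = qa.astype(np.int32) @ qb.astype(np.int32)   # integer-only inner products
y_lp = (sa * sb) * acc                            # one dequantization at the output

rel_err = np.abs(y_fp - y_lp).max() / np.abs(y_fp).max()   # small at 8 bits
```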

    Figure  2.  Illustration of symmetric and asymmetric quantization mapping: (a) symmetric quantization; (b) asymmetric quantization

    Figure  3.  Graphs of uniform and non-uniform quantization functions: (a) uniform quantization; (b) logarithmic quantization; (c) adaptive non-uniform quantization
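For the logarithmic case in panel (b), a hedged sketch (our own helper with an illustrative bit budget, not code from the surveyed methods): each value is snapped to the nearest signed power of two, so multiplication by a quantized weight can be implemented as a bit shift.

```python
import numpy as np

def log2_quant(x, num_bits=4):
    """Snap |x| to the nearest power of two and keep the sign; zero maps to zero."""
    sign = np.sign(x)
    mag = np.abs(x)
    e_max = np.floor(np.log2(mag.max()))       # largest exponent the tensor needs
    e_min = e_max - (2 ** num_bits - 2)        # exponent window; one code kept for 0
    e = np.clip(np.round(np.log2(np.where(mag > 0, mag, 1.0))), e_min, e_max)
    q = sign * 2.0 ** e
    return np.where(mag > 0, q, 0.0)

x = np.array([0.11, -0.24, 0.5, 1.9, 0.0], dtype=np.float32)
xq = log2_quant(x)   # → [0.125, -0.25, 0.5, 1.0, 0.0]
```

Note how the quantization levels are dense near zero and sparse far from it, which is exactly the non-uniform spacing the figure contrasts with the uniform case.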

    Figure  4.  Illustration of quantization-aware training and post-training quantization
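The two regimes in this caption can be sketched as follows (illustrative code, not from the paper): quantization-aware training (QAT) runs a "fake quantize" op in the forward pass and uses the straight-through estimator (STE) to pass gradients through the non-differentiable rounding, while post-training quantization (PTQ) only calibrates scales on the already-trained model.

```python
import numpy as np

def fake_quant(x, scale, num_bits=8):
    """Quantize-then-dequantize in float: the op QAT inserts into the forward pass."""
    qmax = 2 ** (num_bits - 1) - 1
    return scale * np.clip(np.round(x / scale), -qmax, qmax)

# --- QAT-style step (sketch): weights stay float, but the loss sees quantized values
w = np.array([0.31, -0.07, 0.52], dtype=np.float32)
scale = np.abs(w).max() / 127
w_q = fake_quant(w, scale)        # forward pass uses w_q instead of w
grad_wq = np.ones_like(w)         # pretend dL/dw_q arrived from backprop
w -= 0.1 * grad_wq                # STE: round() treated as identity, so dL/dw ≈ dL/dw_q

# --- PTQ-style calibration (sketch): choose the scale from data, no gradient steps
calib = np.random.default_rng(2).normal(size=1000).astype(np.float32)
ptq_scale = np.abs(calib).max() / 127   # a simple max-value calibrator
```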

    Figure  5.  Illustration of generative data-free quantization: (a) distillation-based method; (b) GAN-based method

    Figure  6.  Illustration of mixed-precision quantization
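As a toy illustration of the idea (the layer names, error budget, and greedy rule are all ours, not a method from the surveyed papers): each layer receives its own bit width, here the cheapest one whose reconstruction error stays under a budget, so an outlier-heavy layer naturally ends up with higher precision.

```python
import numpy as np

def quant_error(w, num_bits):
    """Mean squared error of symmetric uniform quantization at a given bit width."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.abs(w).max() / qmax
    w_hat = scale * np.clip(np.round(w / scale), -qmax, qmax)
    return float(np.mean((w - w_hat) ** 2))

rng = np.random.default_rng(3)
conv1 = rng.normal(size=(64, 27))    # hypothetical well-behaved layer
fc_out = rng.normal(size=(10, 128))
fc_out[0, 0] = 20.0                  # one outlier stretches the quantization range

budget = 0.05                        # illustrative per-layer MSE budget
plan = {}
for name, w in [("conv1", conv1), ("fc_out", fc_out)]:
    # cheapest bit width meeting the budget; fall back to 16 if none does
    plan[name] = next((b for b in (2, 4, 8) if quant_error(w, b) < budget), 16)
# the outlier-heavy layer ends up with a higher bit width than conv1
```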

Publication history
  • Received:  2022-12-27
  • Published online:  2023-03-24
  • Published in issue:  2023-10-25
