
Category semantic and global relation distillation for object detection

  • Abstract: Knowledge distillation is a model compression technique that transfers knowledge from a complex teacher model to a lightweight student model. Existing knowledge distillation methods designed for classification tasks perform poorly on object detection, yielding only marginal improvements. Unlike classification, object detection must simultaneously localize and classify multiple target objects in natural images; these objects vary in scale and appearance, exhibit complex inter-class relationships, and are distributed across different locations. As a result, object centers and their surrounding regions, as well as foreground and background regions, may contribute differently to distillation; knowledge becomes ambiguous and imbalanced in detection tasks, making knowledge distillation for object detection highly challenging. To address this problem, this paper proposes a new attention-based knowledge distillation framework for object detection: category semantic and global relation distillation. The former focuses on the key foreground locations of each category, while the latter captures global long-range dependencies among target pixels across categories. To verify the effectiveness and generality of the proposed method, experiments were conducted on three challenging benchmarks: SODA10M, PASCAL VOC, and MiniCOCO. Across multiple object detectors, the distilled student models achieved substantial performance improvements. For RetinaNet with ResNet-50, mAP improved by 4.67 on SODA10M, and AP50 improved by 2.64 on PASCAL VOC.

     

    Abstract: Object detection, a fundamental task in computer vision, has witnessed remarkable success in domains such as autonomous driving, robotics, and facial recognition, owing to advancements in convolutional neural networks. Despite these successes, state-of-the-art models for object detection often come with a large number of parameters, pushing the limits of modern hardware and posing challenges for deployment on devices with limited resources. To address this challenge, various model compression techniques have been developed, including network pruning, lightweight architecture design, neural network quantization, and knowledge distillation. Knowledge distillation stands out because it transfers knowledge from large teacher models to compact student models without modifying the network structure, enabling the student models to perform nearly as well as their larger counterparts. However, most distillation techniques have been optimized for image classification rather than object detection, which involves simultaneously detecting and classifying multiple target objects within natural images. These objects often vary in scale, exhibit intricate interclass relationships, and are dispersed across different locations. These factors make it difficult to balance the contributions of different elements, such as bounding box centers and backgrounds, during distillation. Consequently, incorporating knowledge distillation into object detection models poses substantial challenges. To address these challenges, this study proposes a novel attention-based knowledge distillation framework for object detection, striking a better balance between efficiency and accuracy. The contributions of this study are as follows: first, it introduces category semantic attention to accurately identify and focus on the foreground semantic regions of each class in the output feature map of the teacher detector's neck feature pyramid. This process effectively communicates crucial positional information for each class to the student model and helps manage challenges related to multiscale targets. To mitigate differences between the teacher and student feature maps, this study normalizes the feature maps used for distillation, ensuring they have zero mean and unit variance. Furthermore, to improve the handling of background information in category semantic distillation, and to tackle the disrupted relationships between foreground and background regions as well as the overlooked relationships among targets of different classes, this study proposes a criss-cross attention mechanism. This mechanism captures long-range dependencies between target pixels in the teacher model, which are then transmitted to the student model to further enhance its detection capability. Combining these two distillation techniques, this study introduces the category semantic and global relation (CSGR) distillation approach: the first technique targets crucial foreground positions for each class, whereas the second captures global relationships among target pixels across different classes. To validate the effectiveness and generalization of the proposed method, extensive experiments were conducted on challenging benchmarks, including SODA10M, PASCAL VOC, and MiniCOCO. Across various object detectors, the student models distilled through CSGR distillation exhibit impressive improvements compared with those trained from scratch. Compared with other baseline methods, the proposed approach achieved competitive improvements in mean average precision without considerably increasing the number of parameters and FLOPS during distillation training, thereby striking a better balance between accuracy and efficiency.
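The feature-map normalization step described in the abstract can be sketched as follows; this is a minimal NumPy illustration, assuming per-channel statistics over a (C, H, W) feature map, and the function names and the optional attention-mask argument are hypothetical rather than taken from the paper:

```python
import numpy as np

def normalize_feature_map(f, eps=1e-6):
    """Normalize a (C, H, W) feature map to zero mean, unit variance per channel."""
    mean = f.mean(axis=(1, 2), keepdims=True)
    std = f.std(axis=(1, 2), keepdims=True)
    return (f - mean) / (std + eps)

def normalized_distillation_loss(teacher_feat, student_feat, mask=None):
    """Mean-squared error between normalized teacher and student feature maps.

    mask: optional (H, W) weight map, e.g. an attention map emphasizing
    foreground regions of each class (illustrative placeholder only).
    """
    t = normalize_feature_map(teacher_feat)
    s = normalize_feature_map(student_feat)
    sq = (t - s) ** 2
    if mask is not None:
        sq = sq * mask[None, :, :]  # broadcast mask over channels
    return sq.mean()
```

Normalizing both maps before the loss removes scale and offset mismatches between teacher and student features, so the distillation signal reflects spatial structure rather than raw magnitude differences.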

     
