<th id="5nh9l"></th><strike id="5nh9l"></strike><th id="5nh9l"><noframes id="5nh9l"><th id="5nh9l"></th><strike id="5nh9l"></strike>
<progress id="5nh9l"><noframes id="5nh9l"><th id="5nh9l"><noframes id="5nh9l">
<th id="5nh9l"></th> <strike id="5nh9l"><noframes id="5nh9l"><span id="5nh9l"></span>
<progress id="5nh9l"><noframes id="5nh9l"><span id="5nh9l"><noframes id="5nh9l"><span id="5nh9l"></span><strike id="5nh9l"><noframes id="5nh9l"><strike id="5nh9l"></strike>
<span id="5nh9l"><noframes id="5nh9l">
<span id="5nh9l"><noframes id="5nh9l">
<span id="5nh9l"></span><span id="5nh9l"><video id="5nh9l"></video></span>
<th id="5nh9l"><noframes id="5nh9l"><th id="5nh9l"></th>
<progress id="5nh9l"><noframes id="5nh9l">

Traffic environment perception algorithm based on multi-task feature fusion and orthogonal attention

  • 摘要 (Abstract): Designing collaborative multi-task perception algorithms for autonomous driving scenes remains challenging. To address this, a deep convolutional neural network, MTEPN, is proposed to perform three visual tasks simultaneously: vehicle target detection, drivable road area extraction, and lane line segmentation. First, the CSPDarkNet network is used to extract basic features from traffic scene images. Next, a feature aggregation module, C2f-K, is designed to obtain finer-grained global image features. An orthogonal attention mechanism, HWAttention, is then proposed to reduce computation while enhancing spatial-scale image features, and a cross-task information aggregation module, CFAS, is introduced to fuse complementary pattern information between tasks. Finally, decoupled task heads produce the three perception outputs. Experiments on the public BDD100k dataset show that the proposed algorithm reaches a mean average precision (mAP) of 79.4% for target detection and a mean pixel intersection-over-union (mIoU) of 92.4% for drivable area extraction, both exceeding mainstream multi-task perception algorithms of the same parameter scale; lane line segmentation accuracy (IoU) is a sub-optimal 27.2%. With only 7.9M parameters and a single-frame processing time of 24.3 ms, the model offers good overall performance. The code will be released at https://github.com/XMUT-Vsion-Lab/MTEPN.
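The abstract does not spell out how HWAttention is built, but the stated idea of orthogonal attention, i.e. attending along the image height and width separately to cut computation while strengthening spatial features, can be illustrated with a minimal sketch. Everything below (the class name OrthogonalHWAttention, the mean-pooled descriptors, the bottleneck sizes) is an illustrative assumption written in PyTorch, not the released MTEPN code.

```python
# Minimal sketch of an orthogonal (height/width-factorized) attention block.
# Illustrative assumption only; it is not the paper's HWAttention implementation.
import torch
import torch.nn as nn


class OrthogonalHWAttention(nn.Module):
    """Re-weights features using pooled descriptors along H and W separately."""

    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        hidden = max(channels // reduction, 8)
        # Shared bottleneck applied to the pooled height- and width-descriptors.
        self.mlp = nn.Sequential(
            nn.Conv2d(channels, hidden, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(hidden, channels, kernel_size=1),
        )
        self.sigmoid = nn.Sigmoid()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, C, H, W)
        # Pool along W -> one descriptor per row, shape (B, C, H, 1).
        h_desc = x.mean(dim=3, keepdim=True)
        # Pool along H -> one descriptor per column, shape (B, C, 1, W).
        w_desc = x.mean(dim=2, keepdim=True)
        # Attention over the two orthogonal axes costs O(H + W) per channel
        # instead of O(H * W) for dense spatial attention.
        h_attn = self.sigmoid(self.mlp(h_desc))
        w_attn = self.sigmoid(self.mlp(w_desc))
        return x * h_attn * w_attn


if __name__ == "__main__":
    feat = torch.randn(2, 64, 80, 80)   # dummy backbone feature map
    out = OrthogonalHWAttention(64)(feat)
    print(out.shape)                    # torch.Size([2, 64, 80, 80])
```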

     

    Abstract: In autonomous driving, the design and implementation of collaborative multi-task perception algorithms remain challenging. The difficulties stem chiefly from the need for real-time processing, effective feature sharing across diverse tasks, and seamless information fusion, all of which are critical to the safety and efficiency of autonomous systems navigating complex traffic environments. To address these concerns, we propose MTEPN, a deep convolutional neural network designed to perform multiple visual tasks concurrently. The framework targets three essential objectives: vehicle target detection, extraction of drivable road areas, and segmentation of lane lines. By integrating these tasks into a unified model, MTEPN enhances the perceptual capability of autonomous driving systems and improves their ability to operate in real-world settings. MTEPN is built upon the CSPDarkNet network, which extracts fundamental features from traffic scene images; its lateral connection mechanism strengthens feature extraction and establishes a robust basis for subsequent multi-task processing. This initial step is crucial, since the quality of the extracted features bounds the performance of the entire system. Next, a multi-channel deformable feature aggregation module, C2f-K, is proposed. It captures fine-grained global image features by facilitating cross-layer information fusion; by integrating features across scales, C2f-K suppresses background noise and interference and improves the model's understanding of complex scenes. To further raise efficiency and accuracy, an orthogonal attention mechanism named HWAttention is introduced. It minimizes computational load while amplifying significant spatial features of the input images; by selectively focusing on critical regions of interest, HWAttention boosts the model's performance in varied environments and keeps it efficient under real-time constraints. A further contribution of MTEPN is the cross-task feature aggregation structure (CFAS). This module promotes information complementarity between tasks by implicitly modeling the global context relationships among the visual tasks; the fusion of complementary pattern information deepens feature sharing and improves the recognition accuracy of each task, fostering a synergy that is missing from traditional methods, which treat tasks in isolation. In addition, the decoupled task head module processes the three perception objectives independently, which increases the flexibility of the model and allows each task to be optimized with a tailored strategy. In experiments on the BDD100k public dataset, MTEPN achieves a mean average precision (mAP) of 79.4% for vehicle target detection and a mean pixel intersection-over-union (mIoU) of 92.4% for drivable-area extraction, both surpassing existing mainstream multi-task perception algorithms of comparable parameter scale. Lane line segmentation accuracy, measured by IoU, reaches a sub-optimal value of 27.2%. MTEPN keeps a modest parameter count of only 7.9 million and processes a single frame in just 24.3 ms, demonstrating its suitability for real-time autonomous driving applications where both speed and accuracy are paramount. The code will be made publicly available at https://github.com/XMUT-Vsion-Lab.
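To make the shared-backbone, decoupled-head layout described above concrete, the following sketch wires a toy feature extractor to three independent heads for vehicle detection, drivable-area segmentation, and lane-line segmentation. The module names (TinyBackbone, MultiTaskPerceptionNet), channel counts, and output shapes are hypothetical stand-ins chosen for the example; they are not taken from the MTEPN repository.

```python
# Minimal sketch of a shared backbone feeding three decoupled task heads.
# All names and shapes are illustrative assumptions, not the released MTEPN code.
import torch
import torch.nn as nn


class TinyBackbone(nn.Module):
    """Stand-in for the CSPDarkNet backbone + neck (8x downsampled features)."""

    def __init__(self, out_channels: int = 64):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(32, 48, 3, stride=2, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(48, out_channels, 3, stride=2, padding=1), nn.ReLU(inplace=True),
        )

    def forward(self, x):
        return self.stem(x)


class MultiTaskPerceptionNet(nn.Module):
    def __init__(self, num_det_outputs: int = 6, feat_channels: int = 64):
        super().__init__()
        self.backbone = TinyBackbone(feat_channels)
        # Decoupled heads: each task gets its own branch on the shared features.
        self.det_head = nn.Conv2d(feat_channels, num_det_outputs, 1)  # box/obj/cls maps
        self.drivable_head = nn.Conv2d(feat_channels, 1, 1)           # drivable-area logits
        self.lane_head = nn.Conv2d(feat_channels, 1, 1)               # lane-line logits

    def forward(self, images):
        feats = self.backbone(images)
        return {
            "detection": self.det_head(feats),
            "drivable_area": self.drivable_head(feats),
            "lane_line": self.lane_head(feats),
        }


if __name__ == "__main__":
    net = MultiTaskPerceptionNet()
    outs = net(torch.randn(1, 3, 256, 256))
    for name, tensor in outs.items():
        print(name, tuple(tensor.shape))
```

Because the heads are decoupled, each branch can use its own loss and optimization schedule while still sharing (and jointly refining) the backbone features, which is the benefit the abstract attributes to this design.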

     
