
Visible–infrared fusion object detection based on feature enhancement and alignment fusion

  • Abstract: Visible–infrared fusion object detection can achieve better detection performance by fusing information from different modalities. However, the precise alignment of feature maps from different modalities and the efficient fusion of such multimodal information remain key challenges in visible–infrared fusion object detection. To address these problems, this paper proposes a visible–infrared fusion object detection method named F3M-Det, which performs detection through cross-modal feature enhancement, alignment, and fusion. First, a feature enhancement module based on cross-modal cross-attention is designed and inserted into each stage of the backbone network; it guides the backbone to attend to the regions where targets are located and improves the feature representations of both modalities. Second, a global-to-local feature alignment module is designed, which progressively aligns the feature maps of the two modalities from the global level down to the local level, making the model applicable to scenes with unaligned visible–infrared image pairs. Finally, a frequency-aware feature fusion module is proposed, which overcomes the discrepancies between feature maps from different modalities and effectively fuses the multimodal features of the infrared and visible images. Comprehensive experimental evaluations are conducted on the widely used DVTOD and LLVIP datasets. Comparisons with other algorithms show that the proposed method outperforms existing visible–infrared fusion object detection methods, and ablation results further verify the effectiveness of each designed module.
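The feature enhancement module described above is based on cross-modal cross-attention inserted at each backbone stage. As a rough illustration of that idea (not the paper's implementation), the sketch below lets the feature map of one modality attend to a pooled view of the other modality's feature map and adds the result back as a residual. The channel count, head count, pooling size, the shared weights across the two directions, and the name CrossModalEnhance are all assumptions made for the example.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CrossModalEnhance(nn.Module):
        """Sketch of cross-modal enhancement: queries come from the modality being
        enhanced, keys/values from a pooled view of the other modality."""

        def __init__(self, channels: int, num_heads: int = 4, pool_size: int = 8):
            super().__init__()
            self.pool_size = pool_size  # assumed pooling size to keep attention cheap
            self.attn = nn.MultiheadAttention(channels, num_heads, batch_first=True)
            self.norm = nn.LayerNorm(channels)

        def forward(self, x_q: torch.Tensor, x_kv: torch.Tensor) -> torch.Tensor:
            b, c, h, w = x_q.shape
            # Queries: full-resolution tokens from the modality being enhanced.
            q = x_q.flatten(2).transpose(1, 2)                # (B, H*W, C)
            # Keys/values: pooled tokens from the other modality.
            kv = F.adaptive_avg_pool2d(x_kv, self.pool_size)  # (B, C, p, p)
            kv = kv.flatten(2).transpose(1, 2)                # (B, p*p, C)
            out, _ = self.attn(self.norm(q), kv, kv)
            # Residual connection keeps the original modality-specific information.
            return x_q + out.transpose(1, 2).reshape(b, c, h, w)

    # Example: enhance visible features with infrared cues and vice versa
    # (weights are shared across directions here only to keep the sketch short).
    vis = torch.randn(2, 64, 40, 40)   # visible feature map from one backbone stage
    ir = torch.randn(2, 64, 40, 40)    # infrared feature map from the same stage
    fem = CrossModalEnhance(channels=64)
    vis_enhanced, ir_enhanced = fem(vis, ir), fem(ir, vis)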

     

    Abstract: Single-modal object detectors have developed rapidly and achieved remarkable results in recent years. However, these detectors still exhibit significant limitations, primarily because they cannot leverage the complementary information intrinsic to multimodal images. Visible–infrared object detection addresses challenges such as poor visibility under low-light conditions by fusing information from visible and infrared images to exploit complementary features across the two modalities. However, the precise alignment of feature maps from different modalities and the efficient fusion of modality-specific information remain key challenges in this field. Although various methods have been proposed to address these issues, effectively handling modality differences, enhancing the complementarity of cross-modal information, and achieving efficient feature fusion continue to be bottlenecks for high-performance object detectors. To overcome these challenges, this study proposes a visible–infrared object detection method called F3M-Det, which significantly improves detection performance by enhancing, aligning, and fusing cross-modal features. The core concept of F3M-Det is to fully leverage the complementarity between visible and infrared images, thereby enhancing the model's ability to understand and process cross-modal information. Specifically, the core components of F3M-Det are a feature extraction backbone, a feature enhancement module (FEM), a feature alignment module (FAM), and a feature fusion module (FFM). The FEM utilizes cross-modal attention mechanisms to significantly enhance the expressive power of both visible and infrared image features. By effectively capturing subtle differences and complementary information between the modalities, the FEM enables F3M-Det to achieve higher detection accuracy. To reduce the computational cost of computing cross-attention over global feature maps while retaining their useful information, the FEM employs a multiscale feature-pooling method to reduce the dimensionality of the feature maps. Next, the FAM is introduced to effectively align feature maps from different modalities. The FAM combines global information with local details to ensure that features captured from different perspectives and scales are accurately aligned. This approach reduces modality differences and improves the comparability of cross-modal information. The design of the FAM allows the model to handle misalignments between modalities in complex environments, thereby enhancing the robustness and generalization ability of F3M-Det. Finally, the FFM is introduced to achieve efficient fusion of cross-modal features. The FFM incorporates frequency-aware mechanisms to reduce irrelevant modality differences during feature fusion while preserving useful complementary information, thereby enhancing the effectiveness of the fused features. The FFM is also used as a cross-scale feature fusion module (SFFM) to reduce information loss. F3M-Det uses YOLOv5 as its baseline: it builds a dual-stream backbone network on CSPDarknet and integrates the FPN structure and detection head from YOLOv5. To validate the effectiveness of the proposed F3M-Det, we conducted comprehensive experimental evaluations on two widely used datasets: the unaligned DVTOD dataset and the aligned LLVIP dataset.
The experimental results show that F3M-Det outperforms existing visible–infrared image object detection methods on both datasets, demonstrating its superiority in handling cross-modal feature alignment and fusion. Additionally, ablation experiments were conducted to investigate the impact of each module on F3M-Det’s performance. The results demonstrate the importance of each proposed module in enhancing detection accuracy, further validating the effectiveness and superiority of F3M-Det.
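The abstract describes the FFM's frequency-aware mechanism only at a high level. The sketch below shows one plausible form such a step could take, assuming a 2-D FFT is used to split each feature map into low- and high-frequency bands that are then blended across modalities with learned gates; the band-splitting radius, the sigmoid gating, and the 1×1 projection are illustrative assumptions rather than the paper's actual formulation.

    import torch
    import torch.nn as nn

    class FrequencyAwareFusion(nn.Module):
        """Sketch of frequency-aware fusion of visible and infrared feature maps."""

        def __init__(self, channels: int, radius: int = 8):
            super().__init__()
            self.radius = radius  # assumed cutoff between low- and high-frequency bands
            # Learned per-channel gates decide how much of each band comes from each modality.
            self.low_gate = nn.Parameter(torch.zeros(1, channels, 1, 1))
            self.high_gate = nn.Parameter(torch.zeros(1, channels, 1, 1))
            self.proj = nn.Conv2d(channels, channels, kernel_size=1)

        def _split_bands(self, x: torch.Tensor):
            # Low-pass filter by keeping only the centered low-frequency FFT coefficients.
            freq = torch.fft.fftshift(torch.fft.fft2(x, norm="ortho"), dim=(-2, -1))
            h, w = x.shape[-2:]
            mask = torch.zeros_like(freq.real)
            mask[..., h // 2 - self.radius:h // 2 + self.radius,
                      w // 2 - self.radius:w // 2 + self.radius] = 1.0
            low = torch.fft.ifft2(
                torch.fft.ifftshift(freq * mask, dim=(-2, -1)), norm="ortho").real
            return low, x - low  # low-frequency structure, high-frequency detail

        def forward(self, vis: torch.Tensor, ir: torch.Tensor) -> torch.Tensor:
            vis_low, vis_high = self._split_bands(vis)
            ir_low, ir_high = self._split_bands(ir)
            g_low, g_high = torch.sigmoid(self.low_gate), torch.sigmoid(self.high_gate)
            # Blend each band across modalities, then project the recombined result.
            low = g_low * vis_low + (1 - g_low) * ir_low
            high = g_high * vis_high + (1 - g_high) * ir_high
            return self.proj(low + high)

    # Example: fuse same-resolution feature maps from the two backbone streams.
    fused = FrequencyAwareFusion(64)(torch.randn(2, 64, 40, 40), torch.randn(2, 64, 40, 40))
    print(fused.shape)  # torch.Size([2, 64, 40, 40])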

     
