
Illumination-adaptive granularity progressive multimodal image fusion method

  • Abstract: To address the challenge of multi-scene visual perception under complex and variable lighting conditions, this paper proposes an illumination-adaptive granularity progressive multimodal image fusion method. First, a large-model-based scene information embedding module is designed: a pretrained image encoder models the scene from the input visible-light image, and separate linear layers process the resulting scene vector. The processed scene vector then modulates the image features of the reconstruction stage along the channel dimension, so that the fusion model can generate fused images in different styles according to the scene illumination. Second, to address the limited representational capacity of existing feature extraction modules, a feature extraction module based on state-space equations is designed; it achieves global feature perception with linear complexity, reduces information loss during feature propagation, and improves the visual quality of the fused image. Finally, a granularity progressive fusion module is designed, in which state-space equations globally aggregate the multimodal features and a cross-modal coordinate attention mechanism fine-tunes the aggregated features, realizing multi-stage fusion of multimodal features from global to local and strengthening the network's ability to integrate information. During training, prior knowledge is used to generate enhanced images as labels, and homogeneous and heterogeneous loss functions are constructed for different environments to achieve scene-adaptive multimodal image fusion. Experimental results show that the proposed method outperforms 11 state-of-the-art algorithms on the dark-light datasets MSRS and LLVIP, the mixed-illumination dataset TNO, the continuous-scene dataset RoadScene, and the hazy-scene dataset M3FD, achieving better visual effects and higher quantitative metrics in both qualitative and quantitative comparisons. The proposed method shows considerable potential for tasks such as autonomous driving, military reconnaissance, and environmental surveillance.
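To make the channel-wise scene conditioning described above more concrete, the following is a minimal PyTorch sketch. The class name SceneEmbeddingModulation, the stand-in encoder, the feature dimensions, and the scale/shift formulation are illustrative assumptions rather than the paper's exact design.

```python
import torch
import torch.nn as nn


class SceneEmbeddingModulation(nn.Module):
    """Hypothetical sketch: a pretrained image encoder produces a scene
    vector from the visible image; separate linear layers map it to
    per-channel scale/shift factors that modulate reconstruction-stage
    features along the channel dimension."""

    def __init__(self, scene_dim=512, feat_channels=64):
        super().__init__()
        # Placeholder for a pretrained scene encoder; a small conv + pooling
        # stack stands in for it here.
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, scene_dim),
        )
        # Distinct linear heads process the scene vector into channel-wise
        # scale and shift terms.
        self.to_scale = nn.Linear(scene_dim, feat_channels)
        self.to_shift = nn.Linear(scene_dim, feat_channels)

    def forward(self, visible_img, recon_feat):
        # visible_img: (B, 3, H, W); recon_feat: (B, C, h, w)
        scene_vec = self.encoder(visible_img)                    # (B, scene_dim)
        scale = self.to_scale(scene_vec).unsqueeze(-1).unsqueeze(-1)
        shift = self.to_shift(scene_vec).unsqueeze(-1).unsqueeze(-1)
        # Channel-wise modulation of the reconstruction-stage features.
        return recon_feat * (1 + scale) + shift


if __name__ == "__main__":
    mod = SceneEmbeddingModulation()
    vis = torch.randn(2, 3, 128, 128)
    feat = torch.randn(2, 64, 64, 64)
    print(mod(vis, feat).shape)  # torch.Size([2, 64, 64, 64])
```

The (1 + scale) formulation keeps the modulation close to identity at initialization, a common choice for feature-wise conditioning of this kind.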

     

    Abstract: To address the challenges of multi-scene visual perception under complex and fluctuating lighting conditions, this study proposes a novel illumination-adaptive granularity progressive multimodal image fusion method. Visual perception in environments with varying lighting, such as urban areas at night or during harsh weather, poses significant challenges for traditional imaging systems. The proposed method combines scene-aware conditioning, state-space feature modeling, and progressive fusion to achieve robust image fusion that dynamically adapts to different scene characteristics. First, a large-model-based scene information embedding module is designed to capture scene context from the input visible-light image. This module leverages a pretrained image encoder to model the scene, generating scene vectors that are processed through separate linear layers. The processed scene vectors are then progressively embedded into the fusion image reconstruction network, giving the fusion model the ability to perceive scene information. This integration allows the fusion network to adjust its behavior according to the prevailing lighting conditions, resulting in more accurate image fusion. Second, to overcome the limitations of existing feature extraction methods, a feature extraction module based on state-space equations is proposed. This module enables global feature perception with linear computational complexity, minimizing the loss of critical information during transmission. By reducing information loss and preserving the clarity of the reconstructed images, it enhances the visual quality of the fused images and maintains visual fidelity even under challenging lighting, making it well suited for dynamic environments. Finally, a granularity progressive fusion module is introduced. This module first employs state-space equations to globally aggregate multimodal features and then applies a cross-modal coordinate attention mechanism to fine-tune the aggregated features. This design enables multi-stage fusion, from global to local granularity, enhancing the model's ability to integrate information across modalities and improving the coherence and detail of the output image, which facilitates better scene interpretation. During the training phase, prior knowledge is used to generate enhanced images as pseudo-labels, and homogeneous and heterogeneous loss functions are constructed for different environmental conditions, enabling scene-adaptive learning and allowing the fusion model to adjust to varying illumination. Extensive experiments on several benchmark datasets, including MSRS and LLVIP for dark-light scenarios, TNO for mixed lighting, RoadScene for continuous scenes, and M3FD for hazy conditions, show that the proposed method outperforms 11 state-of-the-art algorithms in both qualitative and quantitative evaluations, achieving superior visual effects and higher quantitative metrics across all test scenarios and demonstrating its robustness and versatility. Furthermore, the proposed approach also outperforms a two-stage method in terms of visual effects and quantitative metrics.
The proposed scene-adaptive fusion framework holds significant potential for applications in fields such as autonomous driving, military reconnaissance, and environmental surveillance, where reliable visual perception under complex lighting conditions is essential. These results highlight the method’s promise for real-world tasks involving dynamic lighting changes, setting a new benchmark in multimodal image fusion.
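As a rough illustration of the refinement stage in the granularity progressive fusion module, the sketch below implements one plausible form of cross-modal coordinate attention over globally aggregated infrared/visible features. The layer sizes, pooling choices, and the name CrossModalCoordinateAttention are assumptions; the paper's exact module may differ.

```python
import torch
import torch.nn as nn


class CrossModalCoordinateAttention(nn.Module):
    """Illustrative sketch of a coordinate-attention-style refinement applied
    to aggregated infrared/visible features. Layer sizes and the way the two
    modalities are combined are assumptions, not the paper's exact design."""

    def __init__(self, channels=64, reduction=8):
        super().__init__()
        mid = max(channels // reduction, 8)
        self.conv_reduce = nn.Sequential(
            nn.Conv2d(2 * channels, mid, kernel_size=1),
            nn.BatchNorm2d(mid), nn.ReLU(),
        )
        self.attn_h = nn.Conv2d(mid, channels, kernel_size=1)
        self.attn_w = nn.Conv2d(mid, channels, kernel_size=1)

    def forward(self, fused, ir_feat, vis_feat):
        # fused: globally aggregated features (B, C, H, W);
        # ir_feat / vis_feat: per-modality features of the same shape.
        cross = torch.cat([ir_feat, vis_feat], dim=1)              # (B, 2C, H, W)
        b, _, h, w = cross.shape
        # Direction-aware pooling: keep position along one axis while
        # averaging along the other.
        pooled_h = cross.mean(dim=3, keepdim=True)                 # (B, 2C, H, 1)
        pooled_w = cross.mean(dim=2, keepdim=True)                 # (B, 2C, 1, W)
        pooled = torch.cat([pooled_h, pooled_w.transpose(2, 3)], dim=2)
        shared = self.conv_reduce(pooled)                          # (B, mid, H+W, 1)
        feat_h, feat_w = torch.split(shared, [h, w], dim=2)
        attn_h = torch.sigmoid(self.attn_h(feat_h))                # (B, C, H, 1)
        attn_w = torch.sigmoid(self.attn_w(feat_w.transpose(2, 3)))  # (B, C, 1, W)
        # Fine-tune the aggregated features with coordinate-wise attention.
        return fused * attn_h * attn_w


if __name__ == "__main__":
    cma = CrossModalCoordinateAttention()
    f = torch.randn(1, 64, 32, 32)
    print(cma(f, f, f).shape)  # torch.Size([1, 64, 32, 32])
```

Pooling separately along height and width lets the attention retain positional information in one direction while aggregating along the other, which is the core idea of coordinate attention and complements the global aggregation performed by the state-space stage.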

     
