<th id="5nh9l"></th><strike id="5nh9l"></strike><th id="5nh9l"><noframes id="5nh9l"><th id="5nh9l"></th><strike id="5nh9l"></strike>
<progress id="5nh9l"><noframes id="5nh9l"><th id="5nh9l"><noframes id="5nh9l">
<th id="5nh9l"></th> <strike id="5nh9l"><noframes id="5nh9l"><span id="5nh9l"></span>
<progress id="5nh9l"><noframes id="5nh9l"><span id="5nh9l"><noframes id="5nh9l"><span id="5nh9l"></span><strike id="5nh9l"><noframes id="5nh9l"><strike id="5nh9l"></strike>
<span id="5nh9l"><noframes id="5nh9l">
<span id="5nh9l"><noframes id="5nh9l">
<span id="5nh9l"></span><span id="5nh9l"><video id="5nh9l"></video></span>
<th id="5nh9l"><noframes id="5nh9l"><th id="5nh9l"></th>
<progress id="5nh9l"><noframes id="5nh9l">

GLIHamba: global–local context image harmonization based on Mamba

  • Abstract: In recent years, deep learning models incorporating Transformer components have driven rapid progress in image editing tasks, including image harmonization. In contrast to convolutional neural networks (CNNs), which apply static local filters, Transformers use self-attention to perform adaptive non-local filtering that sensitively captures long-range context. However, existing CNN- and Transformer-based harmonization methods fail to balance local content consistency with overall style consistency, leaving the foreground visually inconsistent with the background. This paper proposes a novel network for image harmonization, global-local context image harmonization based on Mamba (GLIHamba), which introduces global and local features into the Mamba model to build a harmonization model with global-local context awareness. Specifically, we present a new learning-based image harmonization model, GLIHamba, whose core components are a local feature sequence extractor (LFSE) and a global feature sequence extractor (GFSE). The LFSE maintains the local consistency of adjacent features in high-dimensional image features, explicitly ensuring that spatially neighboring features remain consistent along the channel dimension, and thus that the harmonized result has complete, consistent local content. The GFSE, in turn, builds a global sequence over all spatial dimensions to preserve the overall style consistency of the image. Experimental results show that GLIHamba outperforms state-of-the-art CNN- and Transformer-based methods.

     

    Abstract: Image harmonization is a technique that ensures the consistency and coordination of appearance features, such as lighting and color, between the background and foreground of a composite image. It has emerged as a significant research area in image processing and, with the rapid development of image processing technologies in recent years, has gradually become a focal point of attention in both academia and industry. The primary challenge in this area is the development of harmonization methods that achieve both local content integrity and global style consistency. Traditional image harmonization methods rely primarily on matching low-level features, such as gradients and color histograms, to maintain good color coherence. However, these methods lack semantic awareness of the contextual relationship between the foreground and background, which leads to a lack of realism owing to inconsistencies between content and style. In recent years, harmonization methods based on deep learning have achieved significant progress. Pixel-wise matching methods utilize convolutional encoder-decoder models to learn the transformations from background to foreground pixel features. However, because of the limited receptive fields of convolutional neural networks (CNNs), these methods primarily use local regional features as references, which makes it difficult to incorporate the overall background information into the foreground. In contrast, region-based matching methods treat the foreground and background regions as two different styles or domains. Although these methods achieve global consistency in harmonization results, they often overlook the spatial differences between the two regions. Breakthroughs in state-space models (SSMs), particularly the Mamba model built on a selective state-space formulation, have brought significant advances. The Mamba model uses a selective scanning mechanism to capture global relationships with linear complexity and has demonstrated excellent performance in a range of computer vision tasks. However, the Mamba model cannot maintain spatial local dependencies between adjacent features and thus lacks local consistency. In this study, we draw inspiration from how CNNs and Transformer models operate and introduce global and local features into the Mamba model to establish an image harmonization model with global-local context awareness. Specifically, we propose a novel learning-based image harmonization model called GLIHamba (global-local context image harmonization based on Mamba). The core components of GLIHamba are a local feature sequence extractor (LFSE) and a global feature sequence extractor (GFSE). The LFSE preserves the locality of adjacent features in high-dimensional feature maps to explicitly ensure consistency among spatially neighboring features along the channel dimension, thereby guaranteeing the local content integrity and consistency of the harmonization results. In contrast, the GFSE compresses features across all spatial dimensions to maintain the overall style consistency of the image. Our experimental results demonstrate that the proposed GLIHamba model outperforms previous CNN- and Transformer-based methods in image harmonization tasks. On the iHarmony4 dataset, our model achieved a PSNR of 39.76 dB and exhibited excellent performance on real-scene data.
In summary, the proposed GLIHamba model provides a novel solution to the challenges of image harmonization by integrating global and local context awareness, and thus achieves superior performance compared with existing methods.
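
For intuition, the sketch below illustrates the kind of sequence construction the abstract attributes to the two extractors: GFSE flattens all spatial positions of a feature map into one long token sequence (global style context), while LFSE regroups the map into local windows so that spatially adjacent features stay contiguous in the scan order (local content consistency). This is a minimal sketch, not the authors' implementation: the function names gfse_sequence and lfse_sequence, the window size, and the use of torch.nn.functional.unfold are illustrative assumptions, and in the actual model the resulting sequences would presumably be consumed by Mamba (selective SSM) blocks.

```python
import torch
import torch.nn.functional as F


def gfse_sequence(x: torch.Tensor) -> torch.Tensor:
    """Global feature sequence (sketch): flatten every spatial position
    into one token sequence so a selective state-space (Mamba) block can
    model style relations across the whole image."""
    # x: (B, C, H, W) -> (B, H*W, C)
    return x.flatten(2).transpose(1, 2)


def lfse_sequence(x: torch.Tensor, window: int = 4) -> torch.Tensor:
    """Local feature sequence (sketch): regroup the feature map into
    non-overlapping windows so spatially adjacent features stay contiguous
    in the scan order, preserving local content consistency."""
    B, C, H, W = x.shape
    # (B, C*window*window, L) with L = (H // window) * (W // window) windows
    patches = F.unfold(x, kernel_size=window, stride=window)
    L = patches.shape[-1]
    patches = patches.view(B, C, window * window, L)
    # Order tokens window by window, then position by position inside each window.
    return patches.permute(0, 3, 2, 1).reshape(B, L * window * window, C)


if __name__ == "__main__":
    feat = torch.randn(2, 64, 32, 32)   # toy feature map (B, C, H, W)
    print(gfse_sequence(feat).shape)    # torch.Size([2, 1024, 64])
    print(lfse_sequence(feat).shape)    # torch.Size([2, 1024, 64])
```

Note that both orderings cover the same tokens; only the scan order differs, which is what lets one branch emphasize global style statistics and the other preserve neighborhood structure.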

     
