
Causal image-text retrieval embedded with consensus knowledge

  • Abstract: Cross-modal image-text retrieval is the task of retrieving data in one modality (e.g., images) given a query in another modality (e.g., text). Its key problem is how to accurately measure the similarity between the image and text modalities, which plays a crucial role in reducing the visual-semantic gap between the heterogeneous modalities of vision and language. Traditional retrieval paradigms rely on deep learning to extract feature representations of images and texts and map them into a common representation space for matching. However, such methods depend largely on superficial correlations in the data and cannot uncover the true causal relationships behind it, so they face challenges in representing high-level semantic information and in interpretability. To address this, causal inference and embedded consensus knowledge are introduced on top of deep learning, and a causal image-text retrieval method embedded with consensus knowledge is proposed. Specifically, causal intervention is introduced into the visual feature extraction module to learn commonsense causal visual features by replacing correlations with causal relations, and these features are concatenated with the original visual features to obtain the final visual representation. To remedy the insufficient textual representation of this method, the stronger text feature extraction model BERT (Bidirectional encoder representations from transformers) is adopted, and consensus knowledge shared between the two modalities is embedded to perform consensus-level representation learning of the image-text features. Experiments on the MS-COCO dataset and cross-dataset experiments from MS-COCO to Flickr30k demonstrate that the proposed method achieves consistent improvements in recall and mean recall on bidirectional image-text retrieval.
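The visual branch described above (original region-level visual features passed through a causal-intervention module, with the resulting causal features concatenated back onto the original features) can be sketched roughly as follows. This is a minimal, illustrative PyTorch sketch rather than the authors' implementation: the confounder dictionary, its size, and the attention-based approximation of a backdoor-style intervention are assumptions made for the example.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalVisualEncoder(nn.Module):
    """Illustrative sketch: fuse region-level visual features with 'causal'
    features produced by an approximate backdoor adjustment.
    The confounder dictionary and attention weighting are assumptions."""

    def __init__(self, feat_dim=2048, embed_dim=1024, num_confounders=100):
        super().__init__()
        # Confounder dictionary Z (e.g., class prototypes); randomly initialized here.
        self.confounders = nn.Parameter(torch.randn(num_confounders, feat_dim))
        self.query = nn.Linear(feat_dim, feat_dim)
        # Project the concatenated (original + causal) features into the joint space.
        self.fc = nn.Linear(feat_dim * 2, embed_dim)

    def forward(self, regions):
        # regions: (batch, num_regions, feat_dim) region-level visual features
        q = self.query(regions)                                        # (B, R, D)
        # Attention over confounders approximates the expectation over Z
        # in a backdoor-style intervention (assumed design, for illustration).
        attn = torch.softmax(q @ self.confounders.t() / q.size(-1) ** 0.5, dim=-1)
        causal = attn @ self.confounders                                # (B, R, D)
        fused = torch.cat([regions, causal], dim=-1)                    # original + causal
        return F.normalize(self.fc(fused), dim=-1)                      # (B, R, embed_dim)
```

For instance, `CausalVisualEncoder()(torch.randn(8, 36, 2048))` would return joint-space embeddings for 36 regions per image; in practice the region features would come from a pretrained detector and the text side from BERT, as described in the abstract.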

     

    Abstract: Cross-modal image-text retrieval involves retrieving relevant images or texts given a query from the opposite modality. Its primary challenge lies in precisely quantifying the similarity used for feature matching between the two distinct modalities, which plays an important role in mitigating the visual-semantic disparities between the heterogeneous visual and linguistic domains. The task has extensive applications in areas such as e-commerce product search and medical image retrieval. Traditional retrieval paradigms depend on deep learning techniques to extract feature representations from images and texts: cross-modal image-text retrieval learns semantic feature representations of the two modalities by harnessing the formidable feature-extraction ability of deep networks and subsequently maps them into a shared semantic space for semantic alignment. However, this approach relies primarily on superficial data correlations and lacks the capacity to reveal the latent causal relationships underpinning the data. Moreover, owing to the inherent "black-box" nature of deep learning, the predictions of such models are often difficult for humans to interpret, and an undue reliance on the training data distribution impairs generalization. Consequently, existing methods face the challenge of representing high-level semantic information while maintaining interpretability. Causal inference, which seeks to ascertain the causal effect of specific phenomena by isolating confounding factors through intervention, offers a novel avenue for enhancing the generalization capability and interpretability of deep models, and researchers have recently begun to combine vision-and-language tasks with the principles of causal inference. Accordingly, this work introduces causal inference and embedded consensus knowledge on the foundation of deep learning and proposes a novel causal image-text retrieval method with embedded consensus knowledge. Specifically, causal intervention is introduced into the visual feature extraction module, replacing correlations with causal relationships to learn commonsense causal visual features. These features are then fused with the original visual features acquired through bottom-up attention, yielding the final visual feature representation. To address the shortfall in textual feature representation, the study adopts the potent text encoder BERT (bidirectional encoder representations from transformers). Consensus knowledge shared between the two modalities is further embedded, enabling consensus-level representation learning of image-text features. Empirical validation on the MS-COCO dataset and cross-dataset experiments from MS-COCO to Flickr30k substantiate the capacity of the proposed method to consistently improve recall and mean recall in bidirectional image-text retrieval tasks. In summary, this approach bridges the gap between visual and textual representations by combining causal inference principles with shared consensus knowledge within a deep learning framework, thereby promising improved generalization and interpretability.
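Since the reported metrics are Recall@K in both retrieval directions together with their mean, a small self-contained evaluation sketch is given below. It assumes L2-normalized global image and caption embeddings, the usual MS-COCO layout of five captions per image grouped in order, and the common convention that mean recall is the average of Recall@{1, 5, 10} over both directions; these conventions are not spelled out in the abstract itself.

```python
import numpy as np

def recall_at_k(sim, gt, ks=(1, 5, 10)):
    """sim: (num_queries, num_targets) similarity matrix.
    gt[i]: set of correct target indices for query i."""
    ranks = np.argsort(-sim, axis=1)                      # best match first
    out = {}
    for k in ks:
        hits = [len(set(ranks[i, :k]) & gt[i]) > 0 for i in range(sim.shape[0])]
        out[k] = 100.0 * np.mean(hits)
    return out

def bidirectional_eval(img_emb, txt_emb, caps_per_img=5):
    """img_emb: (N, d), txt_emb: (N * caps_per_img, d), both L2-normalized."""
    sim_i2t = img_emb @ txt_emb.T                         # image -> text similarities
    # Ground truth: captions of image i occupy a contiguous block of indices.
    gt_i2t = [set(range(i * caps_per_img, (i + 1) * caps_per_img))
              for i in range(img_emb.shape[0])]
    gt_t2i = [{j // caps_per_img} for j in range(txt_emb.shape[0])]
    r_i2t = recall_at_k(sim_i2t, gt_i2t)                  # image-to-text retrieval
    r_t2i = recall_at_k(sim_i2t.T, gt_t2i)                # text-to-image retrieval
    mean_recall = np.mean(list(r_i2t.values()) + list(r_t2i.values()))
    return r_i2t, r_t2i, mean_recall
```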

     
