Chinese question-matching approach based on gradual machine learning

賀學劍; 陳安琪; 郭志強; 王致茹; 陳群

doi:10.13374/j.issn2095-9389.2023.11.05.002

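Abstract (translated from the Chinese): Question matching aims to determine whether the intents of different questions are similar. In recent years, with the development of large pretrained language models, using them to mine the matching information implicit in question pairs at the semantic level has achieved the best performance to date. However, because these deep learning models rely on the independent and identically distributed (i.i.d.) assumption, their performance in real-world scenarios remains constrained by the sufficiency of the training data and by the distribution drift between the target data and the training data. This paper proposes a Chinese question-matching method based on gradual machine learning (GML). Built on the GML framework, the method extracts question features from different perspectives, constructs a factor graph that fuses the various kinds of feature information, and then performs gradual learning from easy to hard through iterative factor inference. For feature modeling, two types of features are designed and extracted: (1) keyword features based on TF-IDF (term frequency-inverse document frequency); (2) deep semantic features based on a DNN (deep neural network). Finally, the effectiveness of the proposed method is verified on the general-purpose Chinese benchmark datasets LCQMC and BQ corpus. Experiments show that, compared with a pure deep learning model, the GML-based method effectively improves question-matching accuracy, and its performance advantage grows as the amount of labeled training data decreases.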

Abstract: Question matching aims to determine whether the intentions of two different questions are similar. Recently, with the development of large-scale pretrained deep neural network (DNN) language models, state-of-the-art question-matching performance has been achieved. However, because these DNN models rely on the independent and identically distributed (i.i.d.) assumption, their performance in real-world scenarios is limited by the adequacy of the training data and the distribution drift between the target and training data. In this study, we propose a novel gradual machine learning (GML)-based approach for Chinese question matching. Beginning with initially labeled instances, this approach gradually labels target instances in order of increasing hardness via iterative factor inference on a factor graph. The proposed solution first extracts diverse semantic features from different perspectives and then constructs a factor graph by fusing the extracted features to facilitate gradual learning from easy to hard. In feature modeling, we extract and model two complementary types of features: 1) term frequency-inverse document frequency (TF-IDF)-based keyword features, which capture the shallow semantic similarity between two questions; 2) DNN-based deep semantic features, which capture the latent semantic similarity between two questions. We model keyword features as unary factors in the factor graph, which define their influence on the matching status of the two questions. The DNN-based features comprise global and local features: the global features correspond to a question pair’s matching probability as estimated by a DNN model, and the local features correspond to the semantic similarity between two neighboring question pairs, estimated from their vector representations in the DNN’s embedding space. To facilitate gradual inference, we model the DNN-based global and local features as unary and binary factors, respectively, in the factor graph.
Finally, we implement a GML solution for question matching based on an open-source GML inference engine. We validate the efficacy of the proposed approach through a comparative study on two open-source Chinese benchmark datasets, LCQMC and the BQ corpus. Extensive experiments demonstrate that, compared with pure deep learning models, the proposed solution effectively improves the accuracy of question matching, and its performance advantage generally increases as the amount of labeled training data decreases. Our experiments also demonstrate that the performance of the proposed solution is robust with respect to key algorithmic parameters, indicating its applicability in real-world scenarios. In addition, our work on the GML solution is orthogonal to existing deep learning-based question-matching algorithms because our solution can easily accommodate and leverage other deep language models.
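The easy-to-hard labeling idea described in the abstract can be illustrated with a small sketch. This is not the paper's factor-graph inference and not the open-source GML engine's API: it is a simplified greedy stand-in in which a unary score (for example, a DNN's matching probability) is combined with evidence from already-labeled neighbors, and the most confident (lowest-entropy) instance is labeled in each round. All names (`gradual_label`, `unary_prob`, `neighbors`, `pair_weight`) and the toy numbers are hypothetical.

```python
import math


def entropy(p):
    """Binary entropy in bits; low entropy means a confident (easy) instance."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def logit(p, eps=1e-6):
    """Log-odds of p, clipped away from 0 and 1 for numerical safety."""
    p = min(max(p, eps), 1 - eps)
    return math.log(p / (1 - p))


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gradual_label(unary_prob, neighbors, seed_labels, pair_weight=1.0):
    """Greedy easy-to-hard labeling (a simplified stand-in, by assumption).

    unary_prob:  id -> prior matching probability (unary evidence)
    neighbors:   id -> list of (other_id, similarity in [0, 1]) (binary evidence)
    seed_labels: id -> 0/1 for the initially labeled instances
    Returns (labels, order), where order is the sequence in which
    instances were labeled.
    """
    labels = dict(seed_labels)
    order = []
    unlabeled = set(unary_prob) - set(labels)
    while unlabeled:
        scored = {}
        for i in unlabeled:
            score = logit(unary_prob[i])
            # Only already-labeled neighbors contribute evidence.
            for j, sim in neighbors.get(i, []):
                if j in labels:
                    score += pair_weight * sim * (1 if labels[j] == 1 else -1)
            scored[i] = sigmoid(score)
        # Label the easiest instance first; sorted() makes ties deterministic.
        easiest = min(sorted(unlabeled), key=lambda i: entropy(scored[i]))
        labels[easiest] = 1 if scored[easiest] >= 0.5 else 0
        order.append(easiest)
        unlabeled.remove(easiest)
    return labels, order


# Toy data: "s" is a labeled seed; "b" is ambiguous on its own (0.55)
# and is resolved only after its neighbor "a" has been labeled.
unary_prob = {"s": 0.90, "a": 0.95, "b": 0.55, "c": 0.10}
neighbors = {"a": [("s", 0.8)], "b": [("a", 0.9)], "c": []}
labels, order = gradual_label(unary_prob, neighbors, seed_labels={"s": 1})
print(order, labels)
```

In this toy run the ambiguous pair `b` is labeled last, after its neighbor `a` has supplied additional evidence, which is the intuition behind labeling in order of increasing hardness.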

Question-matching approach based on gradual machine learning