Named entity recognition of Chinese electronic medical records based on multifeature embedding and attention mechanism

鞏敦衛; 張永凱; 郭一楠; 王斌; 樊寬魯; 火焱

doi:10.13374/j.issn2095-9389.2021.01.12.006

Abstract

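Summary: The text of Chinese electronic medical records contains many nested entities, has complex sentence syntax, and tends toward short sentences. To recognize its medical entities effectively, a named entity recognition algorithm that fuses multifeature embedding with an attention mechanism is proposed. The algorithm fuses features at three granularities (character, word, and glyph) in the input representation layer and introduces an attention mechanism into the hidden layer of a bidirectional long short-term memory network, so that it pays more attention to characters related to medical entities when capturing features. It finally produces the optimal labeling of five entity types in Chinese electronic medical records: disease, body part, symptom, drug, and operation. In experiments on an open-source dataset and a self-built diabetes dataset, the entity recognition accuracy, recall, and F1 value of the proposed algorithm all exceed 97%, indicating that it recognizes the various entities in Chinese electronic medical records more effectively.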

Abstract: Medical records, an essential part of residents' health care records, hold all the information about patients' clinical treatment and were traditionally written by doctors on paper. With the development of information technology, electronic medical records, which are easier to store and manage, have gradually replaced paper ones. Intelligent auxiliary diagnosis, patient portrait construction, and disease prediction based on medical reports have become research hotspots in intelligent medical care. An efficient named entity recognition algorithm is the key to fully uncovering the hidden relationships between symptoms and diseases in the documents stored in electronic medical records. Although several studies have addressed named entity recognition, relatively little research covers information extraction from Chinese electronic medical records. To the best of our knowledge, the documents in Chinese electronic medical records contain a large number of nested named entities and short sentences. Moreover, the logic among sentences is weak, which makes the syntactic structure complex. To recognize the medical entities effectively, a novel named entity recognition method based on multifeature embedding and an attention mechanism was proposed. After three types of features derived from characters, words, and glyphs were embedded in the input representation layer, an attention mechanism was introduced into the hidden layer of the bidirectional long short-term memory network so that the model focuses on the characters related to medical entities. Finally, the optimal labels were obtained for the five types of entities in Chinese electronic medical records: diseases, body parts, symptoms, drugs, and operations. In experiments on open and self-built Chinese electronic medical record datasets, the recognition accuracy, recall, and F1 value of the proposed algorithm all exceed 97%, which shows that the proposed algorithm can effectively identify the various entities in Chinese electronic medical records.
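
The two ideas named in the abstract, fusing per-character features and attending over hidden states, can be illustrated with a minimal pure-Python sketch. The abstract does not state how the three feature types are combined or which attention form is used, so concatenation and scaled dot-product self-attention are assumptions here, and the function names `fuse_features` and `attend` are hypothetical. The bidirectional LSTM that would normally sit between the two steps is omitted; the toy vectors simply stand in for its hidden states.

```python
import math


def fuse_features(char_vec, word_vec, glyph_vec):
    """Concatenate character, word, and glyph vectors (assumed fusion operator)."""
    return list(char_vec) + list(word_vec) + list(glyph_vec)


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def attend(hidden_states):
    """Scaled dot-product self-attention (assumed form) over a sequence.

    Each position's output is a weighted average of all hidden states,
    with weights given by its similarity to every position.
    """
    dim = len(hidden_states[0])
    contexts = []
    for query in hidden_states:
        weights = softmax([dot(query, key) / math.sqrt(dim) for key in hidden_states])
        context = [
            sum(w * state[i] for w, state in zip(weights, hidden_states))
            for i in range(dim)
        ]
        contexts.append(context)
    return contexts


# Toy usage: three characters with tiny, made-up feature vectors.
inputs = [
    fuse_features([0.1, 0.2], [0.3], [0.4]),
    fuse_features([0.5, 0.1], [0.0], [0.2]),
    fuse_features([0.3, 0.3], [0.1], [0.9]),
]
contexts = attend(inputs)  # one context vector per character
```

In a full model, the context vectors would feed a labeling layer that assigns one of the five entity types to each character; that layer is outside the scope of this sketch.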
