Clinical named entity recognition from Chinese electronic medical records using a double-layer annotation model combining a domain dictionary with CRF
-
Abstract: Electronic medical records (EMRs) are written by medical professionals and hold a large amount of important clinical information. How to exploit that information has become a major research direction, and clinical named entity recognition is a basic task in extracting it. Chinese EMRs are knowledge-intensive, and their data has considerable research value. However, the text contains many composite entities, entities tend to be long, sentence components are frequently omitted, and the boundaries of clinical entities are often unclear. Annotated corpora are also hard to obtain, because labeling such technical text requires a great deal of manpower. Recognizing clinical named entities in Chinese text is therefore a hard problem. To address these characteristics, this paper proposes a double-layer annotation model (DLAM) that combines a domain dictionary with a conditional random field (CRF). A medical domain dictionary is constructed through statistical analysis of external resources and is then combined with the CRF to label the text twice, at two different granularities. The manually constructed dictionary recognizes registered (in-dictionary) words with very high accuracy, while machine learning can automatically recognize unregistered words; the model integrates these two strengths. With the proposed method, four types of clinical entities (diseases, symptoms, drugs, and operations) are recognized from Chinese EMRs.
On the test dataset, the model achieves a macro-precision (Macro-P) of 96.7%, a macro-recall (Macro-R) of 97.7%, and a macro-F1 of 97.2%, a large improvement over a single-layer model. A deep neural network with an attention mechanism was also evaluated for comparison; it performed poorly on this test set because of the limited size of the domain dataset. The experimental results demonstrate the effectiveness of the double-layer annotation model for named entity recognition in Chinese electronic medical records.
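The abstract does not spell out how the dictionary and the CRF interact, so the following is only a minimal sketch of one plausible reading: a first pass tags characters by forward maximum matching against the dictionary, and those tags are then exposed as per-character features that a second-pass CRF could consume. The toy dictionary, the label names, the function names, and the feature layout are all assumptions made for illustration, not the authors' implementation.

```python
from typing import Dict, List

# Hypothetical toy dictionary: surface form -> entity type (not from the paper).
DOMAIN_DICT: Dict[str, str] = {
    "高血壓": "DISEASE",
    "頭痛": "SYMPTOM",
    "阿司匹林": "DRUG",
}


def dictionary_tags(text: str, lexicon: Dict[str, str]) -> List[str]:
    """First pass: forward maximum matching, emitting character-level BIO tags."""
    max_len = max((len(word) for word in lexicon), default=0)
    tags = ["O"] * len(text)
    i = 0
    while i < len(text):
        match_len = 0
        # Try the longest candidate first so longer entries win over their prefixes.
        for n in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + n] in lexicon:
                match_len = n
                break
        if match_len:
            label = lexicon[text[i:i + match_len]]
            tags[i] = "B-" + label
            for j in range(i + 1, i + match_len):
                tags[j] = "I-" + label
            i += match_len
        else:
            i += 1
    return tags


def char_features(text: str, dict_tags: List[str]) -> List[Dict[str, str]]:
    """Second-pass input: per-character feature dicts that include the dictionary tag."""
    feats = []
    for i, ch in enumerate(text):
        feats.append({
            "char": ch,
            "prev_char": text[i - 1] if i > 0 else "<BOS>",
            "next_char": text[i + 1] if i < len(text) - 1 else "<EOS>",
            "dict_tag": dict_tags[i],
        })
    return feats


sentence = "患者頭痛服用阿司匹林"
tags = dictionary_tags(sentence, DOMAIN_DICT)
features = char_features(sentence, tags)
```

In this sketch the dictionary pass handles registered words deterministically, while a CRF trained on the resulting feature dicts would be left to recover unregistered words and correct boundaries from context.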
-
Table 1. Distribution of entities among the training set and the test set

| Dataset | Diseases | Symptoms | Drugs | Operations | Total |
|---|---|---|---|---|---|
| Training set | 701 | 2648 | 546 | 2138 | 6033 |
| Test set | 273 | 1043 | 208 | 918 | 2442 |
Table 2. Composition of the domain dictionary

| Type | Diseases | Symptoms | Operations | Drugs | Keywords | Organs | Location | Privative |
|---|---|---|---|---|---|---|---|---|
| Amount | 1212 | 934 | 611 | 777 | 30 | 351 | 16 | 12 |
Table 3. Comparison experiment results of CRF (%)

| Model | Macro-P | Macro-R | Macro-F1 |
|---|---|---|---|
| Baseline (single-layer CRF) | 83.3 | 68.1 | 68.1 |
| DLAM | 96.7 | 97.7 | 97.2 |
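The results are reported as macro-averaged precision, recall, and F1 over the four entity types. The sketch below shows one common way to compute such scores from per-type counts. The source does not say whether Macro-F1 is the mean of per-type F1 values or the harmonic mean of Macro-P and Macro-R; the code assumes the former, and the function name and the counts in the usage example are hypothetical.

```python
from typing import Dict, Tuple


def macro_scores(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """Return (macro-P, macro-R, macro-F1).

    counts maps each entity type to (true positives, false positives, false negatives).
    Assumption: macro-F1 is the unweighted mean of the per-type F1 scores.
    """
    if not counts:
        raise ValueError("counts must contain at least one entity type")
    precisions, recalls, f1s = [], [], []
    for tp, fp, fn in counts.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f)
    n = len(counts)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical counts for two entity types, for illustration only.
example = {"disease": (8, 2, 2), "drug": (5, 5, 0)}
macro_p, macro_r, macro_f1 = macro_scores(example)
```

Because every entity type contributes equally to the average, a rare type such as drugs weighs as much as a frequent one such as symptoms, which is why macro scores can differ noticeably from micro-averaged ones on imbalanced data.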
Table 4. Comparison experiment results of BiLSTM-Attention-CRF (%)

| Character embedding | Macro-P | Macro-R | Macro-F1 |
|---|---|---|---|
| Randomly initialized embedding | 69.52 | 69.70 | 69.38 |
| 50-dimension embedding | 53.42 | 54.31 | 53.74 |
| 150-dimension embedding | 73.43 | 77.85 | 75.54 |
| 300-dimension embedding | 55.36 | 61.03 | 57.88 |
Table 5. Comparison of DLAM and existing model results