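Clinical named entity recognition from Chinese electronic medical records using a double-layer annotation model combining a domain dictionary with CRF

doi: 10.13374/j.issn2095-9389.2019.09.04.004

GONG Le-jun^{1, 2}, ZHANG Zhi-fei^{1, 2}

1. School of Computer Science, School of Software, and School of Cyberspace Security, Nanjing University of Posts and Telecommunications, Nanjing 210023, China
2. Jiangsu Key Lab of Big Data Security & Intelligent Processing, Nanjing 210023, China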

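Funding: National Natural Science Foundation of China (61502243, 61502247, 61572263); Zhejiang Engineering Research Center of Intelligent Medicine (2016E10011); China Postdoctoral Science Foundation (2018M632349); Natural Science Foundation of the Jiangsu Higher Education Institutions of China (16KJB520003)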

Corresponding author: E-mail: glj98226@163.com

CLC number: TP391.1
Publication history
- Received: 2019-09-04
- Published: 2020-04-01


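Abstract

Abstract: Medical entity recognition is a basic task in information extraction from electronic medical record text. Chinese electronic medical record text contains many compound entities, long entities, sentences with heavily missing components, and unclear entity boundaries, and annotated corpora are hard to obtain. To address these characteristics, a double-layer annotation model based on a domain dictionary and a conditional random field (CRF) is proposed. The model builds a medical domain dictionary through statistical analysis of external resources and then combines it with a CRF to annotate the text twice at different granularities, uniting the accuracy of dictionary-based recognition with the automation of machine learning. It recognizes four types of medical entities in Chinese electronic medical record text: diseases, symptoms, drugs, and operations. On the test data the model achieves a macro-precision of 96.7%, a macro-recall of 97.7%, and a macro-F1 of 97.2%. The recognition performance of a deep neural network with an attention mechanism was also compared and analyzed; limited by the size of the domain dataset, it performed poorly on this test set. The experimental results demonstrate the efficiency of the double-layer annotation model for Chinese medical entity recognition.
- Chinese electronic medical records
- medical entity recognition
- domain dictionary
- conditional random field
- attention mechanism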
Abstract: Electronic medical records, written by professional medical personnel, are a large and important clinical resource, and how to exploit the information latent in them has become a major research direction. Chinese electronic medical records are knowledge-intensive, and their data has considerable research value. However, because of the characteristics of the Chinese language, they contain many compound entities, and those entities tend to be long. Sentence components are often missing from the text, and the boundaries of clinical entities are often unclear. Annotating a corpus also requires a great deal of manual effort because of the specialized language involved. Recognizing Chinese clinical named entities is therefore a hard problem. Considering these characteristics, this paper proposes a double-layer annotation model that combines a domain dictionary with a conditional random field (CRF). A medical domain dictionary was constructed by statistical analysis and combined with the CRF to perform two labeling passes at different granularities. The manually constructed dictionary recognizes registered (in-dictionary) words with very high accuracy, while machine learning can automatically recognize unregistered words; the model integrates these two strengths. With the proposed method, diseases, symptoms, drugs, and operations can be recognized in Chinese electronic medical records. On the test dataset, the model achieved a Macro-P of 96.7%, a Macro-R of 97.7%, and a Macro-F1 of 97.2%, a large improvement over a single-layer model. The recognition performance of a deep neural network with attention was also analyzed; it did not perform well because of the limited size of the domain dataset. The experimental results show the efficiency of the double-layer annotation model for named entity recognition in Chinese electronic medical records.
- Chinese electronic medical records
- clinical named entity recognition
- medical domain dictionary
- conditional random field
- attention
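The two-pass idea in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the source does not say how dictionary matching is performed, so forward maximum matching is used as a stand-in; the BIO tag scheme, the toy dictionary entries, and the function names `dictionary_layer` and `crf_features` are hypothetical. The second function only builds per-character features that a CRF could consume; no CRF is trained.

```python
from typing import Dict, List

# Toy domain dictionary (hypothetical entries; the paper's dictionary is built
# from external resources and is not reproduced here).
DOMAIN_DICT: Dict[str, str] = {
    "高血压": "DISEASE",
    "头痛": "SYMPTOM",
    "阿司匹林": "DRUG",
    "CT检查": "OPERATION",
}
MAX_LEN = max(len(term) for term in DOMAIN_DICT)


def dictionary_layer(text: str) -> List[str]:
    """First pass (assumed): forward maximum matching, one BIO tag per character."""
    tags = ["O"] * len(text)
    i = 0
    while i < len(text):
        matched = False
        # Try the longest candidate first so long entities win over substrings.
        for length in range(min(MAX_LEN, len(text) - i), 0, -1):
            label = DOMAIN_DICT.get(text[i:i + length])
            if label is not None:
                tags[i] = "B-" + label
                for j in range(i + 1, i + length):
                    tags[j] = "I-" + label
                i += length
                matched = True
                break
        if not matched:
            i += 1
    return tags


def crf_features(text: str, dict_tags: List[str]) -> List[Dict[str, str]]:
    """Second pass input (assumed): character features plus the first-layer tag."""
    feats = []
    for i, ch in enumerate(text):
        feats.append({
            "char": ch,
            "prev_char": text[i - 1] if i > 0 else "<BOS>",
            "next_char": text[i + 1] if i + 1 < len(text) else "<EOS>",
            "dict_tag": dict_tags[i],  # coarse label from the dictionary layer
        })
    return feats


sentence = "患者头痛，服用阿司匹林"
layer1 = dictionary_layer(sentence)
features = crf_features(sentence, layer1)
```

Passing the dictionary tag as a feature lets a sequence model keep in-dictionary matches while still labeling words the dictionary misses, which is the division of labor the abstract describes.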


Figure 1. Double-layer annotation model based on a domain dictionary and CRF

Figure 2. Entity-level precision comparison of DLAM and BiLSTM-Attention-CRF

Figure 3. Entity-level recall comparison of DLAM and BiLSTM-Attention-CRF

Table 1. Distribution of entities in the training set and the test set

Dataset	Diseases	Symptoms	Drugs	Operations	Total
Training set	701	2648	546	2138	6033
Test set	273	1043	208	918	2442

Table 2. Composition of the domain dictionary

Type	Diseases	Symptoms	Operations	Drugs	Keywords	Organs	Location	Privative
Amount	1212	934	611	777	30	351	16	12

Table 3. Results of the CRF comparison experiment (%)

Model	Macro-P	Macro-R	Macro-F1
Baseline (Single-layer CRF)	83.3	68.1	68.1
DLAM	96.7	97.7	97.2
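The tables report Macro-P, Macro-R, and Macro-F1 without stating the formulas. A minimal sketch of the standard macro-averaged definitions follows, assuming Macro-F1 is the unweighted mean of per-type F1 scores; the source does not confirm which variant was used. The function name `macro_scores` and the per-type counts are hypothetical and are not taken from the paper.

```python
from typing import Dict, Tuple


def macro_scores(counts: Dict[str, Tuple[int, int, int]]) -> Tuple[float, float, float]:
    """counts maps entity type -> (true positives, false positives, false negatives)."""
    if not counts:
        raise ValueError("at least one entity type is required")
    precisions, recalls, f1s = [], [], []
    for tp, fp, fn in counts.values():
        p = tp / (tp + fp) if tp + fp else 0.0
        r = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * p * r / (p + r) if p + r else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)
    n = len(counts)
    # Macro averaging: every entity type contributes equally, whatever its frequency.
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n


# Hypothetical counts for two entity types (illustration only).
toy_counts = {"A": (8, 2, 2), "B": (5, 5, 0)}
macro_p, macro_r, macro_f1 = macro_scores(toy_counts)

# Harmonic mean of the macro-averaged P and R, a different quantity from macro_f1.
harmonic = 2 * macro_p * macro_r / (macro_p + macro_r)
```

Under this definition Macro-F1 need not equal the harmonic mean of Macro-P and Macro-R, so a reported Macro-F1 that differs from that harmonic mean is not by itself a sign of an error.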

Table 4. Results of the BiLSTM-Attention-CRF comparison experiment (%)

Character embedding	Macro-P	Macro-R	Macro-F1
Randomly initialized embedding	69.52	69.70	69.38
50-dimensional embedding	53.42	54.31	53.74
150-dimensional embedding	73.43	77.85	75.54
300-dimensional embedding	55.36	61.03	57.88

Table 5. Comparison of DLAM with existing models (%)

Model	Macro-P	Macro-R	Macro-F1
CRF_multi-features^[27]	92.03	87.09	89.49
BiLSTM-CRF^[27]	91.12	89.74	90.43
DLAM	96.70	97.70	97.20


References (27)

[1]	Zhang L B. Word Segmentation and Named Entity Mining Based on Semi Supervised Learning for Chinese EMR[Dissertation]. Harbin: Harbin Institute of Technology, 2014
[2]	Huang Z H, Xu W, Yu K. Bidirectional LSTM-CRF Models for Sequence Tagging[J/OL]. arXiv preprint. (2015-08-09) [2019-09-04]. https://arxiv.org/abs/1508.01991
[3]	Wang Y Q, Yu Z H, Chen L, et al. Supervised methods for symptom name recognition in free-text clinical records of traditional Chinese medicine: an empirical study. J Biomed Inf, 2014, 47: 91 doi: 10.1016/j.jbi.2013.09.008
[4]	Xu Y, Wang Y N, Liu T R, et al. Joint segmentation and named entity recognition using dual decomposition in Chinese discharge summaries. J Am Med Inf Assoc, 2014, 21(e1): e84 doi: 10.1136/amiajnl-2013-001806
[5]	Lei J B, Tang B Z, Lu X Q, et al. A comprehensive study of named entity recognition in Chinese clinical text. J Am Med Inf Assoc, 2014, 21(5): 808 doi: 10.1136/amiajnl-2013-002381
[6]	Xu Y, Ge Y Q, Wang Q, et al. Medical name entity recognition and application in Chinese admission record of stroke patients based on CRF and RUTA rule. J Sun Yat-sen Univ Med Sci, 2018, 39(3): 455
[7]	Zhang X W, Li Z. Chinese electronic medical record named entity recognition based on multi-feature fusion. Softw Guide, 2017, 16(2): 128
[8]	Yu L, Jin L Z, Wang M F, et al. Recognition of human hypoxic state based on deep learning. Chin J Eng, 2019, 41(6): 817
[9]	Xia Y B, Zheng J L, Zhao Y F, et al. Deep learning based named entity recognition of electronic medical record. Electron Sci Technol, 2018, 31(11): 31
[10]	Li F, Zhang M S, Tian B, et al. Recognizing irregular entities in biomedical text via deep neural networks. Pattern Recognit Lett, 2018, 105: 105 doi: 10.1016/j.patrec.2017.06.009
[11]	Liu Z J, Yang M, Wang X L, et al. Entity recognition from clinical texts via recurrent neural networks. BMC Med Inf Decis Making, 2017, 17(Suppl 2): 67
[12]	Chowdhury S, Dong X S, Qian L J, et al. A multitask bi-directional RNN model for named entity recognition on Chinese electronic medical records. BMC Bioinf, 2018, 19(Suppl 17): 499
[13]	Shen Z. Named Entity Recognition for Chinese Electronic Record with Neural Network[Dissertation]. Beijing: Beijing University of Posts and Telecommunications, 2018
[14]	Wei Q K, Chen T, Xu R F, et al. Disease named entity recognition by combining conditional random fields and bidirectional recurrent neural networks. Database, 2016, 2016: baw140 doi: 10.1093/database/baw140
[15]	Wu Y H, Yang X, Bian J, et al. Combine factual medical knowledge and distributed word representation to improve clinical named entity recognition. AMIA Annu Symp Proc, 2018, 2018: 1110
[16]	Jagannatha A N, Yu H. Bidirectional RNN for medical event detection in electronic health records // Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. California, 2016: 473
[17]	Rajkomar A, Oren E, Chen K, et al. Scalable and accurate deep learning with electronic health records[J/OL]. arXiv preprint. (2018-05-11) [2019-09-04]. https://arxiv.org/abs/1801.07860
[18]	Wang Y, Wang L, Rastegar-Mojarad M, et al. Clinical information extraction applications: a literature review. J Biomed Inf, 2018, 77: 34 doi: 10.1016/j.jbi.2017.11.011
[19]	Gligic L, Kormilitzin A, Goldberg P, et al. Named entity recognition in electronic health records using transfer learning bootstrapped neural networks[J/OL]. arXiv preprint. (2019-07-29) [2019-09-04]. https://arxiv.org/abs/1901.01592
[20]	Li W, Zhao D Z, Li B, et al. Combining CRF and rule based medical named entity recognition. Appl Res Comput, 2015, 32(4): 1082 doi: 10.3969/j.issn.1001-3695.2015.04.029
[21]	Shi C Y, Xu Z J, Yang X J. Study of TFIDF algorithm. J Comput Appl, 2009, 29(Suppl 1): 167
[22]	Li H. Statistical Learning Methods. Beijing: Tsinghua University Press, 2012
[23]	Yang J F, Guan Y, He B, et al. Corpus construction for named entities and entity relations on Chinese electronic medical records. J Softw, 2016, 27(11): 2725
[24]	Uzuner O, South B R, Shen S Y, et al. 2010 i2b2/VA challenge on concepts, assertions, and relations in clinical text. J Am Med Inf Assoc, 2011, 18(5): 552 doi: 10.1136/amiajnl-2011-000203
[25]	Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[J/OL]. arXiv preprint. (2017-12-06) [2019-09-04]. https://arxiv.org/abs/1706.03762
[26]	Luo L, Yang Z, Yang P, et al. An attention-based BiLSTM-CRF approach to document level chemical named entity recognition. Bioinformatics, 2018, 34(8): 1381 doi: 10.1093/bioinformatics/btx761
[27]	Zhang Y, Wang X W, Hou Z, et al. Clinical named entity recognition from Chinese electronic health records via machine learning methods. JMIR Med Inf, 2018, 6(4): e50 doi: 10.2196/medinform.9965