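Summary: Based on semantic role analysis, this paper proposes a triple-based method for extracting the entity attributes of terrorism-related events, providing technical support for monitoring and early warning of terrorism-related activity in cyberspace. First, text corpus data from the "Anti-Terrorism Information Site" of the Northwest University of Political Science and Law are collected and cleaned. A naive Bayes text classification algorithm identifies terrorism-related event texts, and the keyword extraction algorithm TF-IDF (term frequency-inverse document frequency) is used to build a terrorism-specific lexicon, which is then combined with natural language processing techniques to produce a part-of-speech-tagged terrorism-specific lexicon. Next, semantic role analysis and syntactic dependency analysis are used to extract four types of terrorism-related triple structures: subject-predicate-object relations, postposed-attributive verb-object relations, person name/place name/organization relations, and preposition-object relations with subject-predicate and verb-complement structure. Finally, regular expressions and the part-of-speech-tagged terrorism-specific terms are used to extract six types of entity attributes from the four types of triple short texts: time of occurrence, location, casualties, attack method, weapon type, and terrorist organization. In experiments on 4221 collected articles, the F1 scores for all six types of entity attribute exceed 80%, which is of practical significance for monitoring and early warning of terrorism-related events in cyberspace and for maintaining public safety.

Abstract: Affected by complex international factors, terrorist incidents have become increasingly rampant in many countries in recent years, posing a great threat to the global community. In addition, with the widespread use of emerging technologies in military and commercial fields, terrorist organizations have begun to use these technologies for destructive activities. As the Internet and information technology develop, terrorism has spread rapidly in cyberspace. Terrorist organizations have created terrorism websites, established multinational networks, released recruitment information, and even conducted training activities through mainstream websites with a worldwide reach. Compared with traditional terrorist activities, cyber terrorist activities are more destructive. Cybercrime and cyber terrorism have become the most serious challenges facing societies. Terrorist organizations take advantage of the Internet to disseminate extremist ideas rapidly and to recruit large numbers of terrorists and supporters around the world, especially in developed Western countries. They even use the Internet and "dark net" networks to conduct terrorist training, which keeps their activities concealed. As a result, "lone wolf" terrorist attacks have emerged one after another in various countries and are difficult to prevent.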
This study proposes a method for extracting the entities and attributes of terrorist events based on semantic role analysis, providing technical support for monitoring and early warning of terrorist activity in cyberspace. First, a naive Bayes text classification algorithm is used to identify terrorism events in a cleaned text corpus collected from the Anti-Terrorism Information Site of the Northwest University of Political Science and Law. The TF-IDF keyword extraction algorithm, combined with natural language processing techniques, is then used to build a part-of-speech-tagged terrorism-specific vocabulary from the classified corpus. Next, semantic role analysis and syntactic dependency analysis are used to extract four types of triples: subject-predicate-object relations, postposed-attributive verb-object relations, person name/place name/organization relations, and preposition-object relations with subject-predicate and verb-complement structure. Finally, regular expressions and the part-of-speech-tagged terrorism-specific vocabulary are used to extract six entity attributes (occurrence time, occurrence location, casualties, attack method, weapon type, and terrorist organization) from the four types of triple short texts. In experiments on 4221 collected articles, the F1 scores for all six types of entity attribute exceed 80%. The proposed method therefore has practical value for monitoring and early warning of terrorism events in cyberspace and for maintaining public safety.
Key words:
- entity extraction
- semantic role analysis
- triples
- naive Bayes
- text categorization
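The TF-IDF keyword-scoring step named in the abstract can be illustrated with a minimal sketch. The toy corpus, the function names `tfidf` and `top_keywords`, and the unsmoothed `log(N / df)` weighting are assumptions made for illustration; the abstract does not state which TF-IDF variant or which tokenizer was used.

```python
import math
from collections import Counter


def tfidf(docs):
    """Score every term of every document.

    docs is a list of token lists (already word-segmented).
    Returns one {term: score} dict per document, using
    tf = count / document length and idf = log(N / document frequency).
    """
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        counts = Counter(doc)
        scores.append({
            term: (count / len(doc)) * math.log(n_docs / df[term])
            for term, count in counts.items()
        })
    return scores


def top_keywords(doc_scores, k=2):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(doc_scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [term for term, _ in ranked[:k]]


# Hypothetical, hand-segmented toy corpus (not taken from the paper's data).
docs = [
    ["爆炸", "爆炸", "襲擊", "死亡"],
    ["塔利班", "發動", "襲擊"],
    ["會議", "討論", "經濟"],
]
scores = tfidf(docs)
keywords = top_keywords(scores[0], k=2)
```

Terms that occur in many documents (here 襲擊, which appears in two of the three) receive a lower weight than terms concentrated in one document, which is why high-scoring terms are candidates for a domain lexicon.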
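Table 1. Semantic role analysis example

| Technique | 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 |
|---|---|---|---|---|---|---|---|---|---|
| WS | 阿富汗 | 首都 | 爆炸 | 襲擊 | 造成 | 至少 | 4 | 人 | 死亡 |
| English gloss | Afghanistan | capital | explosion | attack | cause | at least | 4 | people | die |
| POS | ns | n | v | v | v | d | m | n | v |
| DP | 2:ATT | 4:ATT | 4:ATT | 5:SBV | 0:HED | 8:ATT | 8:ATT | 9:SBV | 5:VOB |
| SRL | | | | | A0:(0,3), A1:(5,8) | | | | A1:(5,7) |

WS: word segmentation; POS: part-of-speech tagging; DP: dependency parsing; SRL: semantic role labeling.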
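To make the Table 1 annotations concrete, the following sketch turns them into triples in two ways: from the SRL argument spans and from the SBV/VOB dependency arcs. The function names and the pairing rules are illustrative assumptions, not the paper's actual rule set. The data are copied from Table 1, where SRL spans are 0-based and inclusive and DP head indices are 1-based (0 marks the root).

```python
# Annotations copied from Table 1.
words = ["阿富汗", "首都", "爆炸", "襲擊", "造成", "至少", "4", "人", "死亡"]
# DP: (head index, relation) per word; heads are 1-based, 0 is the root.
dp = [(2, "ATT"), (4, "ATT"), (4, "ATT"), (5, "SBV"), (0, "HED"),
      (8, "ATT"), (8, "ATT"), (9, "SBV"), (5, "VOB")]
# SRL: predicate index -> [(role, start, end)], 0-based inclusive spans.
srl = {4: [("A0", 0, 3), ("A1", 5, 8)], 8: [("A1", 5, 7)]}


def srl_triples(words, srl):
    """Build (A0, predicate, A1) for predicates that have both roles."""
    triples = []
    for pred_idx, args in sorted(srl.items()):
        roles = {role: "".join(words[s:e + 1]) for role, s, e in args}
        if "A0" in roles and "A1" in roles:
            triples.append((roles["A0"], words[pred_idx], roles["A1"]))
    return triples


def dp_triples(words, dp):
    """Pair each SBV dependent with a VOB dependent of the same head verb."""
    triples = []
    for i, (head, rel) in enumerate(dp):
        if rel != "SBV":
            continue
        for j, (other_head, other_rel) in enumerate(dp):
            if other_rel == "VOB" and other_head == head:
                triples.append((words[i], words[head - 1], words[j]))
    return triples


from_srl = srl_triples(words, srl)
from_dp = dp_triples(words, dp)
```

On this sentence the SRL route yields the full-span triple (阿富汗首都爆炸襲擊, 造成, 至少4人死亡), while the bare dependency route yields only the head words (襲擊, 造成, 死亡). Expanding head words with their ATT modifiers would be one way to recover longer arguments.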
Table 2. Overview of training and test data

| Area | US and Europe | Asia-Pacific | Middle East | Central and South Asia | West Asia and Africa |
|---|---|---|---|---|---|
| Number of texts in corpus | 14110 | 3513 | 11169 | 3178 | 10251 |
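Table 3. Sample base times for event occurrence

| Type | Sample | English gloss |
|---|---|---|
| Post time | “作者:來源:新華社 發布時間:2019年02月14日 點擊數:1” | "Author: Source: Xinhua News Agency; Published: 14 February 2019; Views: 1" |
| Report time | “新華社內羅畢2月13日電……司令部13日下午證實……美軍11日在……” | "Xinhua, Nairobi, 13 February ... the command confirmed on the afternoon of the 13th ... on the 11th the US military ..." |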
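The two kinds of base time in Table 3 suggest how partial date mentions such as "13日" could be resolved. The sketch below is one possible regular-expression approach, with the patterns, the function names `base_date` and `resolve_mentions`, and the fill-in rule (missing year and month are taken from the post time) all being assumptions for illustration rather than the paper's actual expressions.

```python
import re
from datetime import date

# Page header pattern: "發布時間:YYYY年MM月DD日" (full- or half-width colon).
POST_TIME = re.compile(r"發布時間[::]\s*(\d{4})年(\d{1,2})月(\d{1,2})日")
# In-text mention: optional "M月" followed by "D日".
MENTION = re.compile(r"(?:(\d{1,2})月)?(\d{1,2})日")


def base_date(header):
    """Return the post date found in a page header, or None."""
    m = POST_TIME.search(header)
    if m is None:
        return None
    year, month, day = (int(g) for g in m.groups())
    return date(year, month, day)


def resolve_mentions(text, base):
    """Complete each 'M月D日' or 'D日' mention using the base date.

    The year always comes from the base date; the month does too
    when the mention omits it. Month and year rollover is ignored.
    """
    resolved = []
    for m in MENTION.finditer(text):
        month = int(m.group(1)) if m.group(1) else base.month
        resolved.append(date(base.year, month, int(m.group(2))))
    return resolved


# Samples from Table 3.
header = "作者:來源:新華社 發布時間:2019年02月14日 點擊數:1"
report = "新華社內羅畢2月13日電……司令部13日下午證實……美軍11日在……"

base = base_date(header)
dates = resolve_mentions(report, base)
```

A production version would need to handle reports published just after a month or year boundary, where a bare "30日" refers to the previous month.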
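Table 4. Examples of subject-predicate-object triple extraction

| No. | Sentence | Triple | English gloss of sentence |
|---|---|---|---|
| 1 | 巴基斯坦卡拉奇南部發生一起恐怖襲擊 | 巴基斯坦卡拉奇南部,發生,一起恐怖襲擊 | A terrorist attack occurred in southern Karachi, Pakistan |
| 2 | 美國駐塔吉克斯坦領事館遭多名武裝分子襲擊 | 美國駐塔吉克斯坦領事館,遭,襲擊 | The US consulate in Tajikistan was attacked by several militants |
| 3 | 也門胡塞武裝分子當天凌晨向沙特吉贊省發射炮彈 | 也門胡塞武裝分子,發射,炮彈 | Yemeni Houthi militants fired shells at Saudi Arabia's Jizan province early that morning |
| 4 | 巴加索拉鎮一個市場當天遭極端組織“博科圣地”爆炸襲擊 | 巴加索拉鎮一個市場,遭,極端組織博科圣地 | A market in the town of Baga Sola was bombed that day by the extremist group Boko Haram |
| 5 | 北約車隊當天在阿東部遭遇自殺式爆炸襲擊 | 北約車隊,遭遇,自殺式爆炸 | A NATO convoy encountered a suicide bombing in eastern Afghanistan that day |
| 6 | 埃及西奈半島北部城市阿里什一酒店24日遭自殺式炸彈襲擊 | 埃及西奈半島北部城市阿里什一酒店,遭,自殺式炸彈襲擊 | A hotel in Arish, a city in the northern Sinai Peninsula of Egypt, was hit by a suicide bomb attack on the 24th |
| 7 | 塔利班6日晚在阿富汗西部巴德吉斯省再次發動襲擊 | 塔利班,發動,襲擊 | The Taliban launched another attack in Badghis province, western Afghanistan, on the evening of the 6th |
| 8 | 也門南部一警察基地15日發生自殺式恐怖襲擊事件 | 也門南部一警察基地,發生,自殺式恐怖襲擊事件 | A suicide terrorist attack occurred at a police base in southern Yemen on the 15th |
| 9 | 兩名女性自殺式襲擊者客在尼日利亞東北部一處擁擠的巿集引爆炸彈 | 兩名女性自殺式襲擊者客,引爆,炸彈 | Two female suicide attackers detonated bombs in a crowded market in northeastern Nigeria |
| 10 | 黎巴嫩首都貝魯特南郊的一處繁華區域發生自殺式炸彈襲擊 | 黎巴嫩首都貝魯特南郊一處繁華區域,發生,自殺式炸彈襲擊 | A suicide bomb attack occurred in a busy area in the southern suburbs of Beirut, the Lebanese capital |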
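Table 5. Examples of postposed-attributive verb-object triple extraction

| No. | Sentence | Triple | English gloss of sentence |
|---|---|---|---|
| 1 | 靠近土耳其邊境的一個難民營進行了空襲 | 一個難民營,靠近,土耳其邊境 | [...] an airstrike on a refugee camp near the Turkish border |
| 2 | 位于埃及北部城市坦塔的一所教堂9日發生爆炸 | 一所教堂,位于,埃及北部城市坦塔 | An explosion occurred on the 9th at a church in Tanta, a city in northern Egypt |
| 3 | 恐怖分子在敘利亞古城阿勒頗發射了裝有有毒物質的炸彈 | 炸彈,裝有,有毒物質 | Terrorists fired bombs containing toxic substances in the ancient Syrian city of Aleppo |
| 4 | 來自浙江的游客陳云華在泰國警察總醫院里見到新華社記者時仍驚魂未定 | 游客,來自,浙江 | Chen Yunhua, a tourist from Zhejiang, was still shaken when he met a Xinhua reporter at the Police General Hospital in Thailand |
| 5 | 警方稱此次事件為“嚴重的恐怖主義”事件 | 事件,為,嚴重恐怖主義 | Police called the incident a "serious terrorism" incident |
| 6 | 德國北部城市呂貝克一輛公交車上發生持刀行兇案件 | 行兇案件,持,刀 | A knife attack occurred on a bus in the northern German city of Lübeck |
| 7 | 自2015年11月來自比利時布魯塞爾莫倫貝克區的恐怖分子在法國巴黎制造血腥恐襲 | 恐怖分子,來自,比利時布魯塞爾莫倫貝克區 | Since terrorists from the Molenbeek district of Brussels, Belgium, carried out bloody terror attacks in Paris, France, in November 2015 |
| 8 | 在馬里東北部遭遇“伊斯蘭支持者”組織的埋伏 | 埋伏,遭遇,伊斯蘭支持者組織 | [...] ran into an ambush by the "Supporters of Islam" group in northeastern Mali |
| 9 | 襲擊目標是駐阿外國軍隊車輛 | 外國軍隊車輛,駐,阿 | The target of the attack was foreign military vehicles stationed in Afghanistan |
| 10 | 造成包括6名美軍士兵在內的13人喪生 | 13人,包括,6名美軍士兵 | [...] killing 13 people, including 6 US soldiers |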
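Table 6. Examples of person name/place name/organization triple extraction

| No. | Sentence | Triple | English gloss of sentence |
|---|---|---|---|
| 1 | 伊北部薩拉赫丁省首府提克里特市一街區4日晚遭武裝分子襲擊 | 薩拉赫丁省,首府,提克里特市 | A neighborhood in Tikrit, capital of Salahuddin province in northern Iraq, was attacked by militants on the evening of the 4th |
| 2 | 伊拉克首都巴格達24日發生一起自殺式爆炸襲擊事件 | 伊拉克,首都,巴格達 | A suicide bombing occurred in Baghdad, the Iraqi capital, on the 24th |
| 3 | 伊中部費盧杰市17日晚發生自殺式爆炸襲擊 | 伊,中部費,盧杰市 | A suicide bombing occurred in the central Iraqi city of Fallujah on the evening of the 17th |
| 4 | 敘利亞城市哈德爾發生自殺式爆炸襲擊 | 敘利亞,城市,哈德爾 | A suicide bombing occurred in the Syrian city of Hader |
| 5 | 敘利亞沿海城市塔爾圖斯和杰卜萊23日遭到多起爆炸襲擊 | 敘利亞,沿海城市,塔爾圖斯 | The Syrian coastal cities of Tartus and Jableh were hit by multiple bombings on the 23rd |
| 6 | 喀布爾機場附近在阿富汗副總統杜斯塔姆抵達后不久發生爆炸 | 阿富汗,副總統,杜斯塔姆 | An explosion occurred near Kabul airport shortly after Afghan Vice President Dostum arrived |
| 7 | 聯合國秘書長潘基文發表聲明嚴辭譴責 | 聯合國,秘書長,潘基文 | UN Secretary-General Ban Ki-moon issued a statement of strong condemnation |
| 8 | 尼日利亞國家緊急事務管理局官員薩托米·艾哈邁德10日對媒體說 | 尼日利亞國家緊急事務管理局,官員,薩托米·艾哈邁德 | Satomi Ahmed, an official of Nigeria's National Emergency Management Agency, told the media on the 10th |
| 9 | 土耳其舍爾納克省國會議員費薩爾·薩雷伊德斯發表聲明稱 | 土耳其舍爾納克省國會,議員,費薩爾·薩雷伊德斯 | Faysal Sariyildiz, member of parliament for Turkey's Sirnak province, said in a statement |
| 10 | 俾路支省內政部長薩爾夫拉茲·布格蒂告訴記者 | 俾路支省,內政部長,薩爾夫拉茲·布格蒂 | Balochistan Home Minister Sarfraz Bugti told reporters |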
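The triples in Table 6 share a visible shape: a named entity, a relation noun, and a second named entity in sequence. A minimal sketch of matching that shape over part-of-speech-tagged words is given below. The tag names follow the style seen in Table 1 (`ns` for place names, `n` for common nouns) plus `nh` for person names and `ni` for organization names, and the tags on the sample sentence are hand-assigned. The function name `entity_triples` and the three-word window rule are assumptions for illustration, not the paper's extraction rule.

```python
# Tags treated as named entities in this sketch (assumed tagset).
ENTITY_TAGS = {"nh", "ns", "ni"}


def entity_triples(tagged):
    """Return (entity, relation noun, entity) for each matching 3-word window.

    tagged is a list of (word, pos) pairs in sentence order.
    """
    triples = []
    for (w1, p1), (w2, p2), (w3, p3) in zip(tagged, tagged[1:], tagged[2:]):
        if p1 in ENTITY_TAGS and p2 == "n" and p3 in ENTITY_TAGS:
            triples.append((w1, w2, w3))
    return triples


# Table 6, row 2, with hand-assigned (hypothetical) segmentation and tags.
tagged = [
    ("伊拉克", "ns"), ("首都", "n"), ("巴格達", "ns"), ("24日", "nt"),
    ("發生", "v"), ("一起", "m"), ("自殺式", "b"), ("爆炸", "v"),
    ("襲擊", "v"), ("事件", "n"),
]
triples = entity_triples(tagged)
```

Any rule of this kind depends on correct word segmentation. Row 3 of Table 6 (伊,中部費,盧杰市) appears to show a segmentation error carried through into the extracted triple.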
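Table 7. Examples of preposition-object (subject-predicate, verb-complement) triple extraction

| No. | Sentence | Triple | English gloss of sentence |
|---|---|---|---|
| 1 | 目前爆炸死亡人數已經由45人升至52人 | 爆炸人數,升至,52人 | The death toll from the explosion has now risen from 45 to 52 |
| 2 | 爆炸發生在巴格達西部一個什葉派聚居區 | 爆炸,發生在,巴格達西部一個什葉派聚居區 | The explosion occurred in a Shia neighborhood in western Baghdad |
| 3 | 這些伊拉克戰斗人員死于IS的襲擊 | 這些伊拉克戰斗人員,死于,IS | These Iraqi fighters died in an IS attack |
| 4 | 總部設在英國倫敦的敘利亞人權觀察組織8月1日晚發布聲明稱 | 總部,設在,英國倫敦 | The Syrian Observatory for Human Rights, headquartered in London, UK, said in a statement on the evening of 1 August |
| 5 | 在俄羅斯和敘利亞的官員證實停火已擴大到阿勒頗市僅幾小時后 | 停火,擴大到,阿勒頗市 | Just hours after Russian and Syrian officials confirmed that the ceasefire had been extended to the city of Aleppo |
| 6 | 從敘利亞境內極端組織“伊斯蘭國”控制地區發射的5枚火箭彈當天上午落在基利斯市 | 組織伊斯蘭國控制地區發射5枚火箭彈,落在,基利斯市 | Five rockets fired from territory in Syria controlled by the extremist group "Islamic State" landed in the city of Kilis that morning |
| 7 | 爆炸發生于該醫院急診部的入口處 | 爆炸,發生于,該醫院急診部入口處 | The explosion occurred at the entrance of the hospital's emergency department |
| 8 | 對峙持續至當地時間29號早晨 | 對峙,持續至,當地時間29號早晨 | The standoff lasted until the morning of the 29th local time |
| 9 | 莫斯科就發生一起汽車撞向行人的事故 | 汽車,撞向,行人 | An incident in which a car hit pedestrians occurred in Moscow |
| 10 | 兩起襲擊,發生在,極北大區靠近尼日利亞邊境科拉瓦鎮 | 兩起襲擊,發生在,極北大區靠近尼日利亞邊境科拉瓦鎮 | Two attacks occurred in a town in the Far North Region near the Nigerian border |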
Table 8. Entity attribute extraction evaluation results

| Entity attribute | Precision (%) | Recall (%) | F1 (%) |
|---|---|---|---|
| Occurrence time | 100 | 93.3 | 96.5 |
| Occurrence location | 86.3 | 89.5 | 87.9 |
| Attack method | 84.3 | 84.9 | 84.6 |
| Weapon type | 81.2 | 81.3 | 81.4 |
| Terrorist organization | 79.7 | 82.8 | 81.2 |
| Casualties | 100 | 91.2 | 95.4 |
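As a reading aid for Table 8, the F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R). The short check below recomputes F1 from the rounded precision and recall values in the table. The function name `f1` and the 0.1-point tolerance are choices made for this sketch.

```python
def f1(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (precision, recall, reported F1) in percent, copied from Table 8.
reported = {
    "Occurrence time": (100.0, 93.3, 96.5),
    "Occurrence location": (86.3, 89.5, 87.9),
    "Attack method": (84.3, 84.9, 84.6),
    "Weapon type": (81.2, 81.3, 81.4),
    "Terrorist organization": (79.7, 82.8, 81.2),
    "Casualties": (100.0, 91.2, 95.4),
}

# Rows whose recomputed F1 differs from the reported value by more than 0.1.
mismatches = [
    name for name, (p, r, rep) in reported.items()
    if abs(f1(p, r) - rep) > 0.1
]

for name, (p, r, rep) in reported.items():
    print(f"{name}: recomputed {f1(p, r):.1f}, reported {rep}")
```

Five of the six rows reproduce the reported F1 to one decimal place. For weapon type the recomputed value is about 81.2 against a reported 81.4. Since a harmonic mean cannot exceed the larger of its two inputs, that row likely contains a rounding or transcription slip in one of its three figures.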