
A topic detection method for network long text

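  • Summary: This paper proposes a topic detection method for long network texts. To address the high dimensionality and sparsity of text representations and their neglect of latent semantics, a Word2vec & LDA (latent Dirichlet allocation) text representation is proposed. A weighted fusion of the latent topics of feature words extracted by LDA and the feature word vectors mapped by Word2vec both reduces dimensionality and represents the text information fairly completely. To address the sensitivity of traditional topic discovery methods to the input order of long texts, a text-clustering-based Single-Pass & HAC (hierarchical agglomerative clustering) topic discovery method is proposed. By introducing a time window and agglomerative hierarchical clustering, it is more robust to the input order of the texts and also improves clustering accuracy and efficiency. To evaluate the effectiveness of the proposed method, a real-world multi-source dataset was collected from the social platform of a university, and extensive experiments were conducted on it. The experimental results show that, compared with existing methods such as VSM (vector space model) and Single-Pass, the proposed method performs better, improving topic detection accuracy by 10%-20%.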
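The weighted fusion of LDA topics and Word2vec vectors can be illustrated with a small sketch. The abstract does not give the fusion formula, so everything below is an assumption: each feature word's Word2vec vector is scaled by a weight `alpha`, its LDA topic distribution by `1 - alpha`, the two parts are concatenated, and the document vector is the average over its feature words. The function name `fuse_document_vector`, the parameter `alpha`, and the toy dictionaries are hypothetical; in practice the vectors would come from trained Word2vec and LDA models.

```python
from typing import Dict, List


def fuse_document_vector(
    tokens: List[str],
    word_vectors: Dict[str, List[float]],
    word_topics: Dict[str, List[float]],
    alpha: float = 0.5,
) -> List[float]:
    """Average, over known tokens, of [alpha * Word2vec part, (1 - alpha) * LDA part].

    Assumed fusion scheme (weighted concatenation); not taken from the paper.
    """
    known = [t for t in tokens if t in word_vectors and t in word_topics]
    if not known:
        raise ValueError("no token has both a word vector and a topic distribution")
    dim = len(word_vectors[known[0]]) + len(word_topics[known[0]])
    total = [0.0] * dim
    for t in known:
        fused = [alpha * x for x in word_vectors[t]]
        fused += [(1 - alpha) * p for p in word_topics[t]]
        for i, value in enumerate(fused):
            total[i] += value
    return [value / len(known) for value in total]


# Toy stand-ins for trained model outputs (hypothetical values).
word_vectors = {"exam": [1.0, 0.0], "library": [0.0, 1.0]}   # Word2vec, 2 dims
word_topics = {"exam": [0.8, 0.2], "library": [0.4, 0.6]}    # LDA, 2 topics

# "unknown" is skipped because it has no vector and no topic distribution.
doc_vector = fuse_document_vector(["exam", "library", "unknown"],
                                  word_vectors, word_topics, alpha=0.5)
print(doc_vector)
```

The resulting vector has as many dimensions as the embedding size plus the number of topics, independent of vocabulary size, which is the dimensionality reduction the summary refers to.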

     

    Abstract: Internet public opinion is an important source of people's views on social hotspots and national current affairs. Topic detection in long network texts supports the analysis of online public opinion, and its results help policymakers make timely, reliable, and well-founded decisions. In general, topic detection can be divided into two steps: representation learning and topic discovery. However, common representation learning methods, such as the vector space model (VSM) and term frequency-inverse document frequency, often suffer from high dimensionality, sparsity, and loss of latent semantics, whereas traditional topic discovery methods depend heavily on the order in which the texts are input. To overcome these problems, a novel topic detection method is presented. First, a representation learning method based on Word2vec and latent Dirichlet allocation (LDA) is proposed to avoid high-dimensional sparsity and the neglect of latent semantics. A weighted fusion of the latent topics of feature words extracted by LDA and the feature word vectors mapped by Word2vec both reduces dimensionality and represents the text information fairly completely. Furthermore, combining Single-Pass with hierarchical agglomerative clustering for topic discovery makes the method more robust to the input order. To evaluate the effectiveness and efficiency of the proposed method, extensive experiments were conducted on a real-world multi-source dataset collected from university social platforms. The experimental results show that the proposed method outperforms other methods, such as VSM and Single-Pass, improving clustering accuracy by 10%-20%.
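The topic discovery step can be sketched in the same spirit. The abstract does not say how Single-Pass and hierarchical agglomerative clustering are combined, so the following is only one plausible reading, offered as an assumption: a strict Single-Pass run produces small, order-dependent fragments, and an agglomerative step then merges fragments whose centroids are similar. The function names, the thresholds, and the toy vectors are hypothetical, and no time window is modeled here.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def centroid(vectors: List[Vector]) -> Vector:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def single_pass(docs: List[Vector], threshold: float) -> List[List[int]]:
    """Assign each document, in arrival order, to its most similar cluster,
    or open a new cluster if no centroid reaches the threshold."""
    clusters: List[List[int]] = []
    for i, doc in enumerate(docs):
        best, best_sim = -1, threshold
        for c, members in enumerate(clusters):
            sim = cosine(doc, centroid([docs[j] for j in members]))
            if sim >= best_sim:
                best, best_sim = c, sim
        if best < 0:
            clusters.append([i])
        else:
            clusters[best].append(i)
    return clusters


def hac_merge(docs: List[Vector], clusters: List[List[int]],
              threshold: float) -> List[List[int]]:
    """Repeatedly merge the two clusters with the most similar centroids
    until no pair reaches the threshold (centroid linkage, assumed)."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best_pair, best_sim = None, threshold
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                sim = cosine(centroid([docs[j] for j in clusters[a]]),
                             centroid([docs[j] for j in clusters[b]]))
                if sim >= best_sim:
                    best_pair, best_sim = (a, b), sim
        if best_pair is None:
            break
        a, b = best_pair
        clusters[a].extend(clusters.pop(b))  # b > a, so index a stays valid
    return clusters


# Toy document vectors: two about one topic, two about another.
docs = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
fragments = single_pass(docs, threshold=0.999)   # strict pass over-fragments
topics = hac_merge(docs, fragments, threshold=0.9)
print(fragments, topics)
```

With these toy vectors the strict pass leaves every document in its own fragment, and the agglomerative step groups documents 0 and 1 into one topic and documents 2 and 3 into another, regardless of the order in which the fragments were created.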

     
