Application Research of Computers (《計算機應用研究》)

Multi-document conceptual graph construction research based on open domain extraction (基于開放域抽取的多文檔概念圖構建研究)

Authors: 盛泳潘, 付雪峰, 吳天星
Affiliations: 1. 電子科技大學 計算機科學與工程學院, 成都 611731; 2. 南昌工程學院 信息工程學院, 南昌 330099; 3. 東南大學 計算機科學與工程學院, 南京 211189
Article ID: 1001-3695(2020)01-004-0019-07
DOI: 10.19734/j.issn.1001-3695.2018.05.0454
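Abstract: Against a background of information overload, mining and organizing the core concepts and their semantic connections from multiple documents that share a common topic has become an important challenge in information extraction. To address it, this paper proposes a novel multi-document conceptual graph construction method based on open-domain extraction. First, topic words are mined for a predefined topic and the documents are ranked with an improved TF-IDF algorithm. Then a series of steps, including coreference resolution, discourse weight computation, and triple instance extraction, extracts a large number of fact-bearing triple instances from the documents. To remove the noise inherent in open-domain methods and improve extraction accuracy, a triple instance filtering algorithm is proposed. It extracts a set of salient relation instances that have high confidence and good semantic compatibility, and these instances form multiple conceptual subgraphs. Finally, equivalent concepts and relations in different subgraphs are merged to form a single connected conceptual graph that expresses the topic well. Experiments on the Signal Media news dataset show that the proposed method can organize important topic information across documents, and that the resulting conceptual graph performs well on metrics such as topic concept coverage and the compatibility of relation instances. In practical applications, the conceptual graph is an important representation of multi-document content and a valuable reference both for users who want to explore how a given topic develops and for automatic document summarization.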
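The filtering and merging steps described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' algorithm: the function names, the confidence threshold of 0.8, the case-insensitive notion of "equivalent concept", and the toy triples are all assumptions made for illustration only.

```python
from collections import defaultdict


def filter_triples(scored_triples, min_conf=0.8):
    """Keep (subject, predicate, object) for triples at or above min_conf.

    min_conf is a hypothetical threshold, not a value from the paper.
    """
    return [(s, p, o) for (s, p, o, conf) in scored_triples if conf >= min_conf]


def normalize(concept):
    """Crude stand-in for concept equivalence: lowercase, collapse spaces."""
    return " ".join(concept.lower().split())


def merge_into_graph(triples):
    """Merge triples into one graph, unifying concepts that normalize alike."""
    graph = defaultdict(set)
    for s, p, o in triples:
        graph[normalize(s)].add((p, normalize(o)))
    return dict(graph)


# Toy input: invented triples with invented confidence scores.
scored = [
    ("Acme Corp", "acquired", "Widget Inc", 0.93),
    ("acme  corp", "is based in", "Springfield", 0.88),
    ("Widget Inc", "makes", "widgets", 0.41),  # dropped: below threshold
]
kept = filter_triples(scored)
graph = merge_into_graph(kept)
```

In this sketch the two spellings of "Acme Corp" collapse into one node, so the two surviving triples end up in the same connected graph. A real system would need a much stronger equivalence test than string normalization.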
Keywords: open-domain extraction; multi-document; conceptual graph construction
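Funding: National Natural Science Foundation of China (61762063)
Natural Science Foundation of Jiangxi Province (20171BAB202024)
Research Project of the Jiangxi Provincial Department of Education (GJJ170991)
National Program for State-Sponsored Graduate Students for Building High-Level Universities (201706070049)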
Article URL: http://www.048285.live/article/01-2020-01-004.html
Title (English): Multi-document conceptual graph construction research based on open domain extraction
Authors (English): Sheng Yongpan, Fu Xuefeng, Wu Tianxing
Affiliations (English): 1. School of Computer Science & Engineering, University of Electronic Science & Technology of China, Chengdu 611731, China; 2. School of Information Engineering, Nanchang Institute of Technology, Nanchang 330099, China; 3. School of Computer Science & Engineering, Southeast University, Nanjing 211189, China
Abstract (English): Against a background of information overload, it is challenging in information extraction to mine and organize meaningful concepts and their semantic connections from a set of related documents on the same topic. This paper therefore proposes a novel multi-document conceptual graph construction method based on open-domain information extraction. First, documents are ranked according to the improved TF-IDF weights of the topic words extracted for the predefined topics. The method then relies on a series of steps, including coreference resolution, weight computation, and triple instance extraction, to extract numerous representative subject-predicate-object triples from multiple documents. To filter out the noise inherent in the open-domain extraction approach and to improve extraction accuracy, this paper presents a triple filtering algorithm that retains only the most salient, confident, and compatible triples, which form multiple conceptual subgraphs. Finally, equivalent concepts and relationships across different subgraphs are merged to produce a connected conceptual graph. Experiments on the Signal Media dataset illustrate that the proposed method can discern the key information corresponding to a specific topic within and across documents, and that the resulting conceptual graph performs well in terms of topic concept coverage and triple compatibility. In practice, a conceptual graph can be regarded as an important representation of multiple documents, and it is valuable both for exploring how a topic develops and for automatic document summarization.
Keywords (English): open-domain extraction; multiple documents; conceptual graph construction
Received: 2018/5/23
Revised: 2018/8/6
Pages: 19-25
CLC number: TP391
Document code: A