Problem:
unprecedented amount of data, diverse sources, heterogeneous formats
Acquire data from different sources via different methods
Mine useful information from the data using a variety of techniques
Integrate the mined information with existing structured data (entity linking)
Natural language interface: TR Discover; natural language questions are translated into executable queries for answer retrieval
1) How to process and mine useful information from large amounts of unstructured and structured data
2) How to integrate the mined information for the same entity across disconnected data sources and store it in a manner that allows easy and efficient access
3) How to quickly find the entities that satisfy the information needs of today's knowledge workers
Ingest and consume the data in a scalable manner. The data ingestion process needs to be robust enough to process all types of data
Add structure to free-text documents (patent filings, financial reports, academic publications, etc.)
Cannot leave this data sitting in separate "silos": the data must be integrated
Entity-Relationship (ER) model: mature technology, but difficult to update quickly and limited to keyword queries
RDF model: flexible; data is represented as triples with no fixed schema. RDF allows modeling more expressive semantics of the data and can be used for knowledge inference
Keyword queries: cannot accurately express the user's query intent, especially for questions involving relationships or other restrictions, such as temporal constraints
Specialized query languages (SQL and SPARQL): require a professional background
Structured data: link each entity in the data to the relevant nodes in our graph and update the information of the nodes being linked to.
Unstructured data: first perform information extraction to extract the entities and their relationships with other entities; the extracted structured data is then integrated into our knowledge graph.
Named Entity Recognition: uses natural language processing techniques that include both rule-based and machine-learning algorithms.
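The paper combines rule-based and machine-learned NER; a minimal sketch of just the rule-based side (the regex and example sentence are my own illustration, not the system's actual rules):

```python
import re

# Toy rule: tag maximal runs of capitalized tokens as candidate entities.
# A real system layers many such rules and ML models on top of this idea.
ENTITY_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

def extract_candidate_entities(sentence: str) -> list[str]:
    """Return capitalized spans as candidate named entities."""
    return ENTITY_PATTERN.findall(sentence)

ents = extract_candidate_entities("Thomson Reuters acquired data about Pfizer.")
```

Such rules over-generate (sentence-initial words are also capitalized), which is one reason the system also uses learned models.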
Relation Extraction: a machine-learning classifier predicts the probability of a possible relationship for a given pair of identified entities in a given sentence.
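A sketch of this classification step, assuming a logistic model with hand-set toy weights (the real system learns feature weights from labeled data; the feature set here is invented for illustration):

```python
import math

# Toy weights: a single lexical feature (the relation verb appearing
# between the two entities) plus a bias. Illustrative values only.
WEIGHTS = {"verb:acquired": 2.0, "bias": -1.0}

def relation_probability(sentence: str, e1: str, e2: str,
                         relation: str = "acquired") -> float:
    """Logistic score that `e1 <relation> e2` holds (assumes e1 precedes e2)."""
    between = sentence.split(e1)[-1].split(e2)[0]  # text between the entities
    score = WEIGHTS["bias"]
    if relation in between:
        score += WEIGHTS.get(f"verb:{relation}", 0.0)
    return 1.0 / (1.0 + math.exp(-score))  # sigmoid -> probability

p = relation_probability("Thomson Reuters acquired Practical Law.",
                         "Thomson Reuters", "Practical Law")
```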
Entity Linking: match the attribute values of the nodes in the graph against those of a new entity.
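One way to sketch this attribute matching (the node data, the Jaccard similarity measure, and the threshold are my assumptions, not the paper's exact method):

```python
def jaccard(a: set, b: set) -> float:
    """Set-overlap similarity in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 0.0

def link_entity(new_entity: dict, graph_nodes: dict, threshold: float = 0.5):
    """Return the id of the best-matching node, or None (create a new node)."""
    best_id, best_score = None, 0.0
    for node_id, attrs in graph_nodes.items():
        score = jaccard(set(new_entity.values()), set(attrs.values()))
        if score > best_score:
            best_id, best_score = node_id, score
    return best_id if best_score >= threshold else None

nodes = {"n1": {"name": "Pfizer", "hq": "New York"},
         "n2": {"name": "Novartis", "hq": "Basel"}}
match = link_entity({"name": "Pfizer", "hq": "New York"}, nodes)
```

If no node scores above the threshold, the entity is added to the graph as a new node.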
RDF is usually depicted as a directed, labeled graph, but it is really a set of triples, each consisting of a subject, a predicate, and an object. The triples are stored in a triple store and queried with the SPARQL query language. Representing data as triples does not require a fixed schema (unlike a relational database), yet RDF supports expressing rich semantics and enables knowledge inference. Another major advantage of the RDF model is that data can be deleted and updated more easily.
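A minimal in-memory sketch of the triple model and pattern-based querying (the example triples are invented; a real deployment would use a triple store queried via SPARQL):

```python
# Triples as (subject, predicate, object) tuples -- no fixed schema needed.
triples = {
    ("Metformin", "treats", "Diabetes"),
    ("Metformin", "type", "Drug"),
    ("Diabetes", "type", "Disease"),
}

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return {(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)}

# Analogous to SPARQL: SELECT ?o WHERE { :Metformin :treats ?o }
objs = {o for _, _, o in match(s="Metformin", p="treats")}
```

Deleting or updating a fact touches only the affected triples, which is the ease-of-update advantage noted above.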
Index the triples on their subject, predicate, and object respectively with the Elasticsearch engine.
Build a full-text search index on objects that are literal values; such literal values are tokenized and treated as terms in the index.
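The tokenize-literals-into-terms idea can be sketched as a tiny inverted index (the paper uses Elasticsearch for this; the pure-Python version and the sample triples here are only illustrative):

```python
from collections import defaultdict

fulltext_index = defaultdict(set)   # term -> ids of triples containing it

def index_triple(tid, subject, predicate, obj):
    """Tokenize literal objects into terms for full-text search."""
    if isinstance(obj, str):        # literal value: tokenize and index
        for term in obj.lower().split():
            fulltext_index[term].add(tid)

index_triple(1, "Pfizer", "headquarteredIn", "New York City")
index_triple(2, "Pfizer", "foundedIn", 1849)   # non-literal: not tokenized

hits = fulltext_index["york"]   # partial match on a literal's terms
```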
Auto-suggest mechanism (helps users complete their questions)
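A prefix-completion sketch of the auto-suggest idea (the candidate phrases are invented; TR Discover derives legal completions from its question grammar rather than a flat list):

```python
# Hypothetical completion candidates; the real system only offers
# completions that its grammar can still parse into a full question.
SUGGESTIONS = ["drugs", "drugs developed by", "drugs developed by Pfizer",
               "diseases", "companies"]

def suggest(prefix: str, limit: int = 3) -> list[str]:
    """Offer completions for what the user has typed so far."""
    return [s for s in SUGGESTIONS if s.startswith(prefix.lower())][:limit]

out = suggest("drugs d")
```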
Map the user's natural language question to an intermediate language, then translate the intermediate language into a standard query language
Steps: Question Understanding --> Enabling Question Completion with Auto-suggest --> Question Translation and Execution
The FOL representation of a natural language question is further translated to an executable query
1) Parse the FOL representation into a parse tree using an FOL parser
2) Then perform an in-order traversal of the FOL parse tree and translate it into an executable query
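The two steps above can be sketched as follows, assuming a toy FOL parse tree and a simplified SQL-like target (the node classes, predicate names, and output format are stand-ins for the paper's actual FOL parser and query generator):

```python
# Minimal FOL parse-tree nodes: atomic predicates and conjunction.
class Pred:
    def __init__(self, name, arg):
        self.name, self.arg = name, arg

class And:
    def __init__(self, left, right):
        self.left, self.right = left, right

def to_condition(node):
    """In-order traversal of the FOL parse tree, emitting a query condition."""
    if isinstance(node, And):
        return f"{to_condition(node.left)} AND {to_condition(node.right)}"
    return f"{node.name} = '{node.arg}'"

# drug(x) AND developed_by(x, Pfizer) rendered as an executable query
tree = And(Pred("type", "drug"), Pred("developed_by", "Pfizer"))
query = "SELECT x WHERE " + to_condition(tree)
```

The same traversal can target SPARQL instead of SQL by emitting triple patterns at the leaves.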
In this paper, we present our effort in building and querying Thomson Reuters’ knowledge graph. Data in heterogeneous formats is first acquired from various sources. We then develop named entity recognition, relation extraction and entity linking techniques for mining information from the data and integrating the mined data across different sources. We model and store our data in RDF triples, and present TR Discover that enables users to search for information with natural language questions. We evaluate and demonstrate the practicability of our knowledge graph. In future work, we would like to enhance our NLP algorithms in order to cover more domains. Also, rather than relying on a pre-defined grammar for understanding natural language questions, we will explore the possibility of developing a more flexible question parser. Finally, we will deploy our knowledge graph to more products and improve our various services according to customer feedback.