Problem:
unprecedented amount of data, diverse sources, heterogeneous formats
Acquire data from different sources via different methods
Mine useful information from the data using a variety of techniques
Integrate the mined information with existing structured data (entity linking)
Natural language interface: TR Discover; natural language questions are translated into executable queries for answer retrieval
1) How to process and mine useful information from large amounts of unstructured and structured data
2) How to integrate the mined information for the same entity across disconnected data sources and store it in a manner that allows easy and efficient access
3) How to quickly find the entities that satisfy the information needs of today's knowledge workers
Ingest and consume the data in a scalable manner. The data ingestion process needs to be robust enough to process all types of data
Add structure to free-text documents (patent filings, financial reports, academic publications, etc.)
Cannot leave this data sitting in separate "silos": the data must be integrated
Entity-Relationship (ER) model: mature technology, but difficult to update quickly and limited to keyword queries
RDF model: flexible; data is represented as triples with no fixed schema. RDF allows modeling more expressive semantics of the data and can be used for knowledge inference
Keyword queries: cannot accurately express the user's query intent, especially for questions involving relationships or other restrictions, such as temporal constraints
Specialized query languages (SQL and SPARQL): require a professional background
Structured data: link each entity in the data to the relevant nodes in our graph and update the information of the nodes being linked to.
Unstructured data: first perform information extraction to extract the entities and their relationships with other entities; the extracted structured data is then integrated into our knowledge graph.
Named Entity Recognition: uses natural language processing techniques that include both rule-based and machine-learning algorithms.
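The paper combines rule-based and machine-learned NER; a minimal sketch of just the rule-based side (the regex and example sentence are my own illustration, not the system's actual rules):

```python
import re

# Toy rule: tag maximal runs of capitalized tokens as candidate entities.
# A real system layers many such rules and ML models on top of this idea.
ENTITY_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

def extract_candidate_entities(sentence: str) -> list[str]:
    """Return capitalized spans as candidate named entities."""
    return ENTITY_PATTERN.findall(sentence)

ents = extract_candidate_entities("Thomson Reuters acquired data about Pfizer.")
```

Such rules over-generate (sentence-initial words are also capitalized), which is one reason the system also uses learned models.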
Relation Extraction: a machine-learning classifier predicts the probability of a possible relationship for a given pair of identified entities in a given sentence.
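A sketch of this classification step, assuming a logistic model with hand-set toy weights (the real system learns feature weights from labeled data; the feature set here is invented for illustration):

```python
import math

# Toy weights: a single lexical feature (the relation verb appearing
# between the two entities) plus a bias. Illustrative values only.
WEIGHTS = {"verb:acquired": 2.0, "bias": -1.0}

def relation_probability(sentence: str, e1: str, e2: str,
                         relation: str = "acquired") -> float:
    """Logistic score that `e1 <relation> e2` holds (assumes e1 precedes e2)."""
    between = sentence.split(e1)[-1].split(e2)[0]  # text between the entities
    score = WEIGHTS["bias"]
    if relation in between:
        score += WEIGHTS.get(f"verb:{relation}", 0.0)
    return 1.0 / (1.0 + math.exp(-score))  # sigmoid -> probability

p = relation_probability("Thomson Reuters acquired Practical Law.",
                         "Thomson Reuters", "Practical Law")
```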
Entity Linking: match the attribute values of the nodes in the graph against those of a new entity.
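One way to sketch this attribute matching (the node data, the Jaccard similarity measure, and the threshold are my assumptions, not the paper's exact method):

```python
def jaccard(a: set, b: set) -> float:
    """Set-overlap similarity in [0, 1]."""
    return len(a & b) / len(a | b) if a | b else 0.0

def link_entity(new_entity: dict, graph_nodes: dict, threshold: float = 0.5):
    """Return the id of the best-matching node, or None (create a new node)."""
    best_id, best_score = None, 0.0
    for node_id, attrs in graph_nodes.items():
        score = jaccard(set(new_entity.values()), set(attrs.values()))
        if score > best_score:
            best_id, best_score = node_id, score
    return best_id if best_score >= threshold else None

nodes = {"n1": {"name": "Pfizer", "hq": "New York"},
         "n2": {"name": "Novartis", "hq": "Basel"}}
match = link_entity({"name": "Pfizer", "hq": "New York"}, nodes)
```

If no node scores above the threshold, the entity is added to the graph as a new node.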
RDF is usually depicted as a directed, labeled graph, but it is really a set of triples, each consisting of a subject, a predicate, and an object. The triples are stored in a triple store and queried with the SPARQL query language. Representing data as triples does not require a fixed schema (unlike a relational database), yet RDF supports expressing rich semantics and enables knowledge inference. Another major advantage of the RDF model is that data can be deleted and updated more easily.
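A minimal in-memory sketch of the triple model and pattern-based querying (the example triples are invented; a real deployment would use a triple store queried via SPARQL):

```python
# Triples as (subject, predicate, object) tuples -- no fixed schema needed.
triples = {
    ("Metformin", "treats", "Diabetes"),
    ("Metformin", "type", "Drug"),
    ("Diabetes", "type", "Disease"),
}

def match(s=None, p=None, o=None):
    """Return triples matching the pattern; None acts as a wildcard."""
    return {(ts, tp, to) for ts, tp, to in triples
            if s in (None, ts) and p in (None, tp) and o in (None, to)}

# Analogous to SPARQL: SELECT ?o WHERE { :Metformin :treats ?o }
objs = {o for _, _, o in match(s="Metformin", p="treats")}
```

Deleting or updating a fact touches only the affected triples, which is the ease-of-update advantage noted above.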
Index the triples on their subject, predicate, and object respectively with the Elasticsearch engine.
Build a full-text search index on objects that are literal values; such literal values are tokenized and treated as terms in the index.
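The tokenize-literals-into-terms idea can be sketched as a tiny inverted index (the paper uses Elasticsearch for this; the pure-Python version and the sample triples here are only illustrative):

```python
from collections import defaultdict

fulltext_index = defaultdict(set)   # term -> ids of triples containing it

def index_triple(tid, subject, predicate, obj):
    """Tokenize literal objects into terms for full-text search."""
    if isinstance(obj, str):        # literal value: tokenize and index
        for term in obj.lower().split():
            fulltext_index[term].add(tid)

index_triple(1, "Pfizer", "headquarteredIn", "New York City")
index_triple(2, "Pfizer", "foundedIn", 1849)   # non-literal: not tokenized

hits = fulltext_index["york"]   # partial match on a literal's terms
```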
Auto-suggest mechanism (helps users complete their questions)
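A prefix-completion sketch of the auto-suggest idea (the candidate phrases are invented; TR Discover derives legal completions from its question grammar rather than a flat list):

```python
# Hypothetical completion candidates; the real system only offers
# completions that its grammar can still parse into a full question.
SUGGESTIONS = ["drugs", "drugs developed by", "drugs developed by Pfizer",
               "diseases", "companies"]

def suggest(prefix: str, limit: int = 3) -> list[str]:
    """Offer completions for what the user has typed so far."""
    return [s for s in SUGGESTIONS if s.startswith(prefix.lower())][:limit]

out = suggest("drugs d")
```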
Map the user's natural language question to an intermediate language, then translate the intermediate language into a standard query language
Steps: Question Understanding --> Enabling Question Completion with Auto-suggest --> Question Translation and Execution
The FOL representation of a natural language question is further translated to an executable query
1) Parse the FOL representation into a parse tree using an FOL parser
2) Then perform an in-order traversal of the FOL parse tree and translate it into an executable query
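The two steps above can be sketched as follows, assuming a toy FOL parse tree and a simplified SQL-like target (the node classes, predicate names, and output format are stand-ins for the paper's actual FOL parser and query generator):

```python
# Minimal FOL parse-tree nodes: atomic predicates and conjunction.
class Pred:
    def __init__(self, name, arg):
        self.name, self.arg = name, arg

class And:
    def __init__(self, left, right):
        self.left, self.right = left, right

def to_condition(node):
    """In-order traversal of the FOL parse tree, emitting a query condition."""
    if isinstance(node, And):
        return f"{to_condition(node.left)} AND {to_condition(node.right)}"
    return f"{node.name} = '{node.arg}'"

# drug(x) AND developed_by(x, Pfizer) rendered as an executable query
tree = And(Pred("type", "drug"), Pred("developed_by", "Pfizer"))
query = "SELECT x WHERE " + to_condition(tree)
```

The same traversal can target SPARQL instead of SQL by emitting triple patterns at the leaves.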
In this paper, we present our effort in building and querying Thomson Reuters’ knowledge graph. Data in heterogeneous formats is first acquired from various sources. We then develop named entity recognition, relation extraction and entity linking techniques for mining information from the data and integrating the mined data across different sources. We model and store our data in RDF triples, and present TR Discover that enables users to search for information with natural language questions. We evaluate and demonstrate the practicability of our knowledge graph. In future work, we would like to enhance our NLP algorithms in order to cover more domains. Also, rather than relying on a pre-defined grammar for understanding natural language questions, we will explore the possibility of developing a more flexible question parser. Finally, we will deploy our knowledge graph to more products and improve our various services according to customer feedback.