1. Named Entity Recognition (NER)
NE Type | Examples |
---|---|
ORGANIZATION | Georgia-Pacific Corp., WHO |
PERSON | Eddy Bonte, President Obama |
LOCATION | Murray River, Mount Everest |
DATE | June, 2008-06-29 |
TIME | two fifty a m, 1:30 p.m. |
MONEY | 175 million Canadian Dollars, GBP 10.40 |
PERCENT | twenty pct, 18.75 % |
FACILITY | Washington Monument, Stonehenge |
GPE (geo-political entity) | South East Asia, Midlothian |
s="""The fourth Wells account moving to another agency is the packaged paper-products division of Georgia-Pacific Corp., which arrived at Wells only last fall. Like Hertz and the History Channel, it is also leaving for an Omnicom-owned agency, the BBDO South unit of BBDO Worldwide. BBDO South in Atlanta, which handles corporate advertising for Georgia-Pacific, will assume additional duties for brands like Angel Soft toilet tissue and Sparkle paper towels, said Ken Haldin, a spokesman for Georgia-Pacific in Atlanta."""
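import nltk
# Requires the NLTK tokenizer, tagger, and NE chunker data (install with nltk.download())
s_w = nltk.word_tokenize(s)  # tokenize the text into words
s_tag = nltk.pos_tag(s_w)  # part-of-speech tagging
print(nltk.ne_chunk(s_tag))  # ne_chunk performs named entity recognition
# print(nltk.ne_chunk(s_tag, binary=True))  # with binary=True every entity is labeled NE; otherwise the specific type is shown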
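The tree printed by `ne_chunk` is easy to read but awkward to process further. NLTK can flatten a chunk tree into (word, POS, IOB-tag) triples with `nltk.chunk.tree2conlltags`. The sketch below assumes such triples are already available and groups them back into (label, text) pairs. The helper name `group_entities` and the sample tags are hypothetical, written by hand for illustration; they are not actual `ne_chunk` output.

```python
def group_entities(iob_triples):
    """Group (word, pos, iob) triples into (label, text) entity pairs."""
    entities = []
    label, words = None, []
    for word, _pos, iob in iob_triples:
        # "I-X" continues the open entity only when its type matches.
        if iob.startswith("I-") and iob[2:] == label:
            words.append(word)
            continue
        # Anything else closes the open entity, if there is one.
        if label is not None:
            entities.append((label, " ".join(words)))
            label, words = None, []
        # "B-X" (or a stray "I-X") opens a new entity; "O" opens nothing.
        if iob != "O":
            label, words = iob[2:], [word]
    if label is not None:
        entities.append((label, " ".join(words)))
    return entities


# Hand-written tags for illustration only (an assumption, not real output).
# With NLTK, triples could come from:
#   nltk.chunk.tree2conlltags(nltk.ne_chunk(s_tag))
sample = [
    ("Ken", "NNP", "B-PERSON"),
    ("Haldin", "NNP", "I-PERSON"),
    (",", ",", "O"),
    ("a", "DT", "O"),
    ("spokesman", "NN", "O"),
    ("for", "IN", "O"),
    ("Georgia-Pacific", "NNP", "B-ORGANIZATION"),
    ("in", "IN", "O"),
    ("Atlanta", "NNP", "B-GPE"),
]
print(group_entities(sample))
```

A flat list of (label, text) pairs like this is usually easier to count, filter, or store than the nested tree.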
Exercise: Following the example above, perform NER on the text below.
Guangdong University of Foreign Studies (GDUFS) is a major internationalized university in South China for its global-minded faculty/students and its research on international languages, literature, culture, trade and strategic studies.
Dating back to 1965 when the Guangzhou Institute of Foreign Languages was established and 1980 when the Guangzhou Institute of Foreign Trade was founded, the University had its present form by merging the two in 1995, with the Guangdong College of Finance and Economics incorporated into the University in 2008. The University has three campuses with a total area of 153 hectares: the North Campus at the foot of the Baiyun Mountain, the South Campus in Guangzhou Higher Education Mega Center, and Dalang Campus.
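2. Relation Extraction
Once the named entities in a text have been identified, relation extraction can be used to pull out information. One approach is to look for all triples of the form (X, a, Y), where X and Y are named entities and a is the string that expresses the relation between them. For example: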
import nltk, re
IN = re.compile(r'.*\bin\b')  # precompiled regular expression matching filler text that contains the word "in"
for doc in nltk.corpus.ieer.parsed_docs('NYT_19980315'):
    for rel in nltk.sem.extract_rels('ORG', 'LOC', doc, corpus='ieer', pattern=IN):
        print(nltk.sem.rtuple(rel))
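To make the (X, a, Y) idea concrete without the IEER corpus, here is a minimal pure-Python sketch. It is a simplification and not NLTK's implementation: the function name `extract_triples` and the input format (plain words as strings, entities as (label, text) tuples) are assumptions made for illustration, and only adjacent entity pairs are considered.

```python
import re


def extract_triples(items, subj_label, obj_label, pattern):
    """Return (X, a, Y) triples from a sequence that mixes plain-word
    strings with (label, text) entity tuples."""
    triples = []
    prev = None   # most recent entity seen
    filler = []   # plain words seen since that entity
    for item in items:
        if isinstance(item, tuple):
            if prev is not None:
                a = " ".join(filler)
                # Keep the pair only if both entity types and the filler match.
                if (prev[0] == subj_label and item[0] == obj_label
                        and pattern.match(a)):
                    triples.append((prev[1], a, item[1]))
            prev, filler = item, []
        else:
            filler.append(item)
    return triples


IN = re.compile(r'.*\bin\b')  # filler must contain the word "in"

# Hand-labeled input for illustration only (hypothetical, not tagger output).
items = [
    ("PERSON", "Ken Haldin"), ",", "a", "spokesman", "for",
    ("ORG", "Georgia-Pacific"), "in", ("LOC", "Atlanta"),
]
print(extract_triples(items, "ORG", "LOC", IN))
```

Here the PERSON and ORG pair is skipped because the entity types do not match, while the ORG and LOC pair is kept because the filler between them matches the pattern.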
3. BosonNLP
https://bosonnlp.com/
An open platform for Chinese semantic analysis.