I. Chunking
Segmentation and Labeling at both the Token and Chunk Levels
1 Noun phrase chunking (NP-chunking)
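import nltk

# A sentence that has already been POS-tagged by hand as (word, tag) pairs
sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
# NP chunk rule: an optional determiner, any number of adjectives, then a noun
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sentence)
print(result)
result.draw()  # opens a window showing the chunk tree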
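The chunk tree returned by cp.parse() can also be inspected in code rather than only drawn. The following is a minimal sketch, reusing the same sentence and grammar, that collects the words of every NP chunk; the variable name np_chunks is only illustrative.

```python
import nltk

sentence = [("the", "DT"), ("little", "JJ"), ("yellow", "JJ"), ("dog", "NN"),
            ("barked", "VBD"), ("at", "IN"), ("the", "DT"), ("cat", "NN")]
cp = nltk.RegexpParser("NP: {<DT>?<JJ>*<NN>}")
tree = cp.parse(sentence)

# Walk the tree and keep only the subtrees labelled NP;
# leaves() returns the (word, tag) pairs inside each chunk.
np_chunks = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(np_chunks)  # ['the little yellow dog', 'the cat']
```

Tokens that match no chunk rule (here "barked" and "at") stay as plain leaves directly under the root node.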
----------------------------------------------------------
If tagging the sentence by hand as above is too tedious, let NLTK tokenize and tag it instead:
import nltk
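# Requires the NLTK tokenizer and tagger data (install with nltk.download())
sentence = "the little dog barked at the cat"
words = nltk.word_tokenize(sentence)  # split into tokens
sent_tag = nltk.pos_tag(words)  # attach a POS tag to each token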
grammar = "NP: {<DT>?<JJ>*<NN>}"
cp = nltk.RegexpParser(grammar)
result = cp.parse(sent_tag)
print(result)
result.draw()
Exercise 1: Consider how to write a chunk pattern that covers the following noun phrases.
another/DT sharp/JJ dive/NN
trade/NN figures/NNS
any/DT new/JJ policy/NN measures/NNS
earlier/JJR stages/NNS
Panamanian/JJ dictator/NN Manuel/NNP Noriega/NNP
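One possible answer to the exercise, offered as a sketch rather than the only solution, is the pattern <DT>?<JJ.*>*<NN.*>+: an optional determiner, any adjectives of any type (JJ, JJR, and so on), then one or more nouns of any type (NN, NNS, NNP). The helper parse_tagged below is a hypothetical convenience for reading word/TAG strings, not part of NLTK.

```python
import nltk

def parse_tagged(text):
    """Hypothetical helper: turn 'word/TAG word/TAG' into (word, tag) pairs."""
    return [tuple(tok.rsplit("/", 1)) for tok in text.split()]

phrases = [
    "another/DT sharp/JJ dive/NN",
    "trade/NN figures/NNS",
    "any/DT new/JJ policy/NN measures/NNS",
    "earlier/JJR stages/NNS",
    "Panamanian/JJ dictator/NN Manuel/NNP Noriega/NNP",
]

# <JJ.*> matches any tag starting with JJ; <NN.*>+ matches one or more noun tags.
chunker = nltk.RegexpParser("NP: {<DT>?<JJ.*>*<NN.*>+}")

# Parse each phrase on its own so adjacent phrases are not merged into one chunk.
trees = [chunker.parse(parse_tagged(p)) for p in phrases]
for t in trees:
    print(t)
```

Each phrase should come back as a single NP chunk under the root node.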
II. Sentence structure analysis
1 Nested sentences
a. Usain Bolt broke the 100m record
b. The Jamaica Observer reported that Usain Bolt broke the 100m record
c. Andre said The Jamaica Observer reported that Usain Bolt broke the 100m record
d. I think Andre said the Jamaica Observer reported that Usain Bolt broke the 100m record
2 Sentence ambiguity
I shot an elephant in my pajamas.
http://www.nltk.org/book/ch08.html
import nltk

# Grammar in which the PP "in my pajamas" can attach to either the NP or the VP
groucho_grammar = nltk.CFG.fromstring("""
S -> NP VP
PP -> P NP
NP -> Det N | Det N PP | 'I'
VP -> V NP | VP PP
Det -> 'an' | 'my'
N -> 'elephant' | 'pajamas'
V -> 'shot'
P -> 'in'
""")
sent = ['I', 'shot', 'an', 'elephant', 'in', 'my', 'pajamas']
parser = nltk.ChartParser(groucho_grammar)
# The chart parser yields one tree per reading of the sentence
for tree in parser.parse(sent):
    print(tree)
3 Context-free grammar
(1) Recursive descent parser: nltk.app.rdparser()
(2) Shift-reduce parser: nltk.app.srparser()
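The two nltk.app demos above open interactive windows. The same two strategies are also available as plain classes, nltk.RecursiveDescentParser and nltk.ShiftReduceParser, which can be run in a script. The sketch below uses a small toy grammar that is an assumption made for illustration (it is not the groucho grammar): a recursive descent parser cannot handle left-recursive rules such as VP -> VP PP, so a grammar without them is used here.

```python
import nltk

# Toy grammar (illustrative assumption): no left recursion, no ambiguity.
toy_grammar = nltk.CFG.fromstring("""
S -> NP VP
VP -> V NP
NP -> 'Mary' | Det N
Det -> 'a'
N -> 'dog'
V -> 'saw'
""")
sent = ['Mary', 'saw', 'a', 'dog']

# Top-down: expand S into its productions and match words left to right.
rd_parser = nltk.RecursiveDescentParser(toy_grammar)
rd_trees = list(rd_parser.parse(sent))

# Bottom-up: shift words onto a stack and reduce when a rule's right side matches.
sr_parser = nltk.ShiftReduceParser(toy_grammar)
sr_trees = list(sr_parser.parse(sent))

for tree in rd_trees + sr_trees:
    print(tree)
```

On this grammar both parsers should arrive at the same single tree. On larger grammars the shift-reduce parser does no backtracking, so it can miss a parse that exists.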