Language Processing 100 Knocks (言語処理100本ノック), 2015 edition (50-54)

Chapter 6: Processing English Text

Perform the following processing on the English text (nlp.txt).

50. Sentence segmentation

Treat the pattern (. or ; or : or ? or !) → whitespace character → uppercase letter as a sentence boundary, and output the input document with one sentence per line.

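Expressed as a regular expression, the condition above looks like this (the shorter r'[.;:?!]\s[A-Z]' is equivalent):
r'\.\s[A-Z]|\;\s[A-Z]|\:\s[A-Z]|\?\s[A-Z]|\!\s[A-Z]'
The catch is that the match itself is not the delimiter. Naturally the punctuation mark (. or ; or : or ? or !) should stay with the preceding sentence, and the uppercase letter belongs to the following one.
So I keep the last three characters in a buffer, compare them against the regular expression above, and break the line when they match.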

#!/usr/bin/env python

import codecs
import re

fin = codecs.open('nlp.txt', 'r', 'utf_8')
punctuation = ""

if __name__ == "__main__":

  for line in fin:
    for x in line:
      punctuation = punctuation + x
      if len(punctuation) > 3:
        punctuation = punctuation[1:4]
        if re.search(r'\.\s[A-Z]|\;\s[A-Z]|\:\s[A-Z]|\?\s[A-Z]|\!\s[A-Z]',punctuation):
          print("")
      if x != "\n":
        print(x,end="")
  print("")

Result

Natural language processing
From Wikipedia, the free encyclopedia

Natural language processing (NLP) is a field of computer science, artificial intelligence, and linguistics concerned with the interactions between computers and human (natural) languages. 
As such, NLP is related to the area of humani-computer interaction. 
Many challenges in NLP involve natural language understanding, that is, enabling computers to derive meaning from human or natural language input, and others involve natural language generation.
(rest omitted)
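
As an alternative to the character-by-character buffer, the same boundary rule can be sketched as a single regular expression split. This is only a sketch, not the method used in the script above. The lookbehind and lookahead match positions rather than characters, so the punctuation stays with the preceding sentence and the uppercase letter with the following one. The name split_sentences is my own, and \s+ is an assumption that a run of several whitespace characters should also count as one boundary.

```python
import re

# Boundary: one of . ; : ? ! followed by whitespace and then an uppercase letter.
# Only the whitespace is consumed; the lookbehind and lookahead are zero-width.
BOUNDARY = re.compile(r'(?<=[.;:?!])\s+(?=[A-Z])')


def split_sentences(text):
    """Return the sentences found in text (hypothetical helper, not from the post)."""
    return [s for s in BOUNDARY.split(text) if s]


if __name__ == "__main__":
    sample = "NLP is a field of computer science. As such, it is broad: Many challenges remain."
    for sentence in split_sentences(sample):
        print(sentence)
```

Because the whitespace itself is consumed by the split, the sentences come out without a trailing space.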

51. Word extraction

Treating whitespace as the word delimiter, take the output of 50 as input and output one word per line. Output a blank line at the end of each sentence.

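Since a blank line is required at the end of each sentence, this is an extension of 50.
Whenever the character is a space, I start a new line. There is actually no instruction about "." or ",", but considering the later exercises I want to deal with them here.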

#!/usr/bin/env python

import codecs
import re

fin = codecs.open('nlp.txt', 'r', 'utf_8')
punctuation = ""

if __name__ == "__main__":
  n = 0 
  for line in fin:
    for x in line:
      if n == 50:
        break
      punctuation = punctuation + x
      if len(punctuation) > 3:
        punctuation = punctuation[1:4]
        if re.search(r'\.\s[A-Z]|\;\s[A-Z]|\:\s[A-Z]|\?\s[A-Z]|\!\s[A-Z]',punctuation):
          print("")
      if x != "\n" and x != "." and x != ",":
        if x == " ":
          print("")
          n = n + 1
        else:
          print(x,end="")
  print("")

Result

Natural
language
processingFrom
Wikipedia
the
free
encyclopediaNatural
language
processing
(NLP)
is
a
field
of
computer
science
artificial
intelligence
and
linguistics
concerned
with
the
interactions
between
computers
and
human
(natural)
languages

As
such
NLP
is
related
to
the
area
of
humani-computer
interaction

Many
challenges
in
NLP
involve
natural
language
understanding
that
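
For comparison, here is a minimal sketch of the same step working on a list of already separated sentences instead of single characters. The function name words_per_line is hypothetical. Two assumptions differ from the script above: str.split() treats line breaks as whitespace too, and "." and "," are stripped only from the edges of a word rather than removed everywhere.

```python
def words_per_line(sentences):
    """Return output lines: one word per line, then a blank line per sentence."""
    lines = []
    for sentence in sentences:
        for token in sentence.split():   # split on any run of whitespace
            word = token.strip(".,")     # drop periods and commas at the word edges
            if word:
                lines.append(word)
        lines.append("")                 # blank line marks the end of the sentence
    return lines


if __name__ == "__main__":
    demo = ["As such, NLP is related.", "Many challenges remain."]
    print("\n".join(words_per_line(demo)))
```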

52. Stemming

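Take the output of 51 as input, apply Porter's stemming algorithm, and output each word and its stem in tab-separated format. In Python, the stemming module is a good choice as an implementation of Porter's stemming algorithm.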

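A "stem" is the base part of a word, and stemming means reducing a word to that stem. Porter's stemming algorithm, mentioned above, is the best known.
Until now I had been getting away with just printing to standard output, so this time I build proper lists.
The stemming module is not part of the standard library, so it has to be installed separately (for example, pip install stemming).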

#!/usr/bin/env python

import codecs
import re
from stemming.porter2 import stem

fin = codecs.open('nlp.txt', 'r', 'utf_8')
punctuation = ""
src = []
string = []
word = ""

if __name__ == "__main__":
  n = 0 
  for line in fin:
    for x in line:
      if n == 50:
        break
      punctuation = punctuation + x
      if len(punctuation) > 3:
        punctuation = punctuation[1:4]
        if re.search(r'\.\s[A-Z]|\;\s[A-Z]|\:\s[A-Z]|\?\s[A-Z]|\!\s[A-Z]',punctuation):
          src.append(string)
          string = []
          word = ""
      if x == " ":
        if word != "":
          string.append(word)
          word = ""
          n = n + 1
      elif x == "\n":
        if word != "":
          string.append(word)
          word = ""
          n = n + 1
      elif x == "." or x == ",":
        pass  # drop periods and commas
      else:
        word = word + x
  src.append(string)

  for stringx in src:
    for wordx in stringx:
      print(wordx,"\t",stem(wordx))
    print("")

Result

Many turning into Mani is one of its charming quirks.

(beginning omitted)
Many 	 Mani
challenges 	 challeng
in 	 in
NLP 	 NLP
involve 	 involv
natural 	 natur
language 	 languag
(rest omitted)
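
One detail in the output: print(wordx,"\t",stem(wordx)) puts print's default separator, a space, on both sides of the tab, which is why the lines read "Many 	 Mani". If strictly tab-separated output is wanted, a small sketch like the following builds the line explicitly. format_stems is a hypothetical helper that accepts any callable as the stemmer, and the fallback in the demo is only a placeholder, not a real stemmer.

```python
def format_stems(words, stem):
    """Return 'word<TAB>stem' lines; stem is any callable mapping a word to its stem."""
    return ["{}\t{}".format(word, stem(word)) for word in words]


if __name__ == "__main__":
    try:
        from stemming.porter2 import stem  # same import as the script above
    except ImportError:
        stem = str.lower                   # placeholder only, NOT a real stemmer
    for line in format_stems(["Many", "challenges", "involve"], stem):
        print(line)
```

The same effect is available directly with print(wordx, stem(wordx), sep="\t").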

53. Tokenization

Use Stanford Core NLP to obtain the analysis result of the input text in XML format. Then read this XML file and output the input text with one word per line.

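As usual, installing Stanford Core NLP was a struggle.
I installed it with reference to this article:

Stanford CoreNLP を Python から使う方法まとめ (a summary of ways to use Stanford CoreNLP from Python)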
http://shirokai.hatenablog.com/entry/corenlp-python

Under python3, stanford_corenlp_pywrapper does not work properly, so I tried corenlp-python instead, but pexpect would not work there either and I gave up. I run it directly from java.

java -cp stanford-corenlp-3.6.0.jar:stanford-corenlp-3.6.0-models.jar:xom.jar:joda-time.jar:slf4j-api.jar:jollyday.jar -Xmx3g edu.stanford.nlp.pipeline.StanfordCoreNLP -annotators tokenize,ssplit,pos,lemma,ner,parse,dcoref -file nlp.txt

Processing finished after about 30 minutes.

#!/usr/bin/env python

import codecs
import re

fin = codecs.open('nlp.txt.out', 'r', 'utf_8')
word = ""

if __name__ == "__main__":
  n = 0 
  for line in fin:
    if n == 50:
      break
    word = re.findall(r'<word>.*</word>',line)
    if word:
      print(word[0][6:-7]) 
      n = n + 1

Result

Natural
language
processing
From
Wikipedia
,
the
free
encyclopedia
(rest omitted)

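In the output, the periods and commas are to be expected, but tokens such as "-LRB-" and "-RRB-" have also appeared. These stand for (Left|Right) Round Bracket.
Until now I had not treated periods, commas, or parentheses as tokens in their own right, but I will let that pass.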
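
Since the output is XML, an XML parser can be used instead of matching tags line by line. The following is a rough sketch with the standard library's xml.etree.ElementTree. SAMPLE_XML is a hand-written fragment that only imitates the <word> markup (the real nlp.txt.out has more levels and fields), and iter_words is a name of my own. One practical difference from slicing the raw line is that the parser decodes entities such as &amp;.

```python
import xml.etree.ElementTree as ET

# Hand-written sample, NOT real CoreNLP output; it only mimics the <word> tags.
SAMPLE_XML = """<root><tokens>
  <token id="1"><word>Natural</word></token>
  <token id="2"><word>-LRB-</word></token>
  <token id="3"><word>R&amp;D</word></token>
</tokens></root>"""


def iter_words(root):
    """Yield the text of every <word> element under root, in document order."""
    for element in root.iter("word"):
        yield element.text


if __name__ == "__main__":
    root = ET.fromstring(SAMPLE_XML)
    # For the real file, something like: root = ET.parse("nlp.txt.out").getroot()
    for word in iter_words(root):
        print(word)
```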

54. Part-of-speech tagging

Read the XML analysis result from Stanford Core NLP and output the word, lemma, and part of speech in tab-separated format.

A single token is represented like this:

          <token id="1">
            <word>Natural</word>
            <lemma>natural</lemma>
            <CharacterOffsetBegin>0</CharacterOffsetBegin>
            <CharacterOffsetEnd>7</CharacterOffsetEnd>
            <POS>JJ</POS>
            <NER>O</NER>
            <Speaker>PER0</Speaker>
          </token>

<word> is the word, <lemma> is the lemma, and <POS> is the part of speech.

#!/usr/bin/env python

import codecs
import copy
import re

fin = codecs.open('nlp.txt.out', 'r', 'utf_8')
src = []
token = {}
word = ""

if __name__ == "__main__":
  n = 0 
  for line in fin:
    if n == 50:
      break
    word = re.findall(r'<word>.*</word>',line)
    lemma = re.findall(r'<lemma>.*</lemma>',line)
    POS = re.findall(r'<POS>.*</POS>',line)
    if word:
      token['word']= word[0][6:-7]
    if lemma:
      token['lemma']= lemma[0][7:-8]
    if POS:
      token['POS']= POS[0][5:-6]
      src.append(copy.deepcopy(token))
      token = {}
      n = n + 1

  for tokenx in src:
    print(tokenx['word'],"\t",tokenx['lemma'],"\t",tokenx['POS'])

Result

Natural 	 natural 	 JJ
language 	 language 	 NN
processing 	 processing 	 NN
From 	 from 	 IN
Wikipedia 	 Wikipedia 	 NNP
, 	 , 	 ,
the 	 the 	 DT
free 	 free 	 JJ
encyclopedia 	 encyclopedia 	 NN
(rest omitted)
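
The same extraction can also be sketched per <token> element with xml.etree.ElementTree, which avoids tracking state across lines. This is a sketch under assumptions: SAMPLE_XML is a shortened, hand-written version of the token markup shown earlier, and token_rows is a hypothetical helper. Joining with "\t" gives strictly tab-separated fields, without the spaces that print adds around the tab in the script above.

```python
import xml.etree.ElementTree as ET

# Shortened, hand-written sample shaped like the <token> element shown above.
SAMPLE_XML = """<tokens>
  <token id="1">
    <word>Natural</word>
    <lemma>natural</lemma>
    <POS>JJ</POS>
  </token>
  <token id="2">
    <word>language</word>
    <lemma>language</lemma>
    <POS>NN</POS>
  </token>
</tokens>"""


def token_rows(root):
    """Return 'word<TAB>lemma<TAB>POS' for each <token> element under root."""
    rows = []
    for token in root.iter("token"):
        # findtext returns the default when a child element is missing
        fields = [token.findtext(tag, default="") for tag in ("word", "lemma", "POS")]
        rows.append("\t".join(fields))
    return rows


if __name__ == "__main__":
    for row in token_rows(ET.fromstring(SAMPLE_XML)):
        print(row)
```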

北野坂備忘録

Mostly notes on installation and programming.