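言語処理100本ノック (NLP 100 Exercises), 2015 edition (74-77)

74. Prediction

Using the logistic regression model trained in 73, implement a program that computes the polarity label of a given sentence ("+1" for a positive example, "-1" for a negative example) and its prediction probability.

　This time we use the weight vector to compute the polarity label of a sentence.
　The sentence to classify is written to 74test.txt beforehand.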

#!/usr/bin/env python

import codecs
import re
import collections
import math
from nltk import stem

fsen = codecs.open('72.txt', 'r', 'latin_1')
stopwords = []

eta0 = 0.66
etan = 0.9999
guard = 0.0002

def checkstopwords(word,stopwords):
  return word in stopwords

def sigmoid(x):
  return 1.0 / (1.0 + math.exp(-x))

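def update(W, features, label, eta):
  # one stochastic-gradient step of logistic regression on a single example
  a = sum([W[x] for x in features])
  init_feature = 1  # each listed feature has the value 1
  predict = sigmoid(a)
  label = ( label + 1) / 2  # map the label from {-1, +1} to {0, 1}

  for x in features:
    dif = eta * ( predict -label ) * init_feature
    # skip updates that would leave the weight within +/-guard of zero
    if (W[x] - dif) > guard or ( W[x] - dif) < (guard * -1):
      W[x] = W[x] - dif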

if __name__ == "__main__":

  # build the weight vector
  t = 0
  W = collections.defaultdict(float)
  for line in fsen:
    features = line[:-1].split(" ")
    update(W, features[1:], float(features[0]), eta0 * ( etan **t))
    t += 1
  features = []
 
  # load the stop words
  fstop = codecs.open('stoplist.txt', 'r', 'latin_1')
  stopwords = [ line[:-1] for line in fstop]

  # load the sentence to classify
  fin = codecs.open('74test.txt', 'r', 'latin_1')
  lemmatizer = stem.WordNetLemmatizer()

  for line in fin:
    string = re.compile(r'[,.:;\s]').split(line)
    for word in string:
      if not checkstopwords(word,stopwords):
        feature = lemmatizer.lemmatize(word)
        features.append(feature)
    print(features[1:])

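  # weighted sum over the sentence's features, squashed by the sigmoid
  a = sum([W[x] for x in features[1:]])
  predict = sigmoid(a)
  # rescale the probability p in (0, 1) to a score 2p - 1 in (-1, 1);
  # its sign gives the label, and p can be recovered as (predict + 1) / 2
  predict = (predict * 2) - 1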
  if predict > 0:
    predictlabel = "+1"
  elif predict < 0:
    predictlabel = "-1"
  else:
    predictlabel = "0"
  print("label",predictlabel,"\tpredict:",predict)

Result

['campy', 'result', 'mel', "brooks'", 'borscht', 'belt', 'schtick', 'look', 'sophisticated']
label -1 	predict: -0.6651139676000253
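
The value printed above is the rescaled score 2p - 1 rather than the probability itself: -0.665 corresponds to p = P(+1) of about 0.167, that is, a probability of about 0.833 for the predicted label -1. Below is a minimal sketch that reports the label together with its probability. It assumes a weight dictionary like W; the name predict_with_probability and the toy weights are illustrative and not part of the original script.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def predict_with_probability(weights, features):
    """Return (label, probability of that label) for one sentence."""
    # Unknown features contribute 0, as they would with a defaultdict(float).
    a = sum(weights.get(f, 0.0) for f in features)
    p_pos = sigmoid(a)  # P(label = +1)
    # An exact tie goes to "+1" here; this is a choice made for the sketch.
    if p_pos >= 0.5:
        return "+1", p_pos
    return "-1", 1.0 - p_pos


# Hypothetical weights, for illustration only.
toy_weights = {"enjoyable": 2.0, "dull": -3.0}
label, prob = predict_with_probability(toy_weights, ["dull", "movie"])
print(label, prob)
```

Returning the probability of the predicted label keeps the value in the range 0.5 to 1, which is easier to read as a confidence than the signed score.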

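75. Feature weights

Among the features of the logistic regression model trained in 73, check the 10 features with the highest weights and the 10 features with the lowest weights.

I use Unix commands for this.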
cut -f1,2 73.txt | sort -k2 -n -r > 73sort.txt
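
The same ranking can also be done in Python. A minimal sketch, assuming the weights are already in a dictionary such as W from the training script; top_and_bottom and the toy weights are illustrative names and values, not part of the original.

```python
import heapq


def top_and_bottom(weights, k=10):
    """Return the k highest-weighted and k lowest-weighted (feature, weight) pairs."""
    def by_weight(item):
        return item[1]

    highest = heapq.nlargest(k, weights.items(), key=by_weight)
    lowest = heapq.nsmallest(k, weights.items(), key=by_weight)
    return highest, lowest


# Hypothetical weights, for illustration only.
toy_weights = {"good": 1.5, "fine": 0.2, "bad": -1.1, "awful": -2.4}
highest, lowest = top_and_bottom(toy_weights, k=2)
print(highest)
print(lowest)
```

heapq avoids sorting the whole vocabulary when only a few entries from each end are needed.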

Top 10 features with the highest weights

enjoyable 2.6169537969272794
powerful 2.3787951567849857
wonderful 2.252616193768806
help 2.2161462435618797
entertaining 2.1012476394052886
unexpected 2.06721146103483
capture 2.0487298445584528
provides 2.007309645804019
engrossing 1.9273056982929437
our 1.8582726739138051

Top 10 features with the lowest weights

fails -2.8814245455525906
dull -2.832095888121216
worst -2.356167001806389
boring -2.3536406580509825
none -2.033909838886688
flat -2.013097610029052
video -1.8959064537156256
badly -1.8830296620760196
lack -1.8803267135777417

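　Judging by the words, the weights look reasonable.
　Still, video comes out surprisingly low...
　Looking at the label counts, it is pos:21 vs neg:72, so the negative weight is consistent with the data. In this corpus, video mostly appears in negative reviews.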
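
A count like pos:21 vs neg:72 can be obtained by scanning the feature file. A minimal sketch, assuming each line has the layout used by the scripts here (label first, then space-separated features). Whether the figures above count lines or occurrences is not stated; this sketch counts lines. count_by_label and the sample lines are illustrative.

```python
import collections


def count_by_label(lines, word):
    """Count the lines containing `word`, grouped by their label."""
    counts = collections.Counter()
    for line in lines:
        fields = line.rstrip("\n").split(" ")
        label, features = fields[0], fields[1:]
        if word in features:
            counts[label] += 1
    return counts


# Made-up lines in the "label feature feature ..." layout.
sample = [
    "+1 great video\n",
    "-1 straight video dull\n",
    "-1 video bad\n",
    "+1 wonderful film\n",
]
counts = count_by_label(sample, "video")
print("pos:", counts["+1"], "neg:", counts["-1"])
```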

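76. Labeling

Apply the logistic regression model to the training data and output the correct label, the predicted label, and the prediction probability in tab-separated format.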

#!/usr/bin/env python

import codecs
import re
import collections
import math

fsen = codecs.open('72.txt', 'r', 'latin_1')

eta0 = 0.66
etan = 0.9999
guard = 0.0002

def sigmoid(x):
  return 1.0 / (1.0 + math.exp(-x))

def update(W, features, label, eta):
  a = sum([W[x] for x in features])
  init_feature = 1 
  predict = sigmoid(a)
  label = ( label + 1) / 2

  for x in features:
    dif = eta * ( predict -label ) * init_feature
    if (W[x] - dif) > guard or ( W[x] - dif) < (guard * -1):
      W[x] = W[x] - dif

if __name__ == "__main__":
  t = 0
  W = collections.defaultdict(float)

  n = 0
  for line in fsen:
    
    features = line[:-1].split(" ")
    update(W, features[1:], float(features[0]), eta0 * ( etan **t))
    t += 1
    n += 1

  fsen = codecs.open('72.txt', 'r', 'latin_1')
  for line in fsen:
    features = line[:-1].split(" ")
    a = sum([W[x] for x in features[1:]])
    predict = sigmoid(a)
    predict = (predict * 2) - 1
    predictlabel = "+1" if predict > 0 else "-1"  
    print(features[0],"\t",predictlabel,"\t",predict)

Result

+1 	 -1 	 -0.45434815960064157
+1 	 +1 	 0.9839176217919916
-1 	 -1 	 -0.0724876620476631
+1 	 +1 	 0.9204549983820602
+1 	 +1 	 0.9008372084478018
+1 	 +1 	 0.9298453364568531
-1 	 -1 	 -0.7628898147320603
+1 	 +1 	 0.8194284936043321
(rest omitted)

　It stings a little that the very first prediction is already wrong.

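77. Measuring accuracy

Write a program that takes the output of 76 and computes the prediction accuracy, along with the precision, recall, and F1 score for positive examples.

　It is a little confusing, but the definitions are:

Precision for positive examples: positive examples predicted as positive / examples predicted as positive
Recall: positive examples predicted as positive / actual positive examples
F1 score: the harmonic mean of precision and recall
　( 2 * precision * recall ) / ( precision + recall )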
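
These definitions translate directly into code. A minimal sketch, assuming the gold and predicted labels are available as two equal-length lists of "+1"/"-1" strings; positive_metrics is an illustrative name, and the zero-denominator guards are an addition for the sketch.

```python
def positive_metrics(gold, pred):
    """Return (accuracy, precision, recall, f1) with "+1" as the positive class."""
    pairs = list(zip(gold, pred))
    correct = sum(1 for g, p in pairs if g == p)
    true_pos = sum(1 for g, p in pairs if g == "+1" and p == "+1")
    predicted_pos = sum(1 for _, p in pairs if p == "+1")
    actual_pos = sum(1 for g, _ in pairs if g == "+1")

    # Guard every division so empty or one-sided input does not raise.
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return accuracy, precision, recall, f1


gold = ["+1", "+1", "-1", "-1", "+1"]
pred = ["+1", "-1", "-1", "+1", "+1"]
print(positive_metrics(gold, pred))
```

In this toy input two of the three predicted positives are correct and two of the three actual positives are found, so precision, recall, and F1 all come out to 2/3.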

#!/usr/bin/env python

import codecs
import re
import collections
import math

fsen = codecs.open('72.txt', 'r', 'latin_1')

eta0 = 0.66
etan = 0.9999
guard = 0.0002

def sigmoid(x):
  return 1.0 / (1.0 + math.exp(-x))

def update(W, features, label, eta):
  a = sum([W[x] for x in features])
  init_feature = 1 
  predict = sigmoid(a)
  label = ( label + 1) / 2

  for x in features:
    dif = eta * ( predict -label ) * init_feature
    if (W[x] - dif) > guard or ( W[x] - dif) < (guard * -1):
      W[x] = W[x] - dif

if __name__ == "__main__":
  t = 0
  W = collections.defaultdict(float)

  for line in fsen:
    features = line[:-1].split(" ")
    update(W, features[1:], float(features[0]), eta0 * ( etan **t))
    t += 1

  countp = 0
  countn = 0
  zero   = 0
  good   = 0
  bad    = 0
  exapos = 0
  trupos = 0
  fsen = codecs.open('72.txt', 'r', 'latin_1')
  for line in fsen:
    features = line[:-1].split(" ")
    a = sum([W[x] for x in features[1:]])
    predict = sigmoid(a)
    predict = (predict * 2) - 1
    predictlabel = "+1" if predict > 0 else "-1"
    print(features[0],"\t",predictlabel,"\t",predict)
    # number of positive / negative predictions
    if predictlabel == "+1":
      countp += 1
    else:
      countn += 1
    # accuracy: correct vs. incorrect predictions
    if features[0] == predictlabel :
      good += 1
    else :
      bad += 1
    # true positives (numerator of both precision and recall)
    if features[0] == "+1" and predictlabel == "+1" :
      exapos += 1
    # number of actual positive examples
    if features[0] == "+1":
      trupos += 1

  accuracy_rate = good / ( good + bad)
  precision = exapos / countp
  recall = exapos / trupos
  f1 = ( 2 * precision * recall ) / ( precision + recall ) 
  print("p:",countp," n:",countn)
  print("good:",good," bad:",bad)
  print("accuracy_rate:", accuracy_rate )
  print("precision:", precision )
  print("recall:", recall )
  print("f1:", f1)

Result

p: 5425  n: 5270
good: 9640  bad: 1055
accuracy_rate: 0.9013557737260403
precision: 0.895483870967742
recall: 0.9087167976056865
f1: 0.9020518057747655
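
The task statement asks for a program that takes the output of 76, whereas the script above retrains the model and scores it in one pass. Below is a sketch of the former. It assumes lines made of gold label, predicted label, and score, separated by tabs with optional surrounding spaces, as in the output of 76. parse_predictions is an illustrative name, and the sample lines are the first three lines of that output.

```python
import collections


def parse_predictions(lines):
    """Yield (gold, predicted) label pairs from the tab-separated output of 76."""
    for line in lines:
        fields = [f.strip() for f in line.split("\t")]
        if len(fields) < 2 or fields[0] not in ("+1", "-1"):
            continue  # skip blank or malformed lines
        yield fields[0], fields[1]


sample_output = [
    "+1 \t -1 \t -0.45434815960064157\n",
    "+1 \t +1 \t 0.9839176217919916\n",
    "-1 \t -1 \t -0.0724876620476631\n",
]
# Confusion counts keyed by (gold, predicted).
confusion = collections.Counter(parse_predictions(sample_output))
true_pos = confusion[("+1", "+1")]
false_neg = confusion[("+1", "-1")]
false_pos = confusion[("-1", "+1")]
true_neg = confusion[("-1", "-1")]
print(true_pos, false_pos, false_neg, true_neg)
```

To consume a pipe instead of a list, sys.stdin can be passed as lines; accuracy, precision, recall, and F1 then follow from the four counts.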

I'll stop here for now.
