100 Language Processing Knocks (言語処理100本ノック), 2015 edition (85-89)

(Addendum) I botched this part rather spectacularly, but I took another run at it at a later date:
kenichia.hatenablog.com

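85. Dimensionality reduction with principal component analysis

Apply principal component analysis to the word-context matrix obtained in 84, and compress the word meaning vectors to 300 dimensions.

Compress the word meaning vectors to 300 dimensions?
That's word2vec!
So should I just use Spark...?
No. This is programming practice, so presumably the point is to implement it myself rather than reach for Spark. In other words, the exercise is making me do by hand what Spark would do for me.
That's what I thought, and then Chapter 10 turned out to be word2vec itself.
With the later problems in mind, I use the data from the whole [a-zA-Z][a-zA-Z] matrix of files, not just aa.txt as before.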

aa ab ac .. aZ
ba bb bc .. bZ
ca cb cc .. cZ
.. .. .. .. ..
Za Zb Zc .. ZZ

cat aa.txt ... ZZ.txt > azAZ.txt
That came to 63,678,277 lines of data.
Next, compute each of the counts needed (word-context pairs, words, and context words).

cat azAZ.txt | sort -k 1,2 |uniq -c|sort -nr > azAZlc.txt
sed -i 's/^\s*//g' azAZlc.txt
cut -f 1 azAZ.txt | sort -k 1 |uniq -c|sort -nr > azAZtc.txt
sed -i 's/^\s*//g' azAZtc.txt
cut -f 2 azAZ.txt | sort -k 1 |uniq -c|sort -nr > azAZcc.txt
sed -i 's/^\s*//g' azAZcc.txt

Finally, extract the pairs that occur 10 or more times from azAZlc.txt into azAZlc10.txt.
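The filtering step itself is not shown. Here is a minimal Python sketch of one way to do it, assuming each line of azAZlc.txt looks like "count word<TAB>context" (the output of uniq -c once sed has stripped the leading spaces). The function names are hypothetical.

```python
def filter_min_count(lines, min_count=10):
    """Yield only the lines whose leading count is at least min_count.

    Assumes each line starts with an integer count followed by a space,
    as in the output of `uniq -c` with the leading spaces removed.
    """
    for line in lines:
        count = line.split(" ", 1)[0]
        if count.isdigit() and int(count) >= min_count:
            yield line


def write_filtered(src="azAZlc.txt", dst="azAZlc10.txt", min_count=10):
    # File names follow the text above; adjust them to your own layout.
    with open(src, encoding="utf-8") as fin, \
         open(dst, "w", encoding="utf-8") as fout:
        fout.writelines(filter_min_count(fin, min_count))
```

Calling write_filtered() streams the file line by line, so it should cope with an input of this size without holding it all in memory.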

Result (8584.txt)

of      the      0.6498982431410926
the     in       0.26905021196949763
to      the      0.11514694105843744
of      a        0.1950959541456405
for     the      0.20330228665922276
on      the      0.30716946275958495
a       in       0.081033546667135
by      the      0.120290911408584
with    the      0.03644493062394107
(omitted)

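Principal component analysis, huh... I'd rather use R... (I'm doing this to improve my Python skills, so that would defeat the purpose.)
The word meaning vectors look like this:

*	a	and	as	..
a	a&a	a&and	a&as	..
and	and&a	and&and	and&as	..
as	as&a	as&and	as&as	..
..	..	..	..	..

Each row is a word and each column is a context word.
The task this time is to take the context words along the top as the dimensions and compress each word on the left down to 300 dimensions.
For PCA in Python, scikit-learn is the obvious choice. I proceed on the assumption that the exercise is not asking me to use Spark.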

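I want a sparse matrix, so first I turn the word-context matrix (8584.txt) into a nested dictionary.
{a:{a:1,b:2,c:3,...},and:{a:1,b:2,c:3,...},...}
That is, each word gets a dictionary of "context word: value" entries.
I use sklearn.feature_extraction.DictVectorizer, a class that takes dicts of feature name / feature value pairs and converts them into a sparse matrix (scipy.sparse.csr_matrix).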
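To make the nested-dictionary idea concrete, here is a small sketch of DictVectorizer on made-up data (the words and numbers are invented for illustration). It also shows one thing worth keeping in mind: values have to be numbers, because DictVectorizer turns string values into one-hot categorical features.

```python
from sklearn.feature_extraction import DictVectorizer

# Toy rows: one {context word: value} dict per target word.
words = ["a", "and"]
rows = [
    {"the": 0.5, "of": 1.2},
    {"the": 0.3, "in": 0.7},
]

vectorizer = DictVectorizer(sparse=True)
matrix = vectorizer.fit_transform(rows)  # scipy sparse matrix, one row per word

print(matrix.shape)               # rows = words, columns = distinct context words
print(vectorizer.feature_names_)  # column order (sorted feature names)
print(matrix.toarray())           # fine for a toy matrix, not for the real one

# A string value is treated as a category, not as a number:
as_strings = DictVectorizer().fit([{"the": "0.5"}])
print(as_strings.feature_names_)  # the feature becomes "the=0.5"
```

So when the values are read from a text file, converting them with float() before handing the dicts to DictVectorizer keeps them as weights.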
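However, running PCA naively in my environment gave a memory error after reading about 30,000 of the roughly 250,000 rows of meaning vectors.
So I turned the matrix into a memmap, following this article:
http://kensuke-mi.xyz/kensuke-mi_diary/2014/09/numpy-memmap.html
With that I could just about get through those 30,000 rows, but now toarray() fails: it either raises a memory error or is still unresponsive after six hours. Is there some way to do the processing while keeping the matrix sparse?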
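The memmap code is not reproduced here, so the following is only a sketch of the general idea under my own assumptions, not the linked article's method: copy a sparse matrix into a disk-backed numpy.memmap a few rows at a time, so that only one chunk is ever dense in RAM. The function name and chunk size are hypothetical.

```python
import os
import tempfile

import numpy as np
from scipy import sparse


def sparse_to_memmap(matrix, path, chunk_rows=1000):
    """Write a scipy sparse matrix to a disk-backed dense array in row chunks."""
    matrix = matrix.tocsr()
    dense = np.memmap(path, dtype=np.float32, mode="w+", shape=matrix.shape)
    for start in range(0, matrix.shape[0], chunk_rows):
        stop = min(start + chunk_rows, matrix.shape[0])
        # Only chunk_rows rows are densified in memory at any one time.
        dense[start:stop] = matrix[start:stop].toarray()
    dense.flush()
    return dense


# Toy sparse matrix standing in for the real word-context matrix.
toy = sparse.random(50, 20, density=0.1, format="csr", random_state=0)
path = os.path.join(tempfile.mkdtemp(), "dense.dat")
dense = sparse_to_memmap(toy, path, chunk_rows=16)
print(dense.shape, np.allclose(dense, toy.toarray()))
```

This avoids one giant toarray() call, but the file on disk is still rows x columns x 4 bytes, and any estimator that wants the whole dense array in memory will still struggle.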
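Things I tried:
From sklearn.decomposition:
  PCA
  IncrementalPCA
  RandomizedPCA
All of these need a dense array, which means getting stuck at toarray(). RandomizedPCA used to accept sparse matrices, but no longer does.
Could I get away with singular value decomposition instead of principal component analysis...?
Hence TruncatedSVD. It is very similar to PCA (it just skips the mean-centering step), but it is not PCA.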
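How close the two are can be checked on a toy matrix. The sketch below uses random data rather than the real word-context matrix: PCA centers each column before decomposing, TruncatedSVD does not, so the two agree on data that is already centered and differ otherwise.

```python
import numpy as np
from scipy import sparse
from sklearn.decomposition import PCA, TruncatedSVD

rng = np.random.RandomState(0)
dense = rng.rand(40, 8)                 # toy data, invented for illustration
centered = dense - dense.mean(axis=0)   # subtract the column means by hand

pca = PCA(n_components=3, svd_solver="full").fit(dense)
svd_centered = TruncatedSVD(n_components=3, algorithm="arpack").fit(centered)
# TruncatedSVD accepts a sparse matrix directly, with no toarray() needed.
svd_raw = TruncatedSVD(n_components=3, algorithm="arpack").fit(
    sparse.csr_matrix(dense))

# Compare singular values: equal once the data is centered, different on raw data.
print(np.allclose(pca.singular_values_, svd_centered.singular_values_))
print(np.allclose(pca.singular_values_, svd_raw.singular_values_))
```

Centering a sparse matrix would make it dense, which is why TruncatedSVD can stay sparse while PCA cannot.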
Unfortunately, even this was killed while processing 595859 rows. The only option left is to cut down the original data.
I halve the number of lines with awk.
awk 'NR%2==1' 8584.txt.org > 8584.txt

#!/usr/bin/env python

import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.decomposition import TruncatedSVD

worddic = {}  # word -> {context word: value}

if __name__ == "__main__":
  # Read the word-context matrix (word <TAB> context <TAB> value) into a nested dict.
  with open('8584.txt', 'r', encoding='utf_8') as fin:
    for line in fin:
      fields = line.rstrip("\n").split("\t")
      context = fields[1].strip()
      # Convert to float: DictVectorizer one-hot encodes string values
      # as categorical features instead of using them as numbers.
      value = float(fields[2])
      worddic.setdefault(fields[0], {})[context] = value

  # Keep the words and their context dicts in the same order.
  keylist = list(worddic.keys())
  vectorlist = list(worddic.values())

  vec = DictVectorizer(sparse=True)
  array_vectors = vec.fit_transform(vectorlist)  # stays sparse
  tsvd = TruncatedSVD(n_components=300)
  word_pca = tsvd.fit_transform(array_vectors)

  # One line per word: the word followed by its 300 components.
  with open('85vec.txt', 'w', encoding='utf_8') as fout:
    for key, row in zip(keylist, word_pca):
      fout.write(key + " ")
      for m in row:
        fout.write(str(np.round(m, 6)) + " ")
      fout.write("\n")

It worked. I ended up with meaning vectors for 25,330 words.

Result (85vec.txt)

Wikipedia 0.036607 0.061894 -0.074259 -0.343874 -0.302429 0.030418 -0.022152 0.408875 0.716387 -0.03836 -0.13187 -0.105147 -0.136955 -0.117247 -0.142686 -0.038827 -0.067474 0.012758 0.035452 -0.034669 0.022411 0.018073 -0.042144 0.015336 0.003743 -0.005312 -0.010827 0.009409 0.006696 8.5e-05 -0.015763 -0.00563 0.001709 0.005587 0.004114 -0.000124 -0.003196 -0.011369 -0.006669 0.012481 0.007403 -0.004698 -0.010032 0.001192 -0.001672 -0.003834 0.004263 -0.005571 -0.008854 0.005557 
(omitted)

86. Displaying a word vector

Read the word meaning vectors obtained in 85 and display the vector for "United States". Note that "United States" is represented internally as "United_States".

United_States 0.532173 0.780891 -0.517928 -1.431203 -0.586933 -0.183612 0.113353 -1.113398 -1.190547 0.156925 -0.237816 -0.719958 -0.645294 -0.815169 -1.181049 -0.094529 -0.298039 -0.318876 -0.512742 -0.834399 0.312021 -0.147455 -0.520111 0.279153 -0.358644 0.235239 0.060402 -1.085956 0.278782 -0.323167 -0.671406 -0.492222 0.670645 0.139045 -0.300085 -0.201374 -0.008507 -0.024414 -0.542766 1.036288 0.666017 0.053405 -0.262103 -0.307392 -0.017193 -0.59792
(omitted)
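No script is shown for this one. A minimal sketch of one way to do the lookup, assuming 85vec.txt has one word per line followed by its space-separated components (the format written in 85). The names load_vectors and print_vector are hypothetical.

```python
import numpy as np


def load_vectors(lines):
    """Parse lines of the form "word v1 v2 ... vN" into {word: np.ndarray}."""
    vectors = {}
    for line in lines:
        fields = line.split()
        if len(fields) < 2:
            continue  # skip blank or malformed lines
        vectors[fields[0]] = np.array([float(x) for x in fields[1:]])
    return vectors


def print_vector(word, path="85vec.txt"):
    """Print `word` followed by its vector, read from the file at `path`."""
    with open(path, encoding="utf-8") as fin:
        vectors = load_vectors(fin)
    print(word, *vectors[word])
```

print_vector("United_States") would then print the word and its components on one line. Loading everything into a dict once is also convenient for the later problems, which look up several words.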

87. Word similarity

Read the word meaning vectors obtained in 85 and compute the cosine similarity between "United States" and "U.S.". Note that "U.S." is represented internally as "U.S".

#!/usr/bin/env python

import numpy as np

def getvec(word):
  # Scan 85vec.txt for the line that starts with `word` and return its vector.
  with open('85vec.txt', 'r', encoding='utf_8') as fin:
    for line in fin:
      fields = line.rstrip("\n").split(" ")
      if fields[0] == word:
        # Each line ends with a trailing space, so drop empty fields.
        return np.array([float(x) for x in fields[1:301] if x])
  raise KeyError(word)  # fail loudly if the word is missing

if __name__ == "__main__":
  npa = getvec("United_States")
  npb = getvec("U.S")
  cos_sim = np.dot(npa, npb) / (np.linalg.norm(npa) * np.linalg.norm(npb))
  print(cos_sim)

0.1330308992
Yikes, that's pretty low... Was singular value decomposition the wrong choice after all... or is there simply not enough data?

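88. The 10 most similar words

Read the word meaning vectors obtained in 85, and output the 10 words with the highest cosine similarity to "England" together with their similarity.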

#!/usr/bin/env python

import numpy as np

def cos_sim(x, y):
  return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def parse(line):
  # Split one line of 85vec.txt into (word, vector).
  fields = line.rstrip("\n").split(" ")
  return fields[0], np.array([float(x) for x in fields[1:301] if x])

if __name__ == "__main__":
  npa = None
  with open('85vec.txt', 'r', encoding='utf_8') as fin:
    for line in fin:  # find the vector for "England"
      word, vector = parse(line)
      if word == "England":
        npa = vector
        break
  if npa is None:
    raise SystemExit("England not found in 85vec.txt")

  with open('85vec.txt', 'r', encoding='utf_8') as fin:
    for line in fin:  # similarity against every word; zero vectors give nan
      word, npb = parse(line)
      print(word, "{0:f}".format(cos_sim(npa, npb)))

I saved the output as 88.txt, deleted the nan lines (the first command below is run inside vim), and then sorted.

:%g/nan$/d
cut -d ' ' -f 1,2 88.txt | sort -k 2 -nr > 88sort.txt
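As for where the nan lines come from: presumably some words end up with an all-zero vector, so the cosine formula divides 0 by 0. The following is a sketch (the function name is hypothetical) of a cosine similarity that reports that case explicitly, so those words can be skipped in Python instead of being cleaned up afterwards.

```python
import numpy as np


def safe_cos_sim(x, y):
    """Cosine similarity that returns None when either vector has zero norm."""
    denom = np.linalg.norm(x) * np.linalg.norm(y)
    if denom == 0.0:
        return None  # undefined: the plain formula would compute 0/0 = nan
    return float(np.dot(x, y) / denom)


print(safe_cos_sim(np.array([1.0, 0.0]), np.array([1.0, 0.0])))  # 1.0
print(safe_cos_sim(np.array([1.0, 0.0]), np.array([0.0, 0.0])))  # None
```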

Result (88sort.txt)

mansion 0.882204
brought 0.305319
short 0.296156
regular 0.289981
numerous 0.288348
Europe 0.284525
reported 0.276977
TV 0.274960
whom 0.274658

Hmm, the accuracy is pretty bad. What on earth does mansion have to do with England...?

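89. Analogy by additive compositionality

Read the word meaning vectors obtained in 85, compute vec("Spain") - vec("Madrid") + vec("Athens"), and output the 10 words most similar to that vector together with their similarity.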

#!/usr/bin/env python

import numpy as np

def cos_sim(x, y):
  return np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))

def parse(line):
  # Split one line of 85vec.txt into (word, vector).
  fields = line.rstrip("\n").split(" ")
  return fields[0], np.array([float(x) for x in fields[1:301] if x])

def getvec(word):
  # Return the vector for `word`, or fail loudly if it is missing.
  with open('85vec.txt', 'r', encoding='utf_8') as fin:
    for line in fin:
      key, vector = parse(line)
      if key == word:
        return vector
  raise KeyError(word)

if __name__ == "__main__":
  npd = getvec("Spain") - getvec("Madrid") + getvec("Athens")

  with open('85vec.txt', 'r', encoding='utf_8') as fin:
    for line in fin:  # similarity of the query vector to every word
      word, npb = parse(line)
      print(word, "{0:f}".format(cos_sim(npd, npb)))

As before, I saved the output (as 89.txt), removed the nan lines, and sorted.

:%g/nan$/d
cut -d ' ' -f 1,2 89.txt | sort -k 2 -nr > 89sort.txt
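The print, delete nan, sort pipeline can also be done in one pass inside Python. Here is a sketch on invented 2-dimensional toy vectors (not the real 85vec.txt data), where top_k_similar is a hypothetical helper: compute all cosine similarities at once, skip zero vectors, and take the top k.

```python
import numpy as np


def top_k_similar(words, matrix, query, k=10):
    """Return the k (word, cosine) pairs most similar to `query`.

    `matrix` has one row per entry of `words`. Rows with zero norm are
    skipped, since cosine similarity is undefined for them.
    """
    norms = np.linalg.norm(matrix, axis=1)
    valid = norms > 0
    sims = np.full(len(words), -np.inf)
    sims[valid] = matrix[valid] @ query / (norms[valid] * np.linalg.norm(query))
    order = np.argsort(-sims)[:k]
    return [(words[i], float(sims[i])) for i in order if valid[i]]


# Toy vectors chosen so that the analogy works out; purely illustrative.
words = ["Spain", "Madrid", "Athens", "Greece", "zero"]
matrix = np.array([[2.0, 1.0], [1.0, 1.0], [1.0, 3.0], [2.0, 3.0], [0.0, 0.0]])
index = {w: i for i, w in enumerate(words)}

query = matrix[index["Spain"]] - matrix[index["Madrid"]] + matrix[index["Athens"]]
print(top_k_similar(words, matrix, query, k=3))
```

With real vectors, the three query words themselves tend to rank highly (Spain tops the list in the result here too), so it is common to exclude them from the candidates before reading off the answer.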

Result (89sort.txt)

Spain 0.768587
Music 0.593657
crime 0.518666
Italy 0.482300
Japan 0.482161
emergency 0.480686
Fire 0.480611
division 0.480581
centuries 0.480228
media 0.477803

What I want to see here is Greece, but it doesn't show up.
Unfortunately, Greece was much further down.
Greece -0.000892

Chapter 9 really did not go well. Maybe it's because I could only feed in so little data, or maybe TruncatedSVD just isn't up to the job...
