pythonでEMアルゴリズムによる混合ガウス分布推定やってみた

こんにちは。

今回は、pythonを使ってEMアルゴリズムによる混合ガウス分布推定やってみました。

まず、EMアルゴリズムってなんすか？って方はこちらの記事を参考にしてみてください。

ちなみに数学的に理解するには尤度関数とか偏微分とか果てはイェンセンの不等式(私もこれは知りませんでした)とかが身についてないと理解できないです、それが苦痛な人は数学的理解はひとまず置いといて、概念的になにやってるのかを自分の口で説明できればそれでいいんじゃない？と思います。

自分的に説明するならば、正規分布がとある割合で混ざってて、それの比率を取り出すためにパラメータ推定するもの、ってイメージですね。

ひとまず結果見せた方が早いと思うのでやってみますよ

まずは正規分布まぜこぜなものを作成

import numpy as np
import matplotlib.pyplot as plt
from sklearn import mixture

dist1 = np.random.normal(0,1,75000)
dist2 = np.random.normal(3,0.5,25000)
dist = np.r_[dist1, dist2]

plt.hist(dist,bins = 100)
plt.title('正規分布 N(0,1)とN(3,0.5)を7500対2500で混合した分布')

f:id:doinakadoctor:20190912174354p:plain

こんなかんじっすね

ではGaussianMixtureっていうモジュールを使って解析してみましょう

dist1 = np.reshape(dist,(np.shape(dist)[0],1))
max_n_components = 8
lowest_bic = np.infty
lowest_bic_ix = 0

for i in range(1,max_n_components):
    clf = mixture.GaussianMixture(n_components = i, covariance_type = 'full')
    clf.fit(dist1)
    bic = clf.bic(dist1)
    print('bic',i,bic)
    if bic < lowest_bic:
        lowest_bic = bic
        lowest_bic_ix = i
print ('lowest  bic case no', lowest_bic_ix)

clf = mixture.GaussianMixture(n_components = lowest_bic_ix, covariance_type = 'full')
clf.fit(dist1)
print('weights:', clf.weights_)
print('means:', clf.means_)
print('std dev :', np.sqrt(clf.covariances_))