K平均法によるクラスタの作成

今回やること

前回の階層型クラスタに続き，K平均法クラスタを作成します．入力データも"blogdata.txt"を引き続き用います．

KmeansClusterクラスの作成

K平均法クラスタのソースを示します．

module My
  class KmeansCluster
    LOOP_MAX = 100

    def initialize(word_counts, user_options = {})
      @word_counts = word_counts
      @min_and_max = {}
      @centroids = {}
      @cluster = Hash.new { |hash, key| hash[key] = [] }
      @options = {
        :centroids => 4
      }.merge(user_options)
    end

    # 重心をランダム値で初期化してから，相関の近いURLを所属させる
    # 相関の近いURLの平均値を新たな重心の値とし，再計算する
    # もし重心が前回と計算した結果と同じになったら，最適解と見なす
    def make_cluster
      @min_and_max = min_and_max_in_word_counts
      @centroids = random_centroids

      loop_counter = 0
      old_centroids = nil
      until (@centroids == old_centroids) or (LOOP_MAX < loop_counter)
        puts "Iteration #{loop_counter}"
        loop_counter += 1
        attach_urls_to_nearest_centroid
        old_centroids = Marshal.load(Marshal.dump(@centroids)) # ディープコピー

        # 重心に所属するURLがない場合は何もしない
        @centroids.each_key do |centroid|
          @centroids[centroid] = average_attached(centroid) if @cluster[centroid].any?
        end
      end
    end

    def print_cluster
      pp @cluster.values
    end

    private

    # 全てのURLにおいて各単語の出現回数のうち最大と最小のものを算出する
    def min_and_max_in_word_counts
      all_counts = Hash.new { |hash, key| hash[key] = [] }
      min_and_max = {}

      @word_counts.each do |url, counts|
        counts.each do |word, count|
          all_counts[word] << count
        end
      end

      all_counts.each do |word, counts|
        min_and_max[word] = Pair.new [counts.min, counts.max]
      end
      min_and_max
    end

    # 各単語の最大値と最小値の間でランダム値を生成する
    def random_centroids
      centroids = {}

      @options[:centroids].times do |centroid|
        random_counts = {}
        @min_and_max.each do |word, counts|
          random_counts[word] = rand(counts.max - counts.min) + counts.min
        end
        centroids[centroid] = random_counts
      end
      centroids
    end

    # 最も相関の近い重心に所属させる
    # 計算前に，一旦クラスタを空にする
    def attach_urls_to_nearest_centroid
      @cluster.clear

      @word_counts.each_key do |url|
        @cluster[nearest_centroid(url)] << url
      end
    end

    # 各重心に対して，URLごとの相関値を計算し，最も相関が高い重心を返す
    # 相関値は0に近い値が最も相関が大きくなるように"1 - ピアソン相関"とする
    def nearest_centroid(url)
      correlations = @centroids.map { |centroid, centroid_word_count|
        # valuesメソッドを使うと順序が保証されないので，別変数にvalueをそれぞれ格納する
        web_counts = []
        centroid_counts = []
        @word_counts[url].each do |word, count|
          web_counts << count
          centroid_counts << centroid_word_count[word]
        end
        1 - Pearson.calc(web_counts, centroid_counts)
      }

      correlations.rindex(correlations.min { |x, y| x.abs <=> y.abs })
    end

    # 任意の重心に属する全URLのword_count平均を計算する
    # URLごとのword_countをtransposeで単語単位に変換している
    def average_attached(centroid)
      average_word_counts = @cluster[centroid].map { |url|
        @centroids[centroid].keys.map { |word| @word_counts[url][word] }
      }.transpose.map { |all_counts|
        all_counts.inject(0) { |sum, count| sum + count }.quo(all_counts.size)
      }

      Hash[*@centroids[centroid].keys.zip(average_word_counts).flatten]
    end
  end
end

処理の流れ

make_clusterメソッド内でメインループを司る処理を行っています．

def make_cluster
  ...
  until (@centroids == old_centroids) or (LOOP_MAX < loop_counter)
    ...
    attach_urls_to_nearest_centroid
    old_centroids = Marshal.load(Marshal.dump(@centroids))

    @centroids.each_key do |centroid|
      @centroids[centroid] = average_attached(centroid) if @cluster[centroid].any?
    end
  end
end

@centroidsに重心(の持つデータ)が格納されており，until構文の中で重心を移動させながら，
前回との誤差が0になるまで繰り返します．

ディープコピー

先ほどさりげなく出てきたこの文，何やってるか分かりますか?

old_centroids = Marshal.load(Marshal.dump(@centroids))

見出しにあるようにディープコピーをしています．
通常，Rubyではオブジェクトをコピーする際にObject#dupで複製を得ますが，dupメソッドでは浅い(Shallow)コピーしかできません．
@centroidsはHashの要素にArrayを持ったりしているため，dupではコピーしきれないのです．

そこで，Marshalでdumpとloadを行いディープコピーします．
本来Marshalはオブジェクトの永続化とかに使うのですが，こういう使い方もあるらしいです．

動作確認

require 'pair'
require 'pearson'

cluster = My::KmeansCluster.new(blog_data_from('blogdata.txt'), { :centroids => 4 })
cluster.make_cluster
cluster.print_cluster

KmeansClusterを実行すると↓のようになります．
(今回もとても長いので，非表示にしてあります)

Iteration 0
Iteration 1
Iteration 2
Iteration 3
Iteration 4
Iteration 5
[["Quick Online Tips",
  "MAKE Magazine",
  "Lifehacker",
  "Pharyngula",
  "Copyblogger",
  "Joel on Software",
  "ProBlogger Blog Tips",
  "Autoblog",
  "PaulStamatiou.com",
  "Mashable!",
  "Treehugger",
  "Cool Hunting",
  "ScienceBlogs : Combined Feed",
  "43 Folders",
  "Slashdot",
  "Steve Pavlina's Personal Development Blog",
  "Wired News: Top Stories",
  "456 Berea Street",
  "Signum sine tinnitu--by Guy Kawasaki",
  "TMZ.com",
  "Online Marketing Report",
  "SimpleBits"],
 ["Joystiq",
  "CoolerHeads Prevail",
  "Engadget",
  "Blog Maverick",
  "TechEBlog",
  "Download Squad",
  "The Unofficial Apple Weblog (TUAW)"],
 ["Read/WriteWeb",
  "Micro Persuasion",
  "Bloglines | News",
  "Techdirt",
  "Publishing 2.0",
  "Search Engine Roundtable",
  "Search Engine Watch Blog",
  "Valleywag",
  "Google Operating System",
  "A Consuming Experience (full feed)",
  "Google Blogoscoped",
  "O'Reilly Radar",
  "Topix.net Weblog",
  "Official Google Blog",
  "John Battelle's Searchblog",
  "TechCrunch",
  "GigaOM",
  "Matt Cutts: Gadgets, Google, and SEO"],
 ["Kotaku",
  "Michelle Malkin",
  "ongoing",
  "Deadspin",
  "Schneier on Security",
  "Wonkette",
  "Neil Gaiman's Journal",
  "Seth's Blog",
  "NewsBusters.org - Exposing Liberal Media Bias",
  "Creating Passionate Users",
  "Power Line",
  "Bloggers Blog: Blogging the Blogsphere",
  "Sifry's Alerts",
  "Shoemoney - Skills to pay the bills",
  "Signal vs. Noise",
  "SpikedHumor",
  "Oilman",
  "gapingvoid: \"cartoons drawn on the back of business cards\"",
  "Joi Ito's Web",
  "lifehack.org",
  "Joho the Blog",
  "Andrew Sullivan | The Daily Dish",
  "Instapundit.com",
  "Scobleizer - Tech Geek Blogger",
  "Boing Boing",
  "kottke.org",
  "plasticbag.org",
  "Derek Powazek",
  "Eschaton",
  "Think Progress",
  "Daily Kos",
  "The Viral Garden",
  "Talking Points Memo: by Joshua Micah Marshall",
  "we make money not art",
  "Gothamist",
  "Gizmodo",
  "Crooks and Liars",
  "Go Fug Yourself",
  "Gawker",
  "WWdN: In Exile",
  "BuzzMachine",
  "flagrantdisregard",
  "Little Green Footballs",
  "Captain's Quarters",
  "The Huffington Post | Raw Feed",
  "The Superficial - Because You're Ugly",
  "Jeremy Zawodny's blog",
  "MetaFilter",
  "The Blotter",
  "Hot Air",
  "PerezHilton.com",
  "Dave Shea's mezzoblue"]]

重心はランダムに配置されるので，毎回この結果になるとは限りません．ご注意を．