都元ダイスケ IT-PRESS

Lately I only write on my company's blog.

2011-03-20

Clustering with Mahout this time (source code edition)

To read the Mahout series from the beginning, start with "Apache Mahoutで機械学習してみるべ - 都元ダイスケ IT-PRESS". The previous post is "今度はMahoutでクラスタリング - 都元ダイスケ IT-PRESS".

Preparation

First, the mvn dependency settings. As before, mahout-core is required. In addition,*1 add slf4j and logback*2, plus commons-io*3.

pom.xml
    <dependency>
      <groupId>org.apache.mahout</groupId>
      <artifactId>mahout-core</artifactId>
      <version>0.4</version>
    </dependency>

    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>slf4j-api</artifactId>
      <version>${lib.slf4j.version}</version>
    </dependency>
    <dependency>
      <groupId>org.slf4j</groupId>
      <artifactId>jcl-over-slf4j</artifactId>
      <version>${lib.slf4j.version}</version>
    </dependency>
    <dependency>
      <groupId>ch.qos.logback</groupId>
      <artifactId>logback-core</artifactId>
      <version>${lib.logback.version}</version>
    </dependency>
    <dependency>
      <groupId>ch.qos.logback</groupId>
      <artifactId>logback-classic</artifactId>
      <version>${lib.logback.version}</version>
    </dependency>
  
    <dependency>
      <groupId>commons-io</groupId>
      <artifactId>commons-io</artifactId>
      <version>2.0</version>
    </dependency>
  
...

  <properties>
    <lib.slf4j.version>1.6.0</lib.slf4j.version>
    <lib.logback.version>0.9.21</lib.logback.version>
  </properties>
logback.xml

Next, place a logging configuration file like the following directly under src/main/resources.

<?xml version="1.0" encoding="UTF-8"?>
<configuration>
  <appender name="STDOUT" class="ch.qos.logback.core.ConsoleAppender">
    <Target>System.out</Target>
    <layout class="ch.qos.logback.classic.PatternLayout">
      <Pattern>%d{HH:mm:ss.SSS} [%thread] %-5level %logger{36} - %msg%n</Pattern>
    </layout>
  </appender>

  <root>
    <level value="INFO" />
    <appender-ref ref="STDOUT" />
  </root>
</configuration>

Java source

Now we finally get to the essential part. For the sample code, let's cluster the 3D vectors from the end of the previous post.

First, prepare the set of vectors to cluster. Here we use the nine 3D vectors from the previous post.

static final double[][] points = {
    {8, 8, 8}, {8, 7.5, 9}, {7.7, 7.5, 9.8},
    {0, 7.5, 9}, {0.1, 8, 8}, {-1, 9, 7.5},
    {9, -1, -0.8}, {7.7, -1.2, -0.1}, {8.2, 0.2, 0.2},
};

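This time we use a technique called k-means clustering. With this technique you must decide the value of k in advance, that is, how many clusters to produce in the end. Here we set k = 3 and proceed on the premise of producing three clusters.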
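To make the idea concrete before Hadoop enters the picture, here is a minimal plain-Java sketch of the k-means loop on the same nine points. It is not Mahout's implementation: the class name KMeansSketch and its methods are hypothetical, and it assumes that the first k points seed the clusters (as the writeClustersToFile method in this post does), with a convergence threshold of 0.001 and at most 10 iterations.

```java
import java.util.Arrays;

public class KMeansSketch {

    // The nine 3D points used in this post.
    static final double[][] POINTS = {
        {8, 8, 8}, {8, 7.5, 9}, {7.7, 7.5, 9.8},
        {0, 7.5, 9}, {0.1, 8, 8}, {-1, 9, 7.5},
        {9, -1, -0.8}, {7.7, -1.2, -0.1}, {8.2, 0.2, 0.2},
    };

    // Euclidean distance between two vectors of equal length.
    static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Index of the center closest to the given point (lowest index wins ties).
    static int nearest(double[] point, double[][] centers) {
        int best = 0;
        for (int c = 1; c < centers.length; c++) {
            if (distance(point, centers[c]) < distance(point, centers[best])) {
                best = c;
            }
        }
        return best;
    }

    // Returns the cluster index assigned to each point.
    static int[] cluster(double[][] points, int k, double delta, int maxIterations) {
        int dim = points[0].length;
        double[][] centers = new double[k][];
        for (int c = 0; c < k; c++) {
            centers[c] = points[c].clone(); // assumption: the first k points are the seeds
        }
        for (int iter = 0; iter < maxIterations; iter++) {
            // Assignment step: accumulate the points that fall into each cluster.
            double[][] sums = new double[k][dim];
            int[] counts = new int[k];
            for (double[] p : points) {
                int c = nearest(p, centers);
                counts[c]++;
                for (int i = 0; i < dim; i++) {
                    sums[c][i] += p[i];
                }
            }
            // Update step: move each center to the mean of its points.
            boolean converged = true;
            for (int c = 0; c < k; c++) {
                if (counts[c] == 0) {
                    continue; // an empty cluster keeps its old center
                }
                double[] mean = new double[dim];
                for (int i = 0; i < dim; i++) {
                    mean[i] = sums[c][i] / counts[c];
                }
                if (distance(mean, centers[c]) > delta) {
                    converged = false;
                }
                centers[c] = mean;
            }
            if (converged) {
                break; // no center moved by more than delta
            }
        }
        int[] assignment = new int[points.length];
        for (int p = 0; p < points.length; p++) {
            assignment[p] = nearest(points[p], centers);
        }
        return assignment;
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(cluster(POINTS, 3, 0.001, 10)));
    }
}
```

Running main prints one cluster index per point. With these seeds, the first three points should end up in cluster 1, the next three in cluster 2 and the last three in cluster 0, even though all three seeds start inside the same group.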

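Mahout's clustering brings in Hadoop right from the start. That said, you do not need to set up a Hadoop cluster; it can run in standalone mode. Either way, the nine vectors to be clustered and the three initial clusters must be placed as files on HDFS beforehand. (In standalone mode with a default Configuration, FileSystem.get(conf) resolves to the local file system, so the target/... paths below are ordinary local paths.) The writePointsToFile and writeClustersToFile methods take care of this.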

Then we run the clustering. The computation reads its data from HDFS and writes its results back to HDFS as well. Therefore, after the computation, readClusteredPointsFromFile reads the results back from HDFS.

public static void main(String args[]) throws Exception {
    int k = 3;
    List<Vector> vectors = getPoints(points);
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
   
    // Write the vectors and the initial clusters to HDFS
    writePointsToFile(vectors, "target/input/points/file1", fs, conf);
    writeClustersToFile(k, vectors, "target/input/clusters/part-00000", fs, conf);
   
    // Run the clustering
    Path pointsPath = new Path("target/input/points");
    Path clustersPath = new Path("target/input/clusters");
    Path outputPath = new Path("target/output");
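    // Arguments after the distance measure: convergence delta, max iterations,
    // runClustering (assign points once iteration ends), runSequential (false = MapReduce)
    KMeansDriver.run(conf, pointsPath, clustersPath, outputPath,
            new EuclideanDistanceMeasure(), 0.001, 10, true, false);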
    
    // Read the clustering results back from HDFS and print them to the console
    readClusteredPointsFromFile(fs, conf);
}

static List<Vector> getPoints(double[][] raw) {
    List<Vector> points = new ArrayList<Vector>();
    for (double[] fr : raw) {
        Vector vec = new RandomAccessSparseVector(fr.length);
        vec.assign(fr);
        points.add(vec);
    }
    return points;
}

static void writePointsToFile(List<Vector> points, String fileName, FileSystem fs, Configuration conf)
        throws IOException {
    Path path = new Path(fileName);
    SequenceFile.Writer writer = null;
    try {
        writer = new SequenceFile.Writer(fs, conf, path, LongWritable.class, VectorWritable.class);
        long recNum = 0;
        VectorWritable vec = new VectorWritable();
        for (Vector point : points) {
            vec.set(point);
            writer.append(new LongWritable(recNum++), vec);
        }
    } finally {
        IOUtils.closeQuietly(writer);
    }
}

static void writeClustersToFile(int k, List<Vector> vectors, String fileName, FileSystem fs, Configuration conf)
        throws IOException {
    Path path = new Path(fileName);
    SequenceFile.Writer writer = null;
    try {
        writer = new SequenceFile.Writer(fs, conf, path, Text.class, Cluster.class);
        for (int i = 0; i < k; i++) {
            Vector vec = vectors.get(i);
            Cluster cluster = new Cluster(vec, i, new EuclideanDistanceMeasure());
            writer.append(new Text(cluster.getIdentifier()), cluster);
        }
    } finally {
        IOUtils.closeQuietly(writer);
    }
}

static void readClusteredPointsFromFile(FileSystem fs, Configuration conf) throws IOException {
    Path path = new Path("target/output/" + Cluster.CLUSTERED_POINTS_DIR + "/part-m-00000");
    SequenceFile.Reader reader = null;
    try {
        reader = new SequenceFile.Reader(fs, path, conf);
        IntWritable key = new IntWritable();
        WeightedVectorWritable value = new WeightedVectorWritable();
        while (reader.next(key, value)) {
            System.out.println(value.toString() + " belongs to cluster " + key.toString());
        }
    } finally {
        IOUtils.closeQuietly(reader);
    }
}

For reference, here are the imports. There are surprisingly many classes that share the same simple name.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.commons.io.IOUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.mahout.clustering.WeightedVectorWritable;
import org.apache.mahout.clustering.kmeans.Cluster;
import org.apache.mahout.clustering.kmeans.KMeansDriver;
import org.apache.mahout.common.distance.EuclideanDistanceMeasure;
import org.apache.mahout.math.RandomAccessSparseVector;
import org.apache.mahout.math.Vector;
import org.apache.mahout.math.VectorWritable;

Results

The clustering results are as follows. You can see that each vector has been assigned to one of cluster 0 through cluster 2.

1.0: [8.000, 8.000, 8.000] belongs to cluster 1
1.0: [8.000, 7.500, 9.000] belongs to cluster 1
1.0: [7.700, 7.500, 9.800] belongs to cluster 1
1.0: [1:7.500, 2:9.000] belongs to cluster 2
1.0: [0.100, 8.000, 8.000] belongs to cluster 2
1.0: [-1.000, 9.000, 7.500] belongs to cluster 2
1.0: [9.000, -1.000, -0.800] belongs to cluster 0
1.0: [7.700, -1.200, -0.100] belongs to cluster 0
1.0: [8.200, 0.200, 0.200] belongs to cluster 0
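
As a quick sanity check on this grouping, the center of each cluster can be recomputed by hand as the mean of its member points. The sketch below does that in plain Java. CentroidSketch and its fields are hypothetical names, the assignment array is copied from the output above, and nothing here is read from Mahout's own output files.

```java
import java.util.Arrays;

public class CentroidSketch {

    // The nine input points, in the same order as the output above.
    static final double[][] POINTS = {
        {8, 8, 8}, {8, 7.5, 9}, {7.7, 7.5, 9.8},
        {0, 7.5, 9}, {0.1, 8, 8}, {-1, 9, 7.5},
        {9, -1, -0.8}, {7.7, -1.2, -0.1}, {8.2, 0.2, 0.2},
    };

    // Cluster index of each point, copied from the "belongs to cluster" lines.
    static final int[] ASSIGNMENT = {1, 1, 1, 2, 2, 2, 0, 0, 0};

    // Mean of the member points of each cluster.
    static double[][] centroids(double[][] points, int[] assignment, int k) {
        int dim = points[0].length;
        double[][] centers = new double[k][dim];
        int[] counts = new int[k];
        for (int p = 0; p < points.length; p++) {
            int c = assignment[p];
            counts[c]++;
            for (int i = 0; i < dim; i++) {
                centers[c][i] += points[p][i];
            }
        }
        for (int c = 0; c < k; c++) {
            for (int i = 0; i < dim; i++) {
                if (counts[c] > 0) {
                    centers[c][i] /= counts[c]; // turn the sum into a mean
                }
            }
        }
        return centers;
    }

    public static void main(String[] args) {
        double[][] centers = centroids(POINTS, ASSIGNMENT, 3);
        for (int c = 0; c < centers.length; c++) {
            System.out.println("cluster " + c + " center: " + Arrays.toString(centers[c]));
        }
    }
}
```

The means come out at roughly (8.3, -0.67, -0.23) for cluster 0, (7.9, 7.67, 8.93) for cluster 1 and (-0.3, 8.17, 8.17) for cluster 2, which matches the three visibly separate groups in the input.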

For reference, the log that scrolls by before the results appear looks like this. You can see that the work runs as Hadoop jobs.

22:16:11.733 [main] INFO  o.a.m.clustering.kmeans.KMeansDriver - Input: target/input/points Clusters In: target/input/clusters Out: target/output Distance: org.apache.mahout.common.distance.EuclideanDistanceMeasure

22:16:11.738 [main] INFO  o.a.m.clustering.kmeans.KMeansDriver - convergence: 0.0010 max Iterations: 10 num Reduce Tasks: org.apache.mahout.math.VectorWritable Input Vectors: {}
22:16:11.739 [main] INFO  o.a.m.clustering.kmeans.KMeansDriver - K-Means Iteration 1
22:16:11.768 [main] INFO  o.a.hadoop.metrics.jvm.JvmMetrics - Initializing JVM Metrics with processName=JobTracker, sessionId=
22:16:11.878 [main] INFO  org.apache.mahout.common.HadoopUtil - Deleting target/output/clusters-1
22:16:11.885 [main] WARN  org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
22:16:12.497 [main] INFO  o.a.h.m.lib.input.FileInputFormat - Total input paths to process : 1
22:16:12.781 [main] INFO  org.apache.hadoop.mapred.JobClient - Running job: job_local_0001
22:16:12.787 [Thread-14] INFO  o.a.h.m.lib.input.FileInputFormat - Total input paths to process : 1
22:16:12.880 [Thread-14] INFO  org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
22:16:17.532 [main] INFO  org.apache.hadoop.mapred.JobClient -  map 0% reduce 0%
22:16:17.534 [Thread-14] INFO  org.apache.hadoop.mapred.MapTask - data buffer = 79691776/99614720
22:16:17.535 [Thread-14] INFO  org.apache.hadoop.mapred.MapTask - record buffer = 262144/327680
22:16:17.646 [Thread-14] INFO  org.apache.hadoop.mapred.MapTask - Starting flush of map output
22:16:18.047 [Thread-14] INFO  org.apache.hadoop.mapred.MapTask - Finished spill 0
22:16:18.051 [Thread-14] INFO  org.apache.hadoop.mapred.TaskRunner - Task:attempt_local_0001_m_000000_0 is done. And is in the process of commiting
22:16:18.055 [Thread-14] INFO  o.a.hadoop.mapred.LocalJobRunner - 
22:16:18.055 [Thread-14] INFO  org.apache.hadoop.mapred.TaskRunner - Task 'attempt_local_0001_m_000000_0' done.
22:16:18.064 [Thread-14] INFO  o.a.hadoop.mapred.LocalJobRunner - 
22:16:18.072 [Thread-14] INFO  org.apache.hadoop.mapred.Merger - Merging 1 sorted segments
22:16:18.087 [Thread-14] INFO  org.apache.hadoop.mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 239 bytes
22:16:18.087 [Thread-14] INFO  o.a.hadoop.mapred.LocalJobRunner - 
22:16:18.184 [Thread-14] INFO  org.apache.hadoop.mapred.TaskRunner - Task:attempt_local_0001_r_000000_0 is done. And is in the process of commiting
22:16:18.185 [Thread-14] INFO  o.a.hadoop.mapred.LocalJobRunner - 
22:16:18.186 [Thread-14] INFO  org.apache.hadoop.mapred.TaskRunner - Task attempt_local_0001_r_000000_0 is allowed to commit now
22:16:18.190 [Thread-14] INFO  o.a.h.m.l.output.FileOutputCommitter - Saved output of task 'attempt_local_0001_r_000000_0' to target/output/clusters-1
22:16:18.191 [Thread-14] INFO  o.a.hadoop.mapred.LocalJobRunner - reduce > reduce
22:16:18.192 [Thread-14] INFO  org.apache.hadoop.mapred.TaskRunner - Task 'attempt_local_0001_r_000000_0' done.
22:16:18.535 [main] INFO  org.apache.hadoop.mapred.JobClient -  map 100% reduce 100%
22:16:18.535 [main] INFO  org.apache.hadoop.mapred.JobClient - Job complete: job_local_0001
22:16:18.537 [main] INFO  org.apache.hadoop.mapred.JobClient - Counters: 13
22:16:18.537 [main] INFO  org.apache.hadoop.mapred.JobClient -   Clustering
22:16:18.538 [main] INFO  org.apache.hadoop.mapred.JobClient -     Converged Clusters=1
22:16:18.538 [main] INFO  org.apache.hadoop.mapred.JobClient -   FileSystemCounters
22:16:18.538 [main] INFO  org.apache.hadoop.mapred.JobClient -     FILE_BYTES_READ=2741232
22:16:18.539 [main] INFO  org.apache.hadoop.mapred.JobClient -     FILE_BYTES_WRITTEN=2792502
22:16:18.539 [main] INFO  org.apache.hadoop.mapred.JobClient -   Map-Reduce Framework
22:16:18.539 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce input groups=3
22:16:18.540 [main] INFO  org.apache.hadoop.mapred.JobClient -     Combine output records=3
22:16:18.540 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map input records=9
22:16:18.541 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce shuffle bytes=0
22:16:18.541 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce output records=3
22:16:18.541 [main] INFO  org.apache.hadoop.mapred.JobClient -     Spilled Records=6
22:16:18.542 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map output bytes=675
22:16:18.542 [main] INFO  org.apache.hadoop.mapred.JobClient -     Combine input records=9
22:16:18.543 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map output records=9
22:16:18.543 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce input records=3
22:16:18.547 [main] INFO  o.a.m.clustering.kmeans.KMeansDriver - K-Means Iteration 2
22:16:18.548 [main] INFO  o.a.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
22:16:18.576 [main] INFO  org.apache.mahout.common.HadoopUtil - Deleting target/output/clusters-2
22:16:18.578 [main] WARN  org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
22:16:19.072 [main] INFO  o.a.h.m.lib.input.FileInputFormat - Total input paths to process : 1
22:16:19.630 [main] INFO  org.apache.hadoop.mapred.JobClient - Running job: job_local_0002
22:16:19.632 [Thread-28] INFO  o.a.h.m.lib.input.FileInputFormat - Total input paths to process : 1
22:16:20.622 [Thread-28] INFO  org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
22:16:20.719 [main] INFO  org.apache.hadoop.mapred.JobClient -  map 0% reduce 0%
22:16:22.272 [Thread-28] INFO  org.apache.hadoop.mapred.MapTask - data buffer = 79691776/99614720
22:16:22.273 [Thread-28] INFO  org.apache.hadoop.mapred.MapTask - record buffer = 262144/327680
22:16:22.321 [Thread-28] INFO  org.apache.hadoop.mapred.MapTask - Starting flush of map output
22:16:22.323 [Thread-28] INFO  org.apache.hadoop.mapred.MapTask - Finished spill 0
22:16:22.326 [Thread-28] INFO  org.apache.hadoop.mapred.TaskRunner - Task:attempt_local_0002_m_000000_0 is done. And is in the process of commiting
22:16:22.327 [Thread-28] INFO  o.a.hadoop.mapred.LocalJobRunner - 
22:16:22.327 [Thread-28] INFO  org.apache.hadoop.mapred.TaskRunner - Task 'attempt_local_0002_m_000000_0' done.
22:16:22.358 [Thread-28] INFO  o.a.hadoop.mapred.LocalJobRunner - 
22:16:22.360 [Thread-28] INFO  org.apache.hadoop.mapred.Merger - Merging 1 sorted segments
22:16:22.360 [Thread-28] INFO  org.apache.hadoop.mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 239 bytes
22:16:22.361 [Thread-28] INFO  o.a.hadoop.mapred.LocalJobRunner - 
22:16:22.428 [Thread-28] INFO  org.apache.hadoop.mapred.TaskRunner - Task:attempt_local_0002_r_000000_0 is done. And is in the process of commiting
22:16:22.429 [Thread-28] INFO  o.a.hadoop.mapred.LocalJobRunner - 
22:16:22.430 [Thread-28] INFO  org.apache.hadoop.mapred.TaskRunner - Task attempt_local_0002_r_000000_0 is allowed to commit now
22:16:22.434 [Thread-28] INFO  o.a.h.m.l.output.FileOutputCommitter - Saved output of task 'attempt_local_0002_r_000000_0' to target/output/clusters-2
22:16:22.435 [Thread-28] INFO  o.a.hadoop.mapred.LocalJobRunner - reduce > reduce
22:16:22.436 [Thread-28] INFO  org.apache.hadoop.mapred.TaskRunner - Task 'attempt_local_0002_r_000000_0' done.
22:16:23.265 [main] INFO  org.apache.hadoop.mapred.JobClient -  map 100% reduce 100%
22:16:23.266 [main] INFO  org.apache.hadoop.mapred.JobClient - Job complete: job_local_0002
22:16:23.266 [main] INFO  org.apache.hadoop.mapred.JobClient - Counters: 12
22:16:23.267 [main] INFO  org.apache.hadoop.mapred.JobClient -   FileSystemCounters
22:16:23.267 [main] INFO  org.apache.hadoop.mapred.JobClient -     FILE_BYTES_READ=5484503
22:16:23.267 [main] INFO  org.apache.hadoop.mapred.JobClient -     FILE_BYTES_WRITTEN=5583630
22:16:23.267 [main] INFO  org.apache.hadoop.mapred.JobClient -   Map-Reduce Framework
22:16:23.268 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce input groups=3
22:16:23.268 [main] INFO  org.apache.hadoop.mapred.JobClient -     Combine output records=3
22:16:23.268 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map input records=9
22:16:23.269 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce shuffle bytes=0
22:16:23.269 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce output records=3
22:16:23.269 [main] INFO  org.apache.hadoop.mapred.JobClient -     Spilled Records=6
22:16:23.269 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map output bytes=675
22:16:23.269 [main] INFO  org.apache.hadoop.mapred.JobClient -     Combine input records=9
22:16:23.269 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map output records=9
22:16:23.270 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce input records=3
22:16:23.273 [main] INFO  o.a.m.clustering.kmeans.KMeansDriver - K-Means Iteration 3
22:16:23.274 [main] INFO  o.a.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
22:16:23.289 [main] INFO  org.apache.mahout.common.HadoopUtil - Deleting target/output/clusters-3
22:16:23.291 [main] WARN  org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
22:16:23.496 [main] INFO  o.a.h.m.lib.input.FileInputFormat - Total input paths to process : 1
22:16:24.679 [Thread-41] INFO  o.a.h.m.lib.input.FileInputFormat - Total input paths to process : 1
22:16:24.690 [main] INFO  org.apache.hadoop.mapred.JobClient - Running job: job_local_0003
22:16:24.729 [Thread-41] INFO  org.apache.hadoop.mapred.MapTask - io.sort.mb = 100
22:16:25.043 [Thread-41] INFO  org.apache.hadoop.mapred.MapTask - data buffer = 79691776/99614720
22:16:25.044 [Thread-41] INFO  org.apache.hadoop.mapred.MapTask - record buffer = 262144/327680
22:16:25.101 [Thread-41] INFO  org.apache.hadoop.mapred.MapTask - Starting flush of map output
22:16:25.103 [Thread-41] INFO  org.apache.hadoop.mapred.MapTask - Finished spill 0
22:16:25.106 [Thread-41] INFO  org.apache.hadoop.mapred.TaskRunner - Task:attempt_local_0003_m_000000_0 is done. And is in the process of commiting
22:16:25.107 [Thread-41] INFO  o.a.hadoop.mapred.LocalJobRunner - 
22:16:25.107 [Thread-41] INFO  org.apache.hadoop.mapred.TaskRunner - Task 'attempt_local_0003_m_000000_0' done.
22:16:25.113 [Thread-41] INFO  o.a.hadoop.mapred.LocalJobRunner - 
22:16:25.114 [Thread-41] INFO  org.apache.hadoop.mapred.Merger - Merging 1 sorted segments
22:16:25.115 [Thread-41] INFO  org.apache.hadoop.mapred.Merger - Down to the last merge-pass, with 1 segments left of total size: 239 bytes
22:16:25.115 [Thread-41] INFO  o.a.hadoop.mapred.LocalJobRunner - 
22:16:25.190 [Thread-41] INFO  org.apache.hadoop.mapred.TaskRunner - Task:attempt_local_0003_r_000000_0 is done. And is in the process of commiting
22:16:25.191 [Thread-41] INFO  o.a.hadoop.mapred.LocalJobRunner - 
22:16:25.191 [Thread-41] INFO  org.apache.hadoop.mapred.TaskRunner - Task attempt_local_0003_r_000000_0 is allowed to commit now
22:16:25.195 [Thread-41] INFO  o.a.h.m.l.output.FileOutputCommitter - Saved output of task 'attempt_local_0003_r_000000_0' to target/output/clusters-3
22:16:25.196 [Thread-41] INFO  o.a.hadoop.mapred.LocalJobRunner - reduce > reduce
22:16:25.196 [Thread-41] INFO  org.apache.hadoop.mapred.TaskRunner - Task 'attempt_local_0003_r_000000_0' done.
22:16:25.702 [main] INFO  org.apache.hadoop.mapred.JobClient -  map 100% reduce 100%
22:16:25.703 [main] INFO  org.apache.hadoop.mapred.JobClient - Job complete: job_local_0003
22:16:25.704 [main] INFO  org.apache.hadoop.mapred.JobClient - Counters: 13
22:16:25.704 [main] INFO  org.apache.hadoop.mapred.JobClient -   Clustering
22:16:25.705 [main] INFO  org.apache.hadoop.mapred.JobClient -     Converged Clusters=3
22:16:25.705 [main] INFO  org.apache.hadoop.mapred.JobClient -   FileSystemCounters
22:16:25.705 [main] INFO  org.apache.hadoop.mapred.JobClient -     FILE_BYTES_READ=8227859
22:16:25.706 [main] INFO  org.apache.hadoop.mapred.JobClient -     FILE_BYTES_WRITTEN=8374758
22:16:25.706 [main] INFO  org.apache.hadoop.mapred.JobClient -   Map-Reduce Framework
22:16:25.706 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce input groups=3
22:16:25.706 [main] INFO  org.apache.hadoop.mapred.JobClient -     Combine output records=3
22:16:25.707 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map input records=9
22:16:25.707 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce shuffle bytes=0
22:16:25.707 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce output records=3
22:16:25.708 [main] INFO  org.apache.hadoop.mapred.JobClient -     Spilled Records=6
22:16:25.709 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map output bytes=675
22:16:25.709 [main] INFO  org.apache.hadoop.mapred.JobClient -     Combine input records=9
22:16:25.710 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map output records=9
22:16:25.710 [main] INFO  org.apache.hadoop.mapred.JobClient -     Reduce input records=3
22:16:25.713 [main] INFO  o.a.m.clustering.kmeans.KMeansDriver - Clustering data
22:16:25.714 [main] INFO  o.a.m.clustering.kmeans.KMeansDriver - Running Clustering
22:16:25.714 [main] INFO  o.a.m.clustering.kmeans.KMeansDriver - Input: target/input/points Clusters In: target/output/clusters-3 Out: target/output/clusteredPoints Distance: org.apache.mahout.common.distance.EuclideanDistanceMeasure@343a9d95
22:16:25.714 [main] INFO  o.a.m.clustering.kmeans.KMeansDriver - convergence: 0.0010 Input Vectors: org.apache.mahout.math.VectorWritable
22:16:25.714 [main] INFO  o.a.hadoop.metrics.jvm.JvmMetrics - Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
22:16:25.730 [main] INFO  org.apache.mahout.common.HadoopUtil - Deleting target/output/clusteredPoints
22:16:25.732 [main] WARN  org.apache.hadoop.mapred.JobClient - Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
22:16:25.932 [main] INFO  o.a.h.m.lib.input.FileInputFormat - Total input paths to process : 1
22:16:26.259 [main] INFO  org.apache.hadoop.mapred.JobClient - Running job: job_local_0004
22:16:26.270 [Thread-54] INFO  o.a.h.m.lib.input.FileInputFormat - Total input paths to process : 1
22:16:26.404 [Thread-54] INFO  org.apache.hadoop.mapred.TaskRunner - Task:attempt_local_0004_m_000000_0 is done. And is in the process of commiting
22:16:26.405 [Thread-54] INFO  o.a.hadoop.mapred.LocalJobRunner - 
22:16:26.405 [Thread-54] INFO  org.apache.hadoop.mapred.TaskRunner - Task attempt_local_0004_m_000000_0 is allowed to commit now
22:16:26.410 [Thread-54] INFO  o.a.h.m.l.output.FileOutputCommitter - Saved output of task 'attempt_local_0004_m_000000_0' to target/output/clusteredPoints
22:16:26.411 [Thread-54] INFO  o.a.hadoop.mapred.LocalJobRunner - 
22:16:26.411 [Thread-54] INFO  org.apache.hadoop.mapred.TaskRunner - Task 'attempt_local_0004_m_000000_0' done.
22:16:27.261 [main] INFO  org.apache.hadoop.mapred.JobClient -  map 100% reduce 0%
22:16:27.261 [main] INFO  org.apache.hadoop.mapred.JobClient - Job complete: job_local_0004
22:16:27.262 [main] INFO  org.apache.hadoop.mapred.JobClient - Counters: 5
22:16:27.262 [main] INFO  org.apache.hadoop.mapred.JobClient -   FileSystemCounters
22:16:27.262 [main] INFO  org.apache.hadoop.mapred.JobClient -     FILE_BYTES_READ=5484682
22:16:27.262 [main] INFO  org.apache.hadoop.mapred.JobClient -     FILE_BYTES_WRITTEN=5581897
22:16:27.262 [main] INFO  org.apache.hadoop.mapred.JobClient -   Map-Reduce Framework
22:16:27.263 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map input records=9
22:16:27.263 [main] INFO  org.apache.hadoop.mapred.JobClient -     Spilled Records=0
22:16:27.263 [main] INFO  org.apache.hadoop.mapred.JobClient -     Map output records=9

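*1: The following libraries are my personal preference, so they are not required.

*2: Because I use it to write the logging configuration file the way I like it. You can do without it, in which case the logback.xml shown below is not needed. However, the log output shown further below will then appear in a different format.

*3: Included only for IOUtils.closeQuietly.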
