org.apache.mahout.clustering
Class ClusteringUtils

java.lang.Object
  extended by org.apache.mahout.clustering.ClusteringUtils

public final class ClusteringUtils
extends Object


Method Summary
static double choose2(double n)
          Computes "n choose 2", the number of unordered pairs among n items.
static double daviesBouldinIndex(List<? extends Vector> centroids, DistanceMeasure distanceMeasure, List<OnlineSummarizer> clusterDistanceSummaries)
          Computes the Davies-Bouldin Index for a given clustering.
static double dunnIndex(List<? extends Vector> centroids, DistanceMeasure distanceMeasure, List<OnlineSummarizer> clusterDistanceSummaries)
          Computes the Dunn Index of a given clustering.
static double estimateDistanceCutoff(Iterable<? extends Vector> data, DistanceMeasure distanceMeasure, int sampleLimit)
          Estimates the distance cutoff from at most the first sampleLimit points.
static double estimateDistanceCutoff(List<? extends Vector> data, DistanceMeasure distanceMeasure)
          Estimates the distance cutoff.
static double getAdjustedRandIndex(Matrix confusionMatrix)
          Computes the Adjusted Rand Index for a given confusion matrix.
static Matrix getConfusionMatrix(List<? extends Vector> rowCentroids, List<? extends Vector> columnCentroids, Iterable<? extends Vector> datapoints, DistanceMeasure distanceMeasure)
          Creates a confusion matrix by finding, for each point, the closest cluster in both the row clustering and the column clustering, and adding the point's weight to the corresponding cell of the matrix.
static List<OnlineSummarizer> summarizeClusterDistances(Iterable<? extends Vector> datapoints, Iterable<? extends Vector> centroids, DistanceMeasure distanceMeasure)
          Computes the summaries for the distances in each cluster.
static double totalClusterCost(Iterable<? extends Vector> datapoints, Iterable<? extends Vector> centroids)
          Adds up the distances from each point to its closest cluster and returns the sum.
static double totalClusterCost(Iterable<? extends Vector> datapoints, Searcher centroids)
          Adds up the distances from each point to its closest cluster and returns the sum.
static double totalWeight(Iterable<? extends Vector> data)
          Computes the total weight of the points in the given Vector iterable.
 
Methods inherited from class java.lang.Object
clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait
 

Method Detail

summarizeClusterDistances

public static List<OnlineSummarizer> summarizeClusterDistances(Iterable<? extends Vector> datapoints,
                                                               Iterable<? extends Vector> centroids,
                                                               DistanceMeasure distanceMeasure)
Computes the summaries for the distances in each cluster.

Parameters:
datapoints - iterable of datapoints.
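centroids - iterable of Centroids.
distanceMeasure - the distance measure used to compute the distance between a point and a centroid.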
Returns:
a list of OnlineSummarizers where the i-th element is the summarizer corresponding to the cluster whose index is i.
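
The per-cluster bookkeeping described above can be sketched without Mahout on the classpath. The following is only an illustration under stated assumptions: vectors are plain double[] arrays, the distance is Euclidean, and each "summary" is reduced to a count and a running mean rather than a full OnlineSummarizer. The class and method names are hypothetical.

```java
// Illustrative sketch only; not the Mahout implementation.
final class ClusterDistanceSummarySketch {
    /** Euclidean distance between two equal-length vectors (assumed measure). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Index of the centroid closest to the point; ties go to the lowest index. */
    static int closest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDistance = distance(point, centroids[0]);
        for (int i = 1; i < centroids.length; i++) {
            double d = distance(point, centroids[i]);
            if (d < bestDistance) {
                best = i;
                bestDistance = d;
            }
        }
        return best;
    }

    /** Row i holds {count, mean distance} for the cluster whose index is i. */
    static double[][] summarize(double[][] datapoints, double[][] centroids) {
        double[][] summary = new double[centroids.length][2];
        for (double[] point : datapoints) {
            int c = closest(point, centroids);
            double d = distance(point, centroids[c]);
            summary[c][0] += 1;
            // Incremental (online) update of the mean distance.
            summary[c][1] += (d - summary[c][1]) / summary[c][0];
        }
        return summary;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {2, 0}, {10, 0}};
        double[][] centroids = {{1, 0}, {10, 0}};
        for (double[] row : summarize(points, centroids)) {
            System.out.println("count=" + row[0] + " mean=" + row[1]);
        }
    }
}
```

As in the documented return value, the i-th row corresponds to the cluster whose index is i.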

totalClusterCost

public static double totalClusterCost(Iterable<? extends Vector> datapoints,
                                      Iterable<? extends Vector> centroids)
Adds up the distances from each point to its closest cluster and returns the sum.

Parameters:
datapoints - iterable of datapoints.
centroids - iterable of Centroids.
Returns:
the total cost described above.
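
As a rough illustration of the cost described above, the sketch below sums each point's distance to its nearest centroid. It assumes plain double[] vectors and Euclidean distance; the class name and the array-based signature are hypothetical and stand in for the Vector-based method documented here.

```java
// Illustrative sketch only; not the Mahout implementation.
final class TotalClusterCostSketch {
    /** Euclidean distance between two equal-length vectors (assumed measure). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Sum over all points of the distance to the nearest centroid. */
    static double totalClusterCost(double[][] datapoints, double[][] centroids) {
        double total = 0.0;
        for (double[] point : datapoints) {
            double nearest = Double.POSITIVE_INFINITY;
            for (double[] centroid : centroids) {
                nearest = Math.min(nearest, distance(point, centroid));
            }
            total += nearest;
        }
        return total;
    }

    public static void main(String[] args) {
        double[][] points = {{0, 0}, {2, 0}, {10, 0}};
        double[][] centroids = {{1, 0}, {10, 0}};
        // Distances to the nearest centroid are 1, 1 and 0.
        System.out.println(totalClusterCost(points, centroids)); // prints 2.0
    }
}
```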

totalClusterCost

public static double totalClusterCost(Iterable<? extends Vector> datapoints,
                                      Searcher centroids)
Adds up the distances from each point to its closest cluster and returns the sum.

Parameters:
datapoints - iterable of datapoints.
centroids - searcher of Centroids.
Returns:
the total cost described above.

estimateDistanceCutoff

public static double estimateDistanceCutoff(List<? extends Vector> data,
                                            DistanceMeasure distanceMeasure)
Estimates the distance cutoff. In StreamingKMeans, the distance between two vectors divided by this value is used as a probability threshold when deciding whether to form a new cluster. Small values (comparable to the minimum distance between two points) are preferred because they make it highly likely that all but very close points start out in separate clusters. Whenever the number of clusters exceeds the maximum number of clusters, the clusters are collapsed and the distanceCutoff is increased, so the returned value is only an initial estimate.

Parameters:
data - the datapoints from which the distance cutoff is estimated.
distanceMeasure - the distance measure used to compute the distance between two points.
Returns:
the minimum distance between two of the given points (in the three-argument overload, only the first sampleLimit points are considered)
See Also:
StreamingKMeans.clusterInternal(Iterable, boolean)
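
One way to picture the estimate is as the smallest pairwise distance within a bounded sample. The sketch below assumes plain double[] vectors, Euclidean distance, and a brute-force pairwise scan; the class and method names are hypothetical, and the search strategy actually used by the library may differ.

```java
// Illustrative sketch only; not the Mahout implementation.
final class DistanceCutoffSketch {
    /** Euclidean distance between two equal-length vectors (assumed measure). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Smallest distance between any two of the first sampleLimit points.
     * Returns positive infinity when fewer than two points are examined.
     */
    static double estimateDistanceCutoff(double[][] data, int sampleLimit) {
        int n = Math.min(data.length, sampleLimit);
        double min = Double.POSITIVE_INFINITY;
        for (int i = 0; i < n; i++) {
            for (int j = i + 1; j < n; j++) {
                min = Math.min(min, distance(data[i], data[j]));
            }
        }
        return min;
    }

    public static void main(String[] args) {
        double[][] data = {{0, 0}, {3, 4}, {3, 5}};
        System.out.println(estimateDistanceCutoff(data, 2)); // only the first two points
        System.out.println(estimateDistanceCutoff(data, 3)); // all three points
    }
}
```

Limiting the sample keeps the quadratic scan cheap, which fits the note above that the result is only an initial estimate.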

estimateDistanceCutoff

public static double estimateDistanceCutoff(Iterable<? extends Vector> data,
                                            DistanceMeasure distanceMeasure,
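                                            int sampleLimit)
Estimates the distance cutoff from at most the first sampleLimit points of data. See estimateDistanceCutoff(List, DistanceMeasure) for how the value is used.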

daviesBouldinIndex

public static double daviesBouldinIndex(List<? extends Vector> centroids,
                                        DistanceMeasure distanceMeasure,
                                        List<OnlineSummarizer> clusterDistanceSummaries)
Computes the Davies-Bouldin Index for a given clustering. See http://en.wikipedia.org/wiki/Clustering_algorithm#Internal_evaluation

Parameters:
centroids - list of centroids
distanceMeasure - distance measure for inter-cluster distances
clusterDistanceSummaries - summaries of the clusters; see summarizeClusterDistances
Returns:
the Davies-Bouldin Index
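
For reference, the textbook Davies-Bouldin Index averages, over all clusters i, the largest value of (s_i + s_j) / d(c_i, c_j) over the other clusters j, where s_i is the spread of cluster i and d is the distance between centroids. The sketch below assumes plain double[] centroids, Euclidean distance, and the mean point-to-centroid distance as the spread; which summary statistic the library reads from each OnlineSummarizer is not stated here. All names are hypothetical.

```java
// Illustrative sketch only; not the Mahout implementation.
final class DaviesBouldinSketch {
    /** Euclidean distance between two equal-length vectors (assumed measure). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * meanDistances[i] is the assumed spread of cluster i
     * (mean distance of its points to centroids[i]).
     */
    static double daviesBouldinIndex(double[][] centroids, double[] meanDistances) {
        int k = centroids.length;
        double total = 0.0;
        for (int i = 0; i < k; i++) {
            double worst = 0.0;
            for (int j = 0; j < k; j++) {
                if (i == j) {
                    continue;
                }
                double ratio = (meanDistances[i] + meanDistances[j])
                        / distance(centroids[i], centroids[j]);
                worst = Math.max(worst, ratio);
            }
            total += worst;
        }
        return total / k;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {4, 0}};
        double[] meanDistances = {1, 1};
        // Both clusters contribute (1 + 1) / 4.
        System.out.println(daviesBouldinIndex(centroids, meanDistances)); // prints 0.5
    }
}
```

Lower values indicate tighter, better separated clusters.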

dunnIndex

public static double dunnIndex(List<? extends Vector> centroids,
                               DistanceMeasure distanceMeasure,
                               List<OnlineSummarizer> clusterDistanceSummaries)
Computes the Dunn Index of a given clustering. See http://en.wikipedia.org/wiki/Dunn_index

Parameters:
centroids - list of centroids
distanceMeasure - distance measure to compute inter-centroid distance with
clusterDistanceSummaries - summaries of the clusters; see summarizeClusterDistances
Returns:
the Dunn Index
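
In its usual form, the Dunn Index is the smallest distance between two clusters divided by the largest intra-cluster distance. The sketch below assumes plain double[] centroids, Euclidean distance between centroids as the inter-cluster distance, and one precomputed intra-cluster distance per cluster; which statistic the library takes from each OnlineSummarizer is left open. All names are hypothetical.

```java
// Illustrative sketch only; not the Mahout implementation.
final class DunnIndexSketch {
    /** Euclidean distance between two equal-length vectors (assumed measure). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** intraClusterDistances[i] is the assumed spread of cluster i. */
    static double dunnIndex(double[][] centroids, double[] intraClusterDistances) {
        double maxIntra = 0.0;
        for (double spread : intraClusterDistances) {
            maxIntra = Math.max(maxIntra, spread);
        }
        double minInter = Double.POSITIVE_INFINITY;
        for (int i = 0; i < centroids.length; i++) {
            for (int j = i + 1; j < centroids.length; j++) {
                minInter = Math.min(minInter, distance(centroids[i], centroids[j]));
            }
        }
        return minInter / maxIntra;
    }

    public static void main(String[] args) {
        double[][] centroids = {{0, 0}, {4, 0}, {0, 3}};
        double[] spreads = {1, 2, 0.5};
        // Smallest centroid distance is 3, largest spread is 2.
        System.out.println(dunnIndex(centroids, spreads)); // prints 1.5
    }
}
```

Higher values indicate compact clusters that lie far apart.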

choose2

public static double choose2(double n)
Computes "n choose 2", the number of unordered pairs among n items: n * (n - 1) / 2.

getConfusionMatrix

public static Matrix getConfusionMatrix(List<? extends Vector> rowCentroids,
                                        List<? extends Vector> columnCentroids,
                                        Iterable<? extends Vector> datapoints,
                                        DistanceMeasure distanceMeasure)
Creates a confusion matrix by finding, for each point, the closest cluster in both the row clustering and the column clustering, and adding the point's weight to the corresponding cell of the matrix. It does not matter which clustering is used for the rows and which for the columns: interchanging them yields the transpose of the original matrix.

Parameters:
rowCentroids - centroids of the first clustering (matrix rows)
columnCentroids - centroids of the second clustering (matrix columns)
datapoints - datapoints whose closest cluster we need to find
distanceMeasure - distance measure to use
Returns:
the confusion matrix
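
The construction described above can be sketched as follows. This is a simplified illustration: vectors are plain double[] arrays, the distance is Euclidean, and every point is given weight 1 (the documented method adds each point's own weight). The class and method names are hypothetical.

```java
// Illustrative sketch only; not the Mahout implementation.
final class ConfusionMatrixSketch {
    /** Euclidean distance between two equal-length vectors (assumed measure). */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /** Index of the centroid closest to the point; ties go to the lowest index. */
    static int closest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDistance = distance(point, centroids[0]);
        for (int i = 1; i < centroids.length; i++) {
            double d = distance(point, centroids[i]);
            if (d < bestDistance) {
                best = i;
                bestDistance = d;
            }
        }
        return best;
    }

    /** Cell [r][c] counts the points closest to row centroid r and column centroid c. */
    static double[][] confusionMatrix(double[][] rowCentroids,
                                      double[][] columnCentroids,
                                      double[][] datapoints) {
        double[][] matrix = new double[rowCentroids.length][columnCentroids.length];
        for (double[] point : datapoints) {
            int row = closest(point, rowCentroids);
            int column = closest(point, columnCentroids);
            matrix[row][column] += 1.0; // unit weight assumed for every point
        }
        return matrix;
    }

    public static void main(String[] args) {
        double[][] rows = {{0}, {10}};
        double[][] columns = {{10}, {0}};
        double[][] points = {{1}, {2}, {9}};
        double[][] m = confusionMatrix(rows, columns, points);
        System.out.println(m[0][1] + " " + m[1][0]); // prints 2.0 1.0
    }
}
```

Swapping the two centroid arguments produces the transposed matrix, as noted in the description.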

getAdjustedRandIndex

public static double getAdjustedRandIndex(Matrix confusionMatrix)
Computes the Adjusted Rand Index for a given confusion matrix.

Parameters:
confusionMatrix - confusion matrix; not to be confused with the more restrictive ConfusionMatrix class
Returns:
the Adjusted Rand Index
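
The standard Adjusted Rand Index formula can be evaluated directly from such a matrix using pair counts over the cells, the row sums, and the column sums. The sketch below assumes the matrix is a plain double[][] and follows the textbook definition; the class name is hypothetical, and the local choose2 helper is written as n * (n - 1) / 2, which is an assumption about what the method of the same name in this class computes.

```java
// Illustrative sketch only; not the Mahout implementation.
final class AdjustedRandIndexSketch {
    /** Number of unordered pairs among n items (assumed definition). */
    static double choose2(double n) {
        return n * (n - 1) / 2;
    }

    /**
     * Textbook Adjusted Rand Index of a confusion (contingency) matrix.
     * Yields NaN when the denominator is zero, for example when every
     * point falls into a single cell.
     */
    static double adjustedRandIndex(double[][] confusion) {
        int rows = confusion.length;
        int columns = confusion[0].length;
        double[] rowSums = new double[rows];
        double[] columnSums = new double[columns];
        double total = 0.0;
        double cellPairs = 0.0;
        for (int i = 0; i < rows; i++) {
            for (int j = 0; j < columns; j++) {
                double value = confusion[i][j];
                rowSums[i] += value;
                columnSums[j] += value;
                total += value;
                cellPairs += choose2(value);
            }
        }
        double rowPairs = 0.0;
        for (double sum : rowSums) {
            rowPairs += choose2(sum);
        }
        double columnPairs = 0.0;
        for (double sum : columnSums) {
            columnPairs += choose2(sum);
        }
        double expected = rowPairs * columnPairs / choose2(total);
        double maximum = (rowPairs + columnPairs) / 2;
        return (cellPairs - expected) / (maximum - expected);
    }

    public static void main(String[] args) {
        // Two clusterings that agree perfectly on four points.
        System.out.println(adjustedRandIndex(new double[][]{{2, 0}, {0, 2}}));
        // Two clusterings that split every cluster of the other in half.
        System.out.println(adjustedRandIndex(new double[][]{{1, 1}, {1, 1}}));
    }
}
```

A value of 1 means the two clusterings agree on every pair of points, and values near 0 indicate agreement no better than chance.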

totalWeight

public static double totalWeight(Iterable<? extends Vector> data)
Computes the total weight of the points in the given Vector iterable.

Parameters:
data - iterable of points
Returns:
total weight


Copyright © 2008–2014 The Apache Software Foundation. All rights reserved.