weka.clusterers
Class SimpleKMeans

java.lang.Object
  extended by weka.clusterers.AbstractClusterer
      extended by weka.clusterers.RandomizableClusterer
          extended by weka.clusterers.SimpleKMeans
All Implemented Interfaces:
Serializable, Cloneable, Clusterer, NumberOfClustersRequestable, CapabilitiesHandler, OptionHandler, Randomizable, RevisionHandler, TechnicalInformationHandler, WeightedInstancesHandler

public class SimpleKMeans
extends RandomizableClusterer
implements NumberOfClustersRequestable, WeightedInstancesHandler, TechnicalInformationHandler

Clusters data using the k-means algorithm. Either the Euclidean distance (default) or the Manhattan distance can be used. If the Manhattan distance is used, centroids are computed as the component-wise median rather than the mean. For more information, see:

D. Arthur, S. Vassilvitskii: k-means++: the advantages of careful seeding. In: Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms, 1027-1035, 2007.

BibTeX:

 @inproceedings{Arthur2007,
    author = {D. Arthur and S. Vassilvitskii},
    booktitle = {Proceedings of the eighteenth annual ACM-SIAM symposium on Discrete algorithms},
    pages = {1027-1035},
    title = {k-means++: the advantages of careful seeding},
    year = {2007}
 }
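
The class description notes that the Manhattan distance is paired with a component-wise median centroid instead of a mean. The following is a minimal, library-independent sketch of the two centroid computations; the class and method names (CentroidSketch, mean, median) are hypothetical and are not part of the Weka API.

```java
import java.util.Arrays;

/** Illustrative only: component-wise mean and median of a set of points. */
final class CentroidSketch {

    /** Component-wise mean, the centroid that pairs with the Euclidean distance. */
    static double[] mean(double[][] points) {
        int dims = points[0].length;
        double[] centroid = new double[dims];
        for (double[] p : points) {
            for (int d = 0; d < dims; d++) {
                centroid[d] += p[d];
            }
        }
        for (int d = 0; d < dims; d++) {
            centroid[d] /= points.length;
        }
        return centroid;
    }

    /** Component-wise median, the centroid that pairs with the Manhattan distance. */
    static double[] median(double[][] points) {
        int dims = points[0].length;
        double[] centroid = new double[dims];
        for (int d = 0; d < dims; d++) {
            // Collect and sort the values of dimension d.
            double[] column = new double[points.length];
            for (int i = 0; i < points.length; i++) {
                column[i] = points[i][d];
            }
            Arrays.sort(column);
            int mid = column.length / 2;
            // Odd count: middle value. Even count: average of the two middle values.
            centroid[d] = (column.length % 2 == 1)
                    ? column[mid]
                    : (column[mid - 1] + column[mid]) / 2.0;
        }
        return centroid;
    }
}
```

For the points (1, 10), (2, 20) and (100, 30), the mean is roughly (34.33, 20) while the median is (2, 20), which shows how the median is less affected by the outlying first coordinate.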
 

Valid options are:

 -N <num>
  Number of clusters.
  (default 2).
 -P
  Initialize using the k-means++ method.
 
 -V
  Display std. deviations for centroids.
 
 -M
  Don't replace missing values with mean/mode.
 
 -A <classname and options>
  Distance function to use.
  (default: weka.core.EuclideanDistance)
 -I <num>
  Maximum number of iterations.
 
 -O
  Preserve order of instances.
 
 -fast
  Enables faster distance calculations, using cut-off values.
  Disables the calculation/output of squared errors/distances.
 
 -S <num>
  Random number seed.
  (default 10)

Version:
$Revision: 8034 $
Author:
Mark Hall (mhall@cs.waikato.ac.nz), Eibe Frank (eibe@cs.waikato.ac.nz)
See Also:
RandomizableClusterer, Serialized Form

Constructor Summary
SimpleKMeans()
          The default constructor.
 
Method Summary
 void buildClusterer(Instances data)
          Generates a clusterer.
 int clusterInstance(Instance instance)
          Assigns a given instance to a cluster.
 String displayStdDevsTipText()
          Returns the tip text for this property.
 String distanceFunctionTipText()
          Returns the tip text for this property.
 String dontReplaceMissingValuesTipText()
          Returns the tip text for this property.
 String fastDistanceCalcTipText()
          Returns the tip text for this property.
 int[] getAssignments()
          Gets the assignments for each instance.
 Capabilities getCapabilities()
          Returns default capabilities of the clusterer.
 Instances getClusterCentroids()
          Gets the cluster centroids.
 int[][][] getClusterNominalCounts()
          Returns for each cluster the frequency counts for the values of each nominal attribute.
 int[] getClusterSizes()
          Gets the number of instances in each cluster.
 Instances getClusterStandardDevs()
          Gets the standard deviations of the numeric attributes in each cluster.
 boolean getDisplayStdDevs()
          Gets whether standard deviations and nominal counts should be displayed in the clustering output.
 DistanceFunction getDistanceFunction()
          Returns the distance function currently in use.
 boolean getDontReplaceMissingValues()
          Gets whether replacement of missing values is disabled.
 boolean getFastDistanceCalc()
          Gets whether to use faster distance calculation.
 boolean getInitializeUsingKMeansPlusPlusMethod()
          Gets whether to initialize using the probabilistic farthest-first-like method of the k-means++ algorithm (rather than the standard random selection of initial cluster centers).
 int getMaxIterations()
          Gets the maximum number of iterations to be executed.
 int getNumClusters()
          Gets the number of clusters to generate.
 String[] getOptions()
          Gets the current settings of SimpleKMeans.
 boolean getPreserveInstancesOrder()
          Gets whether order of instances must be preserved.
 String getRevision()
          Returns the revision string.
 double getSquaredError()
          Gets the squared error for all clusters.
 TechnicalInformation getTechnicalInformation()
          Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.
 String globalInfo()
          Returns a string describing this clusterer.
 String initializeUsingKMeansPlusPlusMethodTipText()
          Returns the tip text for this property.
 Enumeration listOptions()
          Returns an enumeration describing the available options.
static void main(String[] args)
          Main method for executing this class.
 String maxIterationsTipText()
          Returns the tip text for this property.
 int numberOfClusters()
          Returns the number of clusters.
 String numClustersTipText()
          Returns the tip text for this property.
 String preserveInstancesOrderTipText()
          Returns the tip text for this property.
 void setDisplayStdDevs(boolean stdD)
          Sets whether standard deviations and nominal counts should be displayed in the clustering output.
 void setDistanceFunction(DistanceFunction df)
          Sets the distance function to use for instance comparison.
 void setDontReplaceMissingValues(boolean r)
          Sets whether replacement of missing values is disabled.
 void setFastDistanceCalc(boolean value)
          Sets whether to use faster distance calculation.
 void setInitializeUsingKMeansPlusPlusMethod(boolean k)
          Sets whether to initialize using the probabilistic farthest-first-like method of the k-means++ algorithm (rather than the standard random selection of initial cluster centers).
 void setMaxIterations(int n)
          Sets the maximum number of iterations to be executed.
 void setNumClusters(int n)
          Sets the number of clusters to generate.
 void setOptions(String[] options)
          Parses a given list of options.
 void setPreserveInstancesOrder(boolean r)
          Sets whether order of instances must be preserved.
 String toString()
          Returns a string describing this clusterer.
 
Methods inherited from class weka.clusterers.RandomizableClusterer
getSeed, seedTipText, setSeed
 
Methods inherited from class weka.clusterers.AbstractClusterer
distributionForInstance, forName, makeCopies, makeCopy, runClusterer
 
Methods inherited from class java.lang.Object
equals, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Constructor Detail

SimpleKMeans

public SimpleKMeans()
The default constructor.

Method Detail

getTechnicalInformation

public TechnicalInformation getTechnicalInformation()
Description copied from interface: TechnicalInformationHandler
Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on.

Specified by:
getTechnicalInformation in interface TechnicalInformationHandler
Returns:
the technical information about this class

globalInfo

public String globalInfo()
Returns a string describing this clusterer.

Returns:
a description of the clusterer suitable for displaying in the explorer/experimenter gui

getCapabilities

public Capabilities getCapabilities()
Returns default capabilities of the clusterer.

Specified by:
getCapabilities in interface Clusterer
Specified by:
getCapabilities in interface CapabilitiesHandler
Overrides:
getCapabilities in class AbstractClusterer
Returns:
the capabilities of this clusterer
See Also:
Capabilities

buildClusterer

public void buildClusterer(Instances data)
                    throws Exception
Generates a clusterer. Has to initialize all fields of the clusterer that are not being set via options.

Specified by:
buildClusterer in interface Clusterer
Specified by:
buildClusterer in class AbstractClusterer
Parameters:
data - set of instances serving as training data
Throws:
Exception - if the clusterer has not been generated successfully
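
For readers unfamiliar with what building a k-means clusterer involves, the following is a minimal sketch of the standard assign-and-update loop, bounded by a maximum number of iterations. It is not Weka's implementation: the names (LloydSketch, cluster) are hypothetical, initial centroids are supplied by the caller, only the squared Euclidean distance is used, and an empty cluster simply keeps its previous centroid, which is an assumption of this sketch.

```java
import java.util.Arrays;

/** Illustrative only: the basic k-means loop on plain double arrays. */
final class LloydSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Updates centroids in place and returns the final cluster index of each point. */
    static int[] cluster(double[][] data, double[][] centroids, int maxIterations) {
        int[] assignment = new int[data.length];
        Arrays.fill(assignment, -1);
        for (int iter = 0; iter < maxIterations; iter++) {
            boolean changed = false;
            // Assignment step: move each point to its nearest centroid.
            for (int i = 0; i < data.length; i++) {
                int best = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (squaredDistance(data[i], centroids[c])
                            < squaredDistance(data[i], centroids[best])) {
                        best = c;
                    }
                }
                if (best != assignment[i]) {
                    assignment[i] = best;
                    changed = true;
                }
            }
            if (!changed) {
                break; // converged: no point changed cluster
            }
            // Update step: recompute each centroid as the mean of its points.
            for (int c = 0; c < centroids.length; c++) {
                double[] sum = new double[centroids[c].length];
                int count = 0;
                for (int i = 0; i < data.length; i++) {
                    if (assignment[i] == c) {
                        count++;
                        for (int d = 0; d < sum.length; d++) {
                            sum[d] += data[i][d];
                        }
                    }
                }
                if (count > 0) { // assumption: an empty cluster keeps its old centroid
                    for (int d = 0; d < sum.length; d++) {
                        centroids[c][d] = sum[d] / count;
                    }
                }
            }
        }
        return assignment;
    }
}
```

With the one-dimensional data 1, 2, 9, 10 and initial centroids 1 and 2, the loop converges to the centroids 1.5 and 9.5 with assignments 0, 0, 1, 1.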

clusterInstance

public int clusterInstance(Instance instance)
                    throws Exception
Assigns a given instance to a cluster.

Specified by:
clusterInstance in interface Clusterer
Overrides:
clusterInstance in class AbstractClusterer
Parameters:
instance - the instance to be assigned to a cluster
Returns:
the index of the assigned cluster as an integer
Throws:
Exception - if the instance could not be assigned to a cluster

numberOfClusters

public int numberOfClusters()
                     throws Exception
Returns the number of clusters.

Specified by:
numberOfClusters in interface Clusterer
Specified by:
numberOfClusters in class AbstractClusterer
Returns:
the number of clusters generated for a training dataset.
Throws:
Exception - if number of clusters could not be returned successfully

listOptions

public Enumeration listOptions()
Returns an enumeration describing the available options.

Specified by:
listOptions in interface OptionHandler
Overrides:
listOptions in class RandomizableClusterer
Returns:
an enumeration of all the available options.

numClustersTipText

public String numClustersTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setNumClusters

public void setNumClusters(int n)
                    throws Exception
Sets the number of clusters to generate.

Specified by:
setNumClusters in interface NumberOfClustersRequestable
Parameters:
n - the number of clusters to generate
Throws:
Exception - if the number of clusters is not positive

getNumClusters

public int getNumClusters()
Gets the number of clusters to generate.

Returns:
the number of clusters to generate

initializeUsingKMeansPlusPlusMethodTipText

public String initializeUsingKMeansPlusPlusMethodTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setInitializeUsingKMeansPlusPlusMethod

public void setInitializeUsingKMeansPlusPlusMethod(boolean k)
Sets whether to initialize using the probabilistic farthest-first-like method of the k-means++ algorithm (rather than the standard random selection of initial cluster centers).

Parameters:
k - true if the k-means++ method is to be used to select initial cluster centers.
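
The "probabilistic farthest-first-like" selection mentioned above can be sketched as follows, based on the description in the cited Arthur and Vassilvitskii paper: the first center is drawn uniformly at random, and each further center is drawn with probability proportional to its squared distance from the nearest center chosen so far. This is an independent illustration, not Weka's code; the names (KMeansPlusPlusSketch, chooseCenters) are hypothetical.

```java
import java.util.Arrays;
import java.util.Random;

/** Illustrative only: k-means++ style selection of initial centers. */
final class KMeansPlusPlusSketch {

    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    /** Returns the indices (into data) of the k points chosen as initial centers. */
    static int[] chooseCenters(double[][] data, int k, Random rnd) {
        int[] centers = new int[k];
        centers[0] = rnd.nextInt(data.length); // first center: uniform choice
        // nearest[i] = squared distance from point i to its closest chosen center
        double[] nearest = new double[data.length];
        Arrays.fill(nearest, Double.POSITIVE_INFINITY);
        for (int c = 1; c < k; c++) {
            double total = 0.0;
            for (int i = 0; i < data.length; i++) {
                nearest[i] = Math.min(nearest[i],
                        squaredDistance(data[i], data[centers[c - 1]]));
                total += nearest[i];
            }
            // Draw a point with probability proportional to nearest[i].
            double r = rnd.nextDouble() * total;
            double cumulative = 0.0;
            int pick = -1;
            int lastPositive = -1;
            for (int i = 0; i < data.length; i++) {
                cumulative += nearest[i];
                if (nearest[i] > 0) {
                    lastPositive = i;
                    if (cumulative > r) {
                        pick = i;
                        break;
                    }
                }
            }
            if (pick < 0) {
                // Rounding edge case, or every point coincides with a chosen center.
                pick = (lastPositive >= 0) ? lastPositive : rnd.nextInt(data.length);
            }
            centers[c] = pick;
        }
        return centers;
    }
}
```

Passing a seeded java.util.Random makes the selection reproducible, which mirrors the purpose of the -S option.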

getInitializeUsingKMeansPlusPlusMethod

public boolean getInitializeUsingKMeansPlusPlusMethod()
Gets whether to initialize using the probabilistic farthest-first-like method of the k-means++ algorithm (rather than the standard random selection of initial cluster centers).

Returns:
true if the k-means++ method is to be used to select initial cluster centers.

maxIterationsTipText

public String maxIterationsTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setMaxIterations

public void setMaxIterations(int n)
                      throws Exception
Sets the maximum number of iterations to be executed.

Parameters:
n - the maximum number of iterations
Throws:
Exception - if the maximum number of iterations is smaller than 1

getMaxIterations

public int getMaxIterations()
Gets the maximum number of iterations to be executed.

Returns:
the maximum number of iterations

displayStdDevsTipText

public String displayStdDevsTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setDisplayStdDevs

public void setDisplayStdDevs(boolean stdD)
Sets whether standard deviations and nominal counts should be displayed in the clustering output.

Parameters:
stdD - true if std. devs and counts should be displayed

getDisplayStdDevs

public boolean getDisplayStdDevs()
Gets whether standard deviations and nominal counts should be displayed in the clustering output.

Returns:
true if std. devs and counts should be displayed

dontReplaceMissingValuesTipText

public String dontReplaceMissingValuesTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setDontReplaceMissingValues

public void setDontReplaceMissingValues(boolean r)
Sets whether replacement of missing values is disabled.

Parameters:
r - true if missing values are not to be replaced

getDontReplaceMissingValues

public boolean getDontReplaceMissingValues()
Gets whether replacement of missing values is disabled.

Returns:
true if missing values are not to be replaced
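
As an illustration of what mean replacement of missing values means for a numeric attribute, here is a minimal sketch that treats NaN as the missing-value marker. It is independent of Weka; the names (MeanImputationSketch, replaceMissingWithMean) are hypothetical, and falling back to 0 for a column with no observed values is an assumption of this sketch. A nominal attribute would use the most frequent value (the mode) instead.

```java
/** Illustrative only: replace missing numeric values (NaN) with the column mean. */
final class MeanImputationSketch {

    /** Returns a copy of the column in which every NaN is replaced by the mean of the other values. */
    static double[] replaceMissingWithMean(double[] column) {
        double sum = 0.0;
        int observed = 0;
        for (double v : column) {
            if (!Double.isNaN(v)) {
                sum += v;
                observed++;
            }
        }
        // Assumption: a column with no observed values falls back to 0.
        double mean = (observed > 0) ? sum / observed : 0.0;
        double[] result = column.clone();
        for (int i = 0; i < result.length; i++) {
            if (Double.isNaN(result[i])) {
                result[i] = mean;
            }
        }
        return result;
    }
}
```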

distanceFunctionTipText

public String distanceFunctionTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

getDistanceFunction

public DistanceFunction getDistanceFunction()
Returns the distance function currently in use.

Returns:
the distance function

setDistanceFunction

public void setDistanceFunction(DistanceFunction df)
                         throws Exception
Sets the distance function to use for instance comparison.

Parameters:
df - the new distance function to use
Throws:
Exception - if instances cannot be processed
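
To show why the choice of distance function matters, the following sketch assigns one point to the nearest of two centroids under the Euclidean and the Manhattan distance and gets different answers. It works on plain arrays and ignores any attribute normalization that a library distance class may apply; the names (DistanceChoiceSketch, EUCLIDEAN, MANHATTAN, nearest) are hypothetical.

```java
import java.util.function.ToDoubleBiFunction;

/** Illustrative only: nearest-centroid assignment with a pluggable distance. */
final class DistanceChoiceSketch {

    static final ToDoubleBiFunction<double[], double[]> EUCLIDEAN = (a, b) -> {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += (a[d] - b[d]) * (a[d] - b[d]);
        }
        return Math.sqrt(sum);
    };

    static final ToDoubleBiFunction<double[], double[]> MANHATTAN = (a, b) -> {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            sum += Math.abs(a[d] - b[d]);
        }
        return sum;
    };

    /** Returns the index of the centroid closest to the point under the given distance. */
    static int nearest(double[] point, double[][] centroids,
                       ToDoubleBiFunction<double[], double[]> distance) {
        int best = 0;
        double bestDist = distance.applyAsDouble(point, centroids[0]);
        for (int c = 1; c < centroids.length; c++) {
            double dist = distance.applyAsDouble(point, centroids[c]);
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}
```

For the point (0, 0) and the centroids (3, 3) and (5, 0), the Euclidean distances are about 4.24 and 5, so the first centroid wins, whereas the Manhattan distances are 6 and 5, so the second one wins.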

preserveInstancesOrderTipText

public String preserveInstancesOrderTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setPreserveInstancesOrder

public void setPreserveInstancesOrder(boolean r)
Sets whether order of instances must be preserved.

Parameters:
r - true if the order of instances is to be preserved

getPreserveInstancesOrder

public boolean getPreserveInstancesOrder()
Gets whether order of instances must be preserved.

Returns:
true if the order of instances is to be preserved

fastDistanceCalcTipText

public String fastDistanceCalcTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the explorer/experimenter gui

setFastDistanceCalc

public void setFastDistanceCalc(boolean value)
Sets whether to use faster distance calculation.

Parameters:
value - true if faster calculation is to be used

getFastDistanceCalc

public boolean getFastDistanceCalc()
Gets whether to use faster distance calculation.

Returns:
true if faster calculation is used
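
The idea of a distance calculation with a cut-off value can be sketched as follows: while summing squared differences, the computation is abandoned as soon as the partial sum exceeds the best distance found so far. This is one plausible reading of the option description, not a statement about Weka's internals; the names (CutOffDistanceSketch, squaredDistance, nearest) are hypothetical.

```java
/** Illustrative only: nearest-centroid search with early termination. */
final class CutOffDistanceSketch {

    /**
     * Squared Euclidean distance, or positive infinity as soon as the
     * running sum exceeds cutOff (the exact value is then never computed).
     */
    static double squaredDistance(double[] a, double[] b, double cutOff) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
            if (sum > cutOff) {
                return Double.POSITIVE_INFINITY;
            }
        }
        return sum;
    }

    /** Returns the index of the nearest centroid, pruning with the best distance so far. */
    static int nearest(double[] point, double[][] centroids) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double dist = squaredDistance(point, centroids[c], bestDist);
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}
```

Because pruned candidates never yield an exact distance, a scheme like this trades some per-distance information for speed.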

setOptions

public void setOptions(String[] options)
                throws Exception
Parses a given list of options.

Valid options are:

 -N <num>
  Number of clusters.
  (default 2).
 -P
  Initialize using the k-means++ method.
 
 -V
  Display std. deviations for centroids.
 
 -M
  Don't replace missing values with mean/mode.
 
 -A <classname and options>
  Distance function to use.
  (default: weka.core.EuclideanDistance)
 -I <num>
  Maximum number of iterations.
 
 -O
  Preserve order of instances.
 
 -fast
  Enables faster distance calculations, using cut-off values.
  Disables the calculation/output of squared errors/distances.
 
 -S <num>
  Random number seed.
  (default 10)

Specified by:
setOptions in interface OptionHandler
Overrides:
setOptions in class RandomizableClusterer
Parameters:
options - the list of options as an array of strings
Throws:
Exception - if an option is not supported

getOptions

public String[] getOptions()
Gets the current settings of SimpleKMeans.

Specified by:
getOptions in interface OptionHandler
Overrides:
getOptions in class RandomizableClusterer
Returns:
an array of strings suitable for passing to setOptions()

toString

public String toString()
Returns a string describing this clusterer.

Overrides:
toString in class Object
Returns:
a description of the clusterer as a string

getClusterCentroids

public Instances getClusterCentroids()
Gets the cluster centroids.

Returns:
the cluster centroids

getClusterStandardDevs

public Instances getClusterStandardDevs()
Gets the standard deviations of the numeric attributes in each cluster.

Returns:
the standard deviations of the numeric attributes in each cluster

getClusterNominalCounts

public int[][][] getClusterNominalCounts()
Returns for each cluster the frequency counts for the values of each nominal attribute.

Returns:
the counts
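
A three-dimensional count array of this kind can be pictured with the sketch below, which tallies nominal values per cluster. The index order [cluster][attribute][value] is an assumption inferred from the description above, and the names (NominalCountsSketch, count) and the integer encoding of nominal values are hypothetical.

```java
/** Illustrative only: per-cluster frequency counts of nominal attribute values. */
final class NominalCountsSketch {

    /**
     * data[i][a] is the value index of nominal attribute a for instance i,
     * assignment[i] is the cluster of instance i, and numValues[a] is the
     * number of distinct values of attribute a.
     * The result is indexed as [cluster][attribute][value] (assumed order).
     */
    static int[][][] count(int[][] data, int[] assignment,
                           int numClusters, int[] numValues) {
        int[][][] counts = new int[numClusters][numValues.length][];
        for (int c = 0; c < numClusters; c++) {
            for (int a = 0; a < numValues.length; a++) {
                counts[c][a] = new int[numValues[a]];
            }
        }
        for (int i = 0; i < data.length; i++) {
            for (int a = 0; a < numValues.length; a++) {
                counts[assignment[i]][a][data[i][a]]++;
            }
        }
        return counts;
    }
}
```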

getSquaredError

public double getSquaredError()
Gets the squared error for all clusters.

Returns:
the squared error, NaN if fast distance calculation is used
See Also:
m_FastDistanceCalc
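
The squared error of a clustering is commonly defined as the sum, over all instances, of the squared distance between an instance and the centroid of its cluster. The sketch below computes that quantity on plain arrays; it is not Weka's code, the names (SquaredErrorSketch, sumOfSquaredErrors) are hypothetical, and a library implementation may normalize attributes first, so its values need not match.

```java
/** Illustrative only: within-cluster sum of squared errors. */
final class SquaredErrorSketch {

    /** Sums the squared Euclidean distance of each point to its assigned centroid. */
    static double sumOfSquaredErrors(double[][] data, int[] assignment,
                                     double[][] centroids) {
        double total = 0.0;
        for (int i = 0; i < data.length; i++) {
            double[] centroid = centroids[assignment[i]];
            for (int d = 0; d < centroid.length; d++) {
                double diff = data[i][d] - centroid[d];
                total += diff * diff;
            }
        }
        return total;
    }
}
```

For the points 1, 2, 9, 10 with centroids 1.5 and 9.5, each point is 0.5 away from its centroid, giving a squared error of 4 × 0.25 = 1.0.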

getClusterSizes

public int[] getClusterSizes()
Gets the number of instances in each cluster.

Returns:
The number of instances in each cluster

getAssignments

public int[] getAssignments()
                     throws Exception
Gets the assignments for each instance.

Returns:
Array of indexes of the centroid assigned to each instance
Throws:
Exception - if order of instances wasn't preserved or no assignments were made
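
An assignment array of this form, with one centroid index per instance, relates to the cluster sizes in a simple way: the size of a cluster is the number of entries equal to its index. A minimal sketch, with hypothetical names (ClusterSizesSketch, sizes):

```java
/** Illustrative only: derive cluster sizes from per-instance assignments. */
final class ClusterSizesSketch {

    /** Counts how many instances are assigned to each of numClusters clusters. */
    static int[] sizes(int[] assignments, int numClusters) {
        int[] sizes = new int[numClusters];
        for (int cluster : assignments) {
            sizes[cluster]++;
        }
        return sizes;
    }
}
```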

getRevision

public String getRevision()
Returns the revision string.

Specified by:
getRevision in interface RevisionHandler
Overrides:
getRevision in class AbstractClusterer
Returns:
the revision

main

public static void main(String[] args)
Main method for executing this class.

Parameters:
args - use -h to list all parameters


Copyright © 2012 University of Waikato, Hamilton, NZ. All Rights Reserved.