Class SAXKMeans

  • All Implemented Interfaces:
    Serializable, Cloneable, weka.clusterers.Clusterer, weka.clusterers.NumberOfClustersRequestable, weka.core.CapabilitiesHandler, weka.core.CapabilitiesIgnorer, weka.core.CommandlineRunnable, weka.core.OptionHandler, weka.core.Randomizable, weka.core.RevisionHandler, weka.core.TechnicalInformationHandler, weka.core.WeightedInstancesHandler

    public class SAXKMeans
    extends weka.clusterers.RandomizableClusterer
    implements weka.clusterers.NumberOfClustersRequestable, weka.core.WeightedInstancesHandler, weka.core.TechnicalInformationHandler
    SimpleKMeans adapted for SAX.
    Version:
    $Revision$
    Author:
    fracpete (fracpete at waikato dot ac dot nz)
    See Also:
    Serialized Form
    • Field Detail

      • m_ReplaceMissingFilter

        protected weka.filters.unsupervised.attribute.ReplaceMissingValues m_ReplaceMissingFilter
        replace missing values in training instances.
      • m_NumClusters

        protected int m_NumClusters
        number of clusters to generate.
      • m_initialStartPoints

        protected weka.core.Instances m_initialStartPoints
        Holds the initial start points, as supplied by the initialization method used
      • m_ClusterCentroids

        protected weka.core.Instances m_ClusterCentroids
        holds the cluster centroids.
      • m_ClusterStdDevs

        protected weka.core.Instances m_ClusterStdDevs
        Holds the standard deviations of the numeric attributes in each cluster.
      • m_ClusterNominalCounts

        protected int[][][] m_ClusterNominalCounts
        For each cluster, holds the frequency counts for the values of each nominal attribute.
      • m_ClusterMissingCounts

        protected int[][] m_ClusterMissingCounts
      • m_FullMeansOrMediansOrModes

        protected double[] m_FullMeansOrMediansOrModes
        Statistics on the full data set, kept for comparison purposes. For a numeric attribute the value is the mean if the Euclidean distance is used, or the median if the Manhattan distance is used; for a nominal attribute it is the mode.
      • m_FullStdDevs

        protected double[] m_FullStdDevs
      • m_FullNominalCounts

        protected int[][] m_FullNominalCounts
      • m_FullMissingCounts

        protected int[] m_FullMissingCounts
      • m_displayStdDevs

        protected boolean m_displayStdDevs
        Display standard deviations for numeric atts.
      • m_dontReplaceMissing

        protected boolean m_dontReplaceMissing
        Replace missing values globally?
      • m_ClusterSizes

        protected int[] m_ClusterSizes
        The number of instances in each cluster.
      • m_MaxIterations

        protected int m_MaxIterations
        Maximum number of iterations to be executed.
      • m_Iterations

        protected int m_Iterations
        Keep track of the number of iterations completed before convergence.
      • m_squaredErrors

        protected double[] m_squaredErrors
        Holds the squared errors for all clusters.
      • m_DistanceFunction

        protected weka.core.DistanceFunction m_DistanceFunction
        the distance function used.
      • m_PreserveOrder

        protected boolean m_PreserveOrder
        Preserve order of instances.
      • m_Assignments

        protected int[] m_Assignments
        Assignments obtained.
      • m_FastDistanceCalc

        protected boolean m_FastDistanceCalc
        whether to use fast calculation of distances (using a cut-off).
      • TAGS_SELECTION

        public static final weka.core.Tag[] TAGS_SELECTION
        Initialization methods
      • m_initializationMethod

        protected int m_initializationMethod
        The initialization method to use
      • m_speedUpDistanceCompWithCanopies

        protected boolean m_speedUpDistanceCompWithCanopies
        Whether to reduce the number of distance calculations done by k-means with canopies
      • m_centroidCanopyAssignments

        protected List<long[]> m_centroidCanopyAssignments
        Canopies that each centroid falls into (determined by T1 radius)
      • m_dataPointCanopyAssignments

        protected List<long[]> m_dataPointCanopyAssignments
        Canopies that each training instance falls into (determined by T1 radius)
      • m_canopyClusters

        protected weka.clusterers.Canopy m_canopyClusters
        The canopy clusterer (if being used)
      • m_maxCanopyCandidates

        protected int m_maxCanopyCandidates
        The maximum number of candidate canopies to hold in memory at any one time (if using canopy clustering)
      • m_periodicPruningRate

        protected int m_periodicPruningRate
        Prune low-density candidate canopies after every x instances have been seen (if using canopy clustering)
      • m_minClusterDensity

        protected double m_minClusterDensity
        The minimum cluster density (according to T2 distance) allowed. Used when periodically pruning candidate canopies (if using canopy clustering)
      • m_t2

        protected double m_t2
        The t2 radius to pass through to Canopy
      • m_t1

        protected double m_t1
        The t1 radius to pass through to Canopy
      • m_executionSlots

        protected int m_executionSlots
        Number of threads to run
      • m_executorPool

        protected transient ExecutorService m_executorPool
        For parallel execution mode
      • m_completed

        protected int m_completed
      • m_failed

        protected int m_failed
    • Constructor Detail

      • SAXKMeans

        public SAXKMeans()
        the default constructor.
    • Method Detail

      • startExecutorPool

        protected void startExecutorPool()
        Start the pool of execution threads
      • getTechnicalInformation

        public weka.core.TechnicalInformation getTechnicalInformation()
        Specified by:
        getTechnicalInformation in interface weka.core.TechnicalInformationHandler
      • globalInfo

        public String globalInfo()
        Returns a string describing this clusterer.
        Returns:
        a description of the clusterer suitable for displaying in the explorer/experimenter gui
      • getCapabilities

        public weka.core.Capabilities getCapabilities()
        Returns default capabilities of the clusterer.
        Specified by:
        getCapabilities in interface weka.core.CapabilitiesHandler
        Specified by:
        getCapabilities in interface weka.clusterers.Clusterer
        Overrides:
        getCapabilities in class weka.clusterers.AbstractClusterer
        Returns:
        the capabilities of this clusterer
      • launchMoveCentroids

        protected int launchMoveCentroids​(weka.core.Instances[] clusters)
        Launch the move centroids tasks
        Parameters:
        clusters - the cluster centroids
        Returns:
        the number of empty clusters
      • launchAssignToClusters

        protected boolean launchAssignToClusters​(weka.core.Instances insts,
                                                 int[] clusterAssignments)
                                          throws Exception
        Launch the tasks that assign instances to clusters
        Parameters:
        insts - the instances to be clustered
        clusterAssignments - the array of cluster assignments
        Returns:
        true if k means has converged
        Throws:
        Exception - if a problem occurs
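
The assignment step and its convergence test can be sketched as follows. This is a minimal illustration only, assuming plain double[][] data and squared Euclidean distance instead of weka.core.Instances and the configured distance function; the class AssignmentStepSketch and its methods are hypothetical and not part of SAXKMeans.

```java
// Hypothetical illustration; not part of SAXKMeans.
final class AssignmentStepSketch {

    /** Squared Euclidean distance between two points of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /**
     * Assigns each point to its nearest centroid and stores the centroid
     * index in clusterAssignments. Returns true if no assignment changed,
     * which is the convergence criterion assumed in this sketch.
     */
    static boolean assignToClusters(double[][] points, double[][] centroids,
                                    int[] clusterAssignments) {
        boolean converged = true;
        for (int i = 0; i < points.length; i++) {
            int best = 0;
            double bestDist = Double.POSITIVE_INFINITY;
            for (int c = 0; c < centroids.length; c++) {
                double dist = squaredDistance(points[i], centroids[c]);
                if (dist < bestDist) {
                    bestDist = dist;
                    best = c;
                }
            }
            if (clusterAssignments[i] != best) {
                converged = false;
                clusterAssignments[i] = best;
            }
        }
        return converged;
    }
}
```

Calling assignToClusters repeatedly, with a centroid update in between, until it returns true (or an iteration limit is reached) gives the usual k-means loop.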
      • buildClusterer

        public void buildClusterer​(weka.core.Instances data)
                            throws Exception
        Generates a clusterer. Has to initialize all fields of the clusterer that are not being set via options.
        Specified by:
        buildClusterer in interface weka.clusterers.Clusterer
        Specified by:
        buildClusterer in class weka.clusterers.AbstractClusterer
        Parameters:
        data - set of instances serving as training data
        Throws:
        Exception - if the clusterer has not been generated successfully
      • canopyInit

        protected void canopyInit​(weka.core.Instances data)
                           throws Exception
        Initialize with the canopy centers of the Canopy clustering method
        Parameters:
        data - the training data
        Throws:
        Exception - if a problem occurs
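
As a rough illustration of how the T1 and T2 radii interact, the sketch below implements a simple batch variant of canopy clustering on plain double[][] data. It is an assumption-laden example, not the weka.clusterers.Canopy implementation: the periodic pruning, density threshold and candidate limit described on this page are omitted, and the class CanopySketch and its methods are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of SAXKMeans or weka.clusterers.Canopy.
final class CanopySketch {

    /** Euclidean distance between two points of equal length. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Picks canopy centers: every point not yet removed becomes a center
     * and removes all candidates closer than t2 from further consideration.
     */
    static List<Integer> canopyCenters(double[][] points, double t2) {
        boolean[] removed = new boolean[points.length];
        List<Integer> centers = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            if (removed[i]) {
                continue;
            }
            centers.add(i);
            for (int j = i; j < points.length; j++) {
                if (!removed[j] && distance(points[i], points[j]) < t2) {
                    removed[j] = true;
                }
            }
        }
        return centers;
    }

    /** For each point, lists the canopies (positions in centers) within t1. */
    static List<List<Integer>> canopyMembership(double[][] points,
                                                List<Integer> centers, double t1) {
        List<List<Integer>> membership = new ArrayList<>();
        for (double[] point : points) {
            List<Integer> canopies = new ArrayList<>();
            for (int c = 0; c < centers.size(); c++) {
                if (distance(point, points[centers.get(c)]) < t1) {
                    canopies.add(c);
                }
            }
            membership.add(canopies);
        }
        return membership;
    }
}
```

The centers returned by canopyCenters could serve as start points, while the T1-based membership lists are the kind of information that allows k-means to skip distance computations between a point and a centroid that share no canopy.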
      • farthestFirstInit

        protected void farthestFirstInit​(weka.core.Instances data)
                                  throws Exception
        Initialize with the farthest first centers
        Parameters:
        data - the training data
        Throws:
        Exception - if a problem occurs
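
Farthest-first initialization can be sketched as below: after a random first center, each further center is the point farthest from its nearest already-chosen center. The example assumes plain double[][] data and squared Euclidean distance; FarthestFirstSketch and chooseCenters are hypothetical names and this is not the code used by SAXKMeans.

```java
import java.util.Arrays;
import java.util.Random;

// Hypothetical illustration; not part of SAXKMeans.
final class FarthestFirstSketch {

    /** Squared Euclidean distance between two points of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the indices of k points chosen as initial centers. */
    static int[] chooseCenters(double[][] points, int k, Random rnd) {
        int n = points.length;
        int[] centers = new int[k];
        // Distance from each point to its nearest center chosen so far.
        double[] minDist = new double[n];
        Arrays.fill(minDist, Double.POSITIVE_INFINITY);
        int current = rnd.nextInt(n);
        for (int c = 0; c < k; c++) {
            centers[c] = current;
            int farthest = 0;
            double farthestDist = -1.0;
            for (int i = 0; i < n; i++) {
                minDist[i] = Math.min(minDist[i],
                        squaredDistance(points[i], points[current]));
                if (minDist[i] > farthestDist) {
                    farthestDist = minDist[i];
                    farthest = i;
                }
            }
            // The next center is the point farthest from all chosen centers.
            current = farthest;
        }
        return centers;
    }
}
```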
      • kMeansPlusPlusInit

        protected void kMeansPlusPlusInit​(weka.core.Instances data)
                                   throws Exception
        Initialize using the k-means++ method
        Parameters:
        data - the training data
        Throws:
        Exception - if a problem occurs
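
For orientation, k-means++ seeding picks the first center uniformly at random and every further center with probability proportional to its squared distance from the nearest center chosen so far. The sketch below shows that sampling scheme on plain double[][] data; it is a hedged illustration, and KMeansPlusPlusSketch and chooseCenters are hypothetical names rather than members of SAXKMeans.

```java
import java.util.Random;

// Hypothetical illustration; not part of SAXKMeans.
final class KMeansPlusPlusSketch {

    /** Squared Euclidean distance between two points of equal length. */
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    /** Returns the indices of k points chosen as initial centers. */
    static int[] chooseCenters(double[][] points, int k, Random rnd) {
        int n = points.length;
        int[] centers = new int[k];
        double[] minDist = new double[n];
        centers[0] = rnd.nextInt(n);
        for (int i = 0; i < n; i++) {
            minDist[i] = squaredDistance(points[i], points[centers[0]]);
        }
        for (int c = 1; c < k; c++) {
            double total = 0.0;
            for (double d : minDist) {
                total += d;
            }
            // Sample an index with probability minDist[i] / total.
            double r = rnd.nextDouble() * total;
            int chosen = n - 1;
            double cumulative = 0.0;
            for (int i = 0; i < n; i++) {
                cumulative += minDist[i];
                if (cumulative > r) {
                    chosen = i;
                    break;
                }
            }
            centers[c] = chosen;
            for (int i = 0; i < n; i++) {
                minDist[i] = Math.min(minDist[i],
                        squaredDistance(points[i], points[chosen]));
            }
        }
        return centers;
    }
}
```

Points that coincide with an already chosen center have weight zero, so with distinct points the sketch never selects the same point twice.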
      • moveCentroid

        protected double[] moveCentroid​(int centroidIndex,
                                        weka.core.Instances members,
                                        boolean updateClusterInfo,
                                        boolean addToCentroidInstances)
        Move the centroid to its new coordinates. Generate the centroid coordinates based on its members (objects assigned to the cluster of the centroid) and the distance function being used.
        Parameters:
        centroidIndex - index of the centroid whose coordinates will be computed
        members - the objects that are assigned to the cluster of this centroid
        updateClusterInfo - true if the method is to update the m_Cluster arrays
        addToCentroidInstances - true if the method is to add the computed coordinates to the Instances holding the centroids
        Returns:
        the centroid coordinates
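
The dependence of the centroid on the distance function can be illustrated for numeric attributes: the per-attribute mean for Euclidean distance, the per-attribute median for Manhattan distance. The sketch below assumes plain double[][] members with no missing values and no instance weights; CentroidSketch and its methods are hypothetical, and the median convention for an even number of members (average of the two middle values) is an assumption that may differ from what SAXKMeans does.

```java
import java.util.Arrays;

// Hypothetical illustration; not part of SAXKMeans.
final class CentroidSketch {

    /** Arithmetic mean of the values. */
    static double mean(double[] values) {
        double sum = 0.0;
        for (double v : values) {
            sum += v;
        }
        return sum / values.length;
    }

    /** Median of the values; averages the two middle values for even counts. */
    static double median(double[] values) {
        double[] sorted = values.clone();
        Arrays.sort(sorted);
        int mid = sorted.length / 2;
        return sorted.length % 2 == 1
                ? sorted[mid]
                : (sorted[mid - 1] + sorted[mid]) / 2.0;
    }

    /** Computes the centroid attribute by attribute. */
    static double[] centroid(double[][] members, boolean useMedian) {
        int dims = members[0].length;
        double[] result = new double[dims];
        for (int d = 0; d < dims; d++) {
            double[] column = new double[members.length];
            for (int i = 0; i < members.length; i++) {
                column[i] = members[i][d];
            }
            result[d] = useMedian ? median(column) : mean(column);
        }
        return result;
    }
}
```

With members (1, 10), (2, 20) and (9, 30), the mean-based centroid is (4, 20) while the median-based centroid is (2, 20), which shows how the median is less affected by the outlying value 9.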
      • clusterInstance

        public int clusterInstance​(weka.core.Instance instance)
                            throws Exception
        Assigns a given instance to a cluster.
        Specified by:
        clusterInstance in interface weka.clusterers.Clusterer
        Overrides:
        clusterInstance in class weka.clusterers.AbstractClusterer
        Parameters:
        instance - the instance to be assigned to a cluster
        Returns:
        the number of the assigned cluster, as an integer
        Throws:
        Exception - if instance could not be classified successfully
      • numberOfClusters

        public int numberOfClusters()
                             throws Exception
        Returns the number of clusters.
        Specified by:
        numberOfClusters in interface weka.clusterers.Clusterer
        Specified by:
        numberOfClusters in class weka.clusterers.AbstractClusterer
        Returns:
        the number of clusters generated for a training dataset.
        Throws:
        Exception - if number of clusters could not be returned successfully
      • listOptions

        public Enumeration<weka.core.Option> listOptions()
        Returns an enumeration describing the available options.
        Specified by:
        listOptions in interface weka.core.OptionHandler
        Overrides:
        listOptions in class weka.clusterers.RandomizableClusterer
        Returns:
        an enumeration of all the available options.
      • numClustersTipText

        public String numClustersTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setNumClusters

        public void setNumClusters​(int n)
                            throws Exception
        set the number of clusters to generate.
        Specified by:
        setNumClusters in interface weka.clusterers.NumberOfClustersRequestable
        Parameters:
        n - the number of clusters to generate
        Throws:
        Exception - if number of clusters is negative
      • getNumClusters

        public int getNumClusters()
        gets the number of clusters to generate.
        Returns:
        the number of clusters to generate
      • initializationMethodTipText

        public String initializationMethodTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setInitializationMethod

        public void setInitializationMethod​(weka.core.SelectedTag method)
        Set the initialization method to use
        Parameters:
        method - the initialization method to use
      • getInitializationMethod

        public weka.core.SelectedTag getInitializationMethod()
        Get the initialization method to use
        Returns:
        the initialization method to use
      • reduceNumberOfDistanceCalcsViaCanopiesTipText

        public String reduceNumberOfDistanceCalcsViaCanopiesTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setReduceNumberOfDistanceCalcsViaCanopies

        public void setReduceNumberOfDistanceCalcsViaCanopies​(boolean c)
        Set whether to use canopies to reduce the number of distance computations required
        Parameters:
        c - true if canopies are to be used to reduce the number of distance computations
      • getReduceNumberOfDistanceCalcsViaCanopies

        public boolean getReduceNumberOfDistanceCalcsViaCanopies()
        Get whether to use canopies to reduce the number of distance computations required
        Returns:
        true if canopies are to be used to reduce the number of distance computations
      • canopyPeriodicPruningRateTipText

        public String canopyPeriodicPruningRateTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setCanopyPeriodicPruningRate

        public void setCanopyPeriodicPruningRate​(int p)
        Set how often to prune low density canopies during training (if using canopy clustering)
        Parameters:
        p - how often (every p instances) to prune low density canopies
      • getCanopyPeriodicPruningRate

        public int getCanopyPeriodicPruningRate()
        Get how often to prune low density canopies during training (if using canopy clustering)
        Returns:
        how often (every p instances) to prune low density canopies
      • canopyMinimumCanopyDensityTipText

        public String canopyMinimumCanopyDensityTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setCanopyMinimumCanopyDensity

        public void setCanopyMinimumCanopyDensity​(double dens)
        Set the minimum T2-based density below which a canopy will be pruned during periodic pruning.
        Parameters:
        dens - the minimum canopy density
      • getCanopyMinimumCanopyDensity

        public double getCanopyMinimumCanopyDensity()
        Get the minimum T2-based density below which a canopy will be pruned during periodic pruning.
        Returns:
        the minimum canopy density
      • canopyMaxNumCanopiesToHoldInMemoryTipText

        public String canopyMaxNumCanopiesToHoldInMemoryTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setCanopyMaxNumCanopiesToHoldInMemory

        public void setCanopyMaxNumCanopiesToHoldInMemory​(int max)
        Set the maximum number of candidate canopies to retain in memory during training. T2 distance and data characteristics determine how many candidate canopies are formed before periodic and final pruning are performed. There may not be enough memory available if T2 is set too low.
        Parameters:
        max - the maximum number of candidate canopies to retain in memory during training
      • getCanopyMaxNumCanopiesToHoldInMemory

        public int getCanopyMaxNumCanopiesToHoldInMemory()
        Get the maximum number of candidate canopies to retain in memory during training. T2 distance and data characteristics determine how many candidate canopies are formed before periodic and final pruning are performed. There may not be enough memory available if T2 is set too low.
        Returns:
        the maximum number of candidate canopies to retain in memory during training
      • canopyT2TipText

        public String canopyT2TipText()
        Tip text for this property
        Returns:
        the tip text for this property
      • setCanopyT2

        public void setCanopyT2​(double t2)
        Set the t2 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calcs
        Parameters:
        t2 - the t2 radius to use
      • getCanopyT2

        public double getCanopyT2()
        Get the t2 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calcs
        Returns:
        the t2 radius to use
      • canopyT1TipText

        public String canopyT1TipText()
        Tip text for this property
        Returns:
        the tip text for this property
      • setCanopyT1

        public void setCanopyT1​(double t1)
        Set the t1 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calcs
        Parameters:
        t1 - the t1 radius to use
      • getCanopyT1

        public double getCanopyT1()
        Get the t1 radius to use when canopy clustering is being used as start points and/or to reduce the number of distance calcs
        Returns:
        the t1 radius to use
      • maxIterationsTipText

        public String maxIterationsTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setMaxIterations

        public void setMaxIterations​(int n)
                              throws Exception
        set the maximum number of iterations to be executed.
        Parameters:
        n - the maximum number of iterations
        Throws:
        Exception - if the maximum number of iterations is smaller than 1
      • getMaxIterations

        public int getMaxIterations()
        gets the maximum number of iterations to be executed.
        Returns:
        the maximum number of iterations
      • displayStdDevsTipText

        public String displayStdDevsTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setDisplayStdDevs

        public void setDisplayStdDevs​(boolean stdD)
        Sets whether standard deviations and nominal counts should be displayed in the clustering output.
        Parameters:
        stdD - true if std. devs and counts should be displayed
      • getDisplayStdDevs

        public boolean getDisplayStdDevs()
        Gets whether standard deviations and nominal counts should be displayed in the clustering output.
        Returns:
        true if std. devs and counts should be displayed
      • dontReplaceMissingValuesTipText

        public String dontReplaceMissingValuesTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setDontReplaceMissingValues

        public void setDontReplaceMissingValues​(boolean r)
        Sets whether the replacement of missing values is to be skipped.
        Parameters:
        r - true if missing values are not to be replaced
      • getDontReplaceMissingValues

        public boolean getDontReplaceMissingValues()
        Gets whether the replacement of missing values is to be skipped.
        Returns:
        true if missing values are not to be replaced
      • distanceFunctionTipText

        public String distanceFunctionTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • getDistanceFunction

        public weka.core.DistanceFunction getDistanceFunction()
        returns the distance function currently in use.
        Returns:
        the distance function
      • setDistanceFunction

        public void setDistanceFunction​(weka.core.DistanceFunction df)
                                 throws Exception
        sets the distance function to use for instance comparison.
        Parameters:
        df - the new distance function to use
        Throws:
        Exception - if instances cannot be processed
      • preserveInstancesOrderTipText

        public String preserveInstancesOrderTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setPreserveInstancesOrder

        public void setPreserveInstancesOrder​(boolean r)
        Sets whether order of instances must be preserved.
        Parameters:
        r - true if the order of instances is to be preserved
      • getPreserveInstancesOrder

        public boolean getPreserveInstancesOrder()
        Gets whether order of instances must be preserved.
        Returns:
        true if the order of instances is to be preserved
      • fastDistanceCalcTipText

        public String fastDistanceCalcTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setFastDistanceCalc

        public void setFastDistanceCalc​(boolean value)
        Sets whether to use faster distance calculation.
        Parameters:
        value - true if faster calculation to be used
      • getFastDistanceCalc

        public boolean getFastDistanceCalc()
        Gets whether to use faster distance calculation.
        Returns:
        true if faster calculation is used
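
One common way to speed up nearest-centroid search with a cut-off is to abandon a distance computation as soon as its partial sum exceeds the best distance found so far. The sketch below shows that idea for squared Euclidean distance on plain double[] data; whether SAXKMeans applies the cut-off in exactly this way is an assumption, and CutoffDistanceSketch and its methods are hypothetical names.

```java
// Hypothetical illustration; not part of SAXKMeans.
final class CutoffDistanceSketch {

    /**
     * Returns the squared Euclidean distance, or positive infinity as soon
     * as the partial sum exceeds the given cut-off.
     */
    static double squaredDistance(double[] a, double[] b, double cutoff) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
            if (sum > cutoff) {
                return Double.POSITIVE_INFINITY;
            }
        }
        return sum;
    }

    /** Returns the index of the nearest centroid, or -1 if there is none. */
    static int nearest(double[] point, double[][] centroids) {
        int best = -1;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            // The best distance so far serves as cut-off for the next one.
            double dist = squaredDistance(point, centroids[c], bestDist);
            if (dist < bestDist) {
                bestDist = dist;
                best = c;
            }
        }
        return best;
    }
}
```

Because abandoned computations never yield an exact distance, a scheme like this is consistent with squared errors not being available in fast mode.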
      • numExecutionSlotsTipText

        public String numExecutionSlotsTipText()
        Returns the tip text for this property
        Returns:
        tip text for this property suitable for displaying in the explorer/experimenter gui
      • setNumExecutionSlots

        public void setNumExecutionSlots​(int slots)
        Set the degree of parallelism to use.
        Parameters:
        slots - the number of execution slots (tasks to run in parallel)
      • getNumExecutionSlots

        public int getNumExecutionSlots()
        Get the degree of parallelism to use.
        Returns:
        the number of execution slots (tasks to run in parallel)
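
To illustrate what a number of execution slots can mean for k-means, the sketch below splits the assignment step into one contiguous chunk of points per slot and runs the chunks on a fixed-size thread pool. It is a minimal example under stated assumptions (plain double[][] data, squared Euclidean distance, a pool created and shut down per call); ParallelAssignSketch and its methods are hypothetical and do not reproduce how SAXKMeans manages its executor pool.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration; not part of SAXKMeans.
final class ParallelAssignSketch {

    /** Index of the centroid with the smallest squared Euclidean distance. */
    static int nearest(double[] point, double[][] centroids) {
        int best = 0;
        double bestDist = Double.POSITIVE_INFINITY;
        for (int c = 0; c < centroids.length; c++) {
            double sum = 0.0;
            for (int i = 0; i < point.length; i++) {
                double d = point[i] - centroids[c][i];
                sum += d * d;
            }
            if (sum < bestDist) {
                bestDist = sum;
                best = c;
            }
        }
        return best;
    }

    /** Assigns all points to centroids using the given number of slots (>= 1). */
    static int[] assign(double[][] points, double[][] centroids, int slots)
            throws Exception {
        int[] assignments = new int[points.length];
        ExecutorService pool = Executors.newFixedThreadPool(slots);
        try {
            int chunk = (points.length + slots - 1) / slots;
            List<Future<?>> futures = new ArrayList<>();
            for (int s = 0; s < slots; s++) {
                final int start = s * chunk;
                final int end = Math.min(points.length, start + chunk);
                if (start >= end) {
                    break;
                }
                // Each task writes to its own disjoint index range.
                futures.add(pool.submit(() -> {
                    for (int i = start; i < end; i++) {
                        assignments[i] = nearest(points[i], centroids);
                    }
                }));
            }
            // Waiting on every future also propagates task failures.
            for (Future<?> f : futures) {
                f.get();
            }
        } finally {
            pool.shutdown();
        }
        return assignments;
    }
}
```

With a single slot the work runs as one task, which corresponds to the documented default of no parallelism.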
      • setOptions

        public void setOptions​(String[] options)
                        throws Exception
        Parses a given list of options.

        Valid options are:

         -N <num>
          Number of clusters.
          (default 2).
         -init
          Initialization method to use.
          0 = random, 1 = k-means++, 2 = canopy, 3 = farthest first.
          (default = 0)
         -C
          Use canopies to reduce the number of distance calculations.
         -max-candidates <num>
          Maximum number of candidate canopies to retain in memory
          at any one time when using canopy clustering.
          T2 distance, plus data characteristics,
          will determine how many candidate canopies are formed before
          periodic and final pruning are performed, which might result
          in excess memory consumption. This setting avoids large numbers
          of candidate canopies consuming memory. (default = 100)
         -periodic-pruning <num>
          How often to prune low density canopies when using canopy clustering. 
          (default = every 10,000 training instances)
         -min-density
          Minimum canopy density, when using canopy clustering, below which
           a canopy will be pruned during periodic pruning. (default = 2 instances)
         -t2
          The T2 distance to use when using canopy clustering. Values < 0 indicate that
          a heuristic based on attribute std. deviation should be used to set this.
          (default = -1.0)
         -t1
          The T1 distance to use when using canopy clustering. A value < 0 is taken as a
          positive multiplier for T2. (default = -1.5)
         -V
          Display std. deviations for centroids.
         
         -M
          Don't replace missing values with mean/mode.
         
         -A <classname and options>
          Distance function to use.
          (default: weka.core.SAXDistance)
         -I <num>
          Maximum number of iterations.
         
         -O
          Preserve order of instances.
         
         -fast
          Enables faster distance calculations, using cut-off values.
          Disables the calculation/output of squared errors/distances.
         
         -num-slots <num>
          Number of execution slots.
          (default 1 - i.e. no parallelism)
         -S <num>
          Random number seed.
          (default 10)
         -output-debug-info
          If set, clusterer is run in debug mode and
          may output additional info to the console
         -do-not-check-capabilities
          If set, clusterer capabilities are not checked before clusterer is built
          (use with caution).
        Specified by:
        setOptions in interface weka.core.OptionHandler
        Overrides:
        setOptions in class weka.clusterers.RandomizableClusterer
        Parameters:
        options - the list of options as an array of strings
        Throws:
        Exception - if an option is not supported
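
The -t1 convention described above (a negative value is taken as a positive multiplier for T2) can be written out as a small helper. This is only a sketch of the documented rule: it assumes T2 has already been resolved to a positive value (the standard-deviation heuristic used for negative -t2 values is not shown), and CanopyRadiusSketch.effectiveT1 is a hypothetical name.

```java
// Hypothetical illustration; not part of SAXKMeans.
final class CanopyRadiusSketch {

    /**
     * Resolves the T1 radius: a negative t1 is interpreted as a positive
     * multiplier for t2, otherwise t1 is used as given.
     */
    static double effectiveT1(double t1, double t2) {
        if (t2 <= 0) {
            // Assumption of this sketch: t2 was resolved beforehand.
            throw new IllegalArgumentException("t2 must be positive");
        }
        return t1 < 0 ? -t1 * t2 : t1;
    }
}
```

With the default -t1 of -1.5 and a resolved T2 of 2.0, the effective T1 radius in this sketch is 3.0.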
      • getOptions

        public String[] getOptions()
        Gets the current settings of SAXKMeans.
        Specified by:
        getOptions in interface weka.core.OptionHandler
        Overrides:
        getOptions in class weka.clusterers.RandomizableClusterer
        Returns:
        an array of strings suitable for passing to setOptions()
      • toString

        public String toString()
        return a string describing this clusterer.
        Overrides:
        toString in class Object
        Returns:
        a description of the clusterer as a string
      • getClusterCentroids

        public weka.core.Instances getClusterCentroids()
        Gets the cluster centroids.
        Returns:
        the cluster centroids
      • getClusterStandardDevs

        public weka.core.Instances getClusterStandardDevs()
        Gets the standard deviations of the numeric attributes in each cluster.
        Returns:
        the standard deviations of the numeric attributes in each cluster
      • getClusterNominalCounts

        public int[][][] getClusterNominalCounts()
        Returns for each cluster the frequency counts for the values of each nominal attribute.
        Returns:
        the counts
      • getSquaredError

        public double getSquaredError()
        Gets the squared error for all clusters.
        Returns:
        the squared error, NaN if fast distance calculation is used
        See Also:
        m_FastDistanceCalc
      • getClusterSizes

        public int[] getClusterSizes()
        Gets the number of instances in each cluster.
        Returns:
        The number of instances in each cluster
      • getAssignments

        public int[] getAssignments()
                             throws Exception
        Gets the assignments for each instance.
        Returns:
        Array of indexes of the centroid assigned to each instance
        Throws:
        Exception - if order of instances wasn't preserved or no assignments were made
      • getRevision

        public String getRevision()
        Returns the revision string.
        Specified by:
        getRevision in interface weka.core.RevisionHandler
        Overrides:
        getRevision in class weka.clusterers.AbstractClusterer
        Returns:
        the revision
      • main

        public static void main​(String[] args)
        Main method for executing this class.
        Parameters:
        args - use -h to list all parameters