public class Canopy extends RandomizableClusterer implements UpdateableClusterer, NumberOfClustersRequestable, OptionHandler, TechnicalInformationHandler
@inproceedings{McCallum2000,
author = {A. McCallum and K. Nigam and L.H. Ungar},
booktitle = {Proceedings of the sixth ACM SIGKDD international conference on Knowledge discovery and data mining},
pages = {169-178},
title = {Efficient Clustering of High Dimensional Data Sets with Application to Reference Matching},
year = {2000}
}
Valid options are:
-N <num> Number of clusters. (default 2).
-t2 The T2 distance to use. Values < 0 indicate that a heuristic based on attribute std. deviation should be used to set this. Note that this heuristic can only be used when batch training (default = -1.0)
-t1 The T1 distance to use. A value < 0 is taken as a positive multiplier for T2. (default = -1.5)
-M Don't replace missing values with mean/mode when running in batch mode.
-S <num> Random number seed. (default 1)
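To make the roles of the two thresholds concrete, here is a minimal sketch of canopy formation in the spirit of the cited paper: points within T1 of a chosen center join that center's canopy, and points within T2 of it are dropped from the list of candidate centers. This is not Weka's implementation. The class `CanopySketch`, its method names, the Euclidean distance, and the choice of the first remaining candidate as the next center are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative canopy formation; hypothetical code, not Weka's implementation. */
final class CanopySketch {

    /** Euclidean distance between two points of equal dimension. */
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    /**
     * Returns one list per canopy holding the indices of all points within T1
     * of that canopy's center. Points within T2 of a center are removed from
     * the candidate list, so they can no longer become centers themselves.
     */
    static List<List<Integer>> formCanopies(double[][] points, double t1, double t2) {
        if (t1 < t2) {
            throw new IllegalArgumentException("T1 must be at least as large as T2");
        }
        List<Integer> candidates = new ArrayList<>();
        for (int i = 0; i < points.length; i++) {
            candidates.add(i);
        }

        List<List<Integer>> canopies = new ArrayList<>();
        while (!candidates.isEmpty()) {
            // Assumption: take the first remaining candidate as the next center.
            final int center = candidates.get(0);

            // Loose threshold T1: every point this close belongs to the canopy.
            List<Integer> members = new ArrayList<>();
            for (int i = 0; i < points.length; i++) {
                if (distance(points[center], points[i]) <= t1) {
                    members.add(i);
                }
            }
            canopies.add(members);

            // Tight threshold T2: these points stop being candidate centers.
            candidates.removeIf(i -> distance(points[center], points[i]) <= t2);
        }
        return canopies;
    }

    public static void main(String[] args) {
        double[][] points = { {0.0}, {0.5}, {1.2}, {2.4}, {5.0} };
        // prints [[0, 1, 2], [0, 1, 2, 3], [2, 3], [4]]
        System.out.println(formCanopies(points, 1.5, 1.0));
    }
}
```

Because T1 is larger than T2, canopies may overlap: in the sample data, point 2 belongs to three canopies.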
| Modifier and Type | Field and Description |
|---|---|
| static double | DEFAULT_T1 |
| static double | DEFAULT_T2 |
| Constructor and Description |
|---|
| Canopy() |
| Modifier and Type | Method and Description |
|---|---|
| long[] | assignCanopies(Instance inst): Uses T1 distance to assign canopies to the supplied instance. |
| void | buildClusterer(Instances data): Generates a clusterer. |
| double[] | distributionForInstance(Instance instance): Predicts the cluster memberships for a given instance. |
| String | dontReplaceMissingValuesTipText(): Returns the tip text for this property. |
| double | getActualT1(): Get the actual value of T1 (which may be different from the initial value if the heuristic is used). |
| double | getActualT2(): Get the actual value of T2 (which may be different from the initial value if the heuristic is used). |
| Instances | getCanopies(): Get the canopies (cluster centers). |
| Capabilities | getCapabilities(): Returns default capabilities of the clusterer. |
| List<long[]> | getClusterCanopyAssignments(): Get the canopies that each canopy (cluster center) is within T1 distance of. |
| boolean | getDontReplaceMissingValues(): Gets whether missing values are to be replaced. |
| int | getNumClusters(): Get the number of clusters to generate. |
| String[] | getOptions(): Gets the current settings of Canopy. |
| double | getT1(): Get the T1 distance. |
| double | getT2(): Get the T2 distance to use. |
| TechnicalInformation | getTechnicalInformation(): Returns an instance of a TechnicalInformation object, containing detailed information about the technical background of this class, e.g., paper reference or book this class is based on. |
| String | globalInfo(): Returns a string describing this clusterer. |
| void | initializeDistanceFunction(Instances init): Initialize the distance function (i.e., set min/max values for numeric attributes) with the supplied instances. |
| Enumeration<Option> | listOptions(): Returns an enumeration describing the available options. |
| static void | main(String[] args) |
| static boolean | nonEmptyCanopySetIntersection(long[] first, long[] second): Tests if two sets of canopies have a non-empty intersection. |
| int | numberOfClusters(): Returns the number of clusters. |
| String | numClustersTipText(): Returns the tip text for this property. |
| static String | printCanopyAssignments(Instances dataPoints, List<long[]> canopyAssignments): Print the supplied instances and their canopies. |
| static String | printSingleAssignment(long[] assignments) |
| void | setCanopies(Instances canopies): Set the canopies to use (replaces any learned by this clusterer already). |
| void | setClusterCanopyAssignments(List<long[]> clusterCanopies): Set the canopies that each canopy (cluster center) is within T1 distance of. |
| void | setDontReplaceMissingValues(boolean r): Sets whether missing values are to be replaced. |
| void | setMissingValuesReplacer(Filter missingReplacer): Set a ready-to-use missing values replacement filter. |
| void | setNumClusters(int numClusters): Set the number of clusters to generate. |
| void | setOptions(String[] options): Parses a given list of options. |
| void | setT1(double t1): Set the T1 distance. |
| void | setT2(double t2): Set the T2 distance to use. |
| String | t1TipText(): Tip text for this property. |
| String | t2TipText(): Tip text for this property. |
| String | toString() |
| String | toString(boolean header): Return a textual description of this clusterer. |
| void | updateClusterer(Instance newInstance): Adds an instance to the clusterer. |
| void | updateFinished(): Signals the end of the updating. |
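Several of the methods above pass canopy memberships around as `long[]` arrays. One plausible reading, assumed here and not confirmed by this page, is that each array is a bitset in which bit i marks membership in canopy i. Under that assumption, an intersection test along the lines of nonEmptyCanopySetIntersection could be sketched as follows. `CanopyBitsets`, `set`, `intersects` and `members` are hypothetical names, not part of the Weka API.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical bitset helpers for illustration; not Weka's source code. */
final class CanopyBitsets {

    /** Marks the given canopy index in a bitset packed into 64-bit words. */
    static void set(long[] bits, int canopy) {
        bits[canopy / 64] |= 1L << (canopy % 64);
    }

    /** Returns true if at least one canopy index is set in both arrays. */
    static boolean intersects(long[] first, long[] second) {
        if (first.length != second.length) {
            throw new IllegalArgumentException("canopy sets differ in length");
        }
        for (int w = 0; w < first.length; w++) {
            if ((first[w] & second[w]) != 0L) {
                return true;
            }
        }
        return false;
    }

    /** Lists the canopy indices whose bits are set, in ascending order. */
    static List<Integer> members(long[] bits) {
        List<Integer> out = new ArrayList<>();
        for (int w = 0; w < bits.length; w++) {
            for (int b = 0; b < 64; b++) {
                if ((bits[w] & (1L << b)) != 0L) {
                    out.add(w * 64 + b);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        long[] a = new long[2];
        long[] b = new long[2];
        long[] c = new long[2];
        set(a, 3);
        set(a, 70);
        set(b, 70);
        set(c, 4);
        System.out.println(intersects(a, b)); // true: both contain canopy 70
        System.out.println(intersects(a, c)); // false: no shared canopy
        System.out.println(members(a));       // [3, 70]
    }
}
```

A word-wise AND keeps the test cheap even when there are many canopies, since 64 memberships are compared per operation.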
Methods inherited from class RandomizableClusterer: getSeed, seedTipText, setSeed
Methods inherited from class AbstractClusterer: clusterInstance, debugTipText, doNotCheckCapabilitiesTipText, forName, getDebug, getDoNotCheckCapabilities, getRevision, makeCopies, makeCopy, runClusterer, setDebug, setDoNotCheckCapabilities
public static final double DEFAULT_T2
public static final double DEFAULT_T1
public String globalInfo()
public TechnicalInformation getTechnicalInformation()
Specified by:
getTechnicalInformation in interface TechnicalInformationHandler
public Capabilities getCapabilities()
Specified by:
getCapabilities in interface Clusterer
getCapabilities in interface CapabilitiesHandler
Overrides:
getCapabilities in class AbstractClusterer
See also:
Capabilities
public Enumeration<Option> listOptions()
Specified by:
listOptions in interface OptionHandler
Overrides:
listOptions in class RandomizableClusterer
public void setOptions(String[] options) throws Exception
Parses a given list of options. Valid options are:
-N <num> Number of clusters. (default 2).
-t2 The T2 distance to use. Values < 0 indicate that a heuristic based on attribute std. deviation should be used to set this. Note that this heuristic can only be used when batch training (default = -1.0)
-t1 The T1 distance to use. A value < 0 is taken as a positive multiplier for T2. (default = -1.5)
-M Don't replace missing values with mean/mode when running in batch mode.
-S <num> Random number seed. (default 1)
Specified by:
setOptions in interface OptionHandler
Overrides:
setOptions in class RandomizableClusterer
Parameters:
options - the list of options as an array of strings
Throws:
Exception - if an option is not supported
public String[] getOptions()
Specified by:
getOptions in interface OptionHandler
Overrides:
getOptions in class RandomizableClusterer
public static boolean nonEmptyCanopySetIntersection(long[] first, long[] second) throws Exception
Parameters:
first - the first canopy set
second - the second canopy set
Throws:
Exception - if a problem occurs
public long[] assignCanopies(Instance inst) throws Exception
Parameters:
inst - the instance to find covering canopies for
Throws:
Exception - if a problem occurs
public void updateClusterer(Instance newInstance) throws Exception
Specified by:
updateClusterer in interface UpdateableClusterer
Parameters:
newInstance - the instance to be added
Throws:
Exception - if something goes wrong
public double[] distributionForInstance(Instance instance) throws Exception
Specified by:
distributionForInstance in interface Clusterer
Overrides:
distributionForInstance in class AbstractClusterer
Parameters:
instance - the instance to be assigned a cluster.
Throws:
Exception - if distribution could not be computed successfully
public void updateFinished()
Specified by:
updateFinished in interface UpdateableClusterer
public void initializeDistanceFunction(Instances init) throws Exception
Parameters:
init - the instances to initialize with
Throws:
Exception - if a problem occurs
public void buildClusterer(Instances data) throws Exception
Specified by:
buildClusterer in interface Clusterer
buildClusterer in class AbstractClusterer
Parameters:
data - set of instances serving as training data
Throws:
Exception - if the clusterer has not been generated successfully
public int numberOfClusters() throws Exception
Specified by:
numberOfClusters in interface Clusterer
numberOfClusters in class AbstractClusterer
Throws:
Exception - if number of clusters could not be returned successfully
public void setMissingValuesReplacer(Filter missingReplacer)
Parameters:
missingReplacer - the missing values replacement filter to use
public Instances getCanopies()
public void setCanopies(Instances canopies)
Parameters:
canopies - the canopies to use
public List<long[]> getClusterCanopyAssignments()
public void setClusterCanopyAssignments(List<long[]> clusterCanopies)
Parameters:
clusterCanopies - the list of canopies for each cluster center
public double getActualT2()
public double getActualT1()
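The option text for -t1 says that a value below zero is taken as a positive multiplier for T2, which is why the actual T1 can differ from the configured one. A minimal sketch of that rule, under the assumption that the multiplier is simply the absolute value of the configured T1 applied to the already resolved T2, might look like this. `ThresholdSketch` and `resolveT1` are hypothetical names, and the standard-deviation heuristic for T2 is deliberately left out because this page does not describe it.

```java
/** Hypothetical illustration of the documented T1 rule; not Weka's source code. */
final class ThresholdSketch {

    /**
     * Resolves the effective T1 distance.
     * A non-negative configured value is used as-is; a negative value is
     * read as a multiplier (its absolute value) applied to the actual T2.
     */
    static double resolveT1(double configuredT1, double actualT2) {
        return configuredT1 < 0 ? -configuredT1 * actualT2 : configuredT1;
    }

    public static void main(String[] args) {
        // With the default -t1 of -1.5 and an actual T2 of 2.0, T1 becomes 3.0.
        System.out.println(resolveT1(-1.5, 2.0)); // 3.0
        // An explicit positive T1 is returned unchanged.
        System.out.println(resolveT1(4.0, 2.0));  // 4.0
    }
}
```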
public String t1TipText()
public void setT1(double t1)
Parameters:
t1 - the T1 distance to use
public double getT1()
public String t2TipText()
public void setT2(double t2)
Parameters:
t2 - the T2 distance to use
public double getT2()
public String numClustersTipText()
public void setNumClusters(int numClusters) throws Exception
Specified by:
setNumClusters in interface NumberOfClustersRequestable
Parameters:
numClusters - the number of clusters to generate
Throws:
Exception - if the requested number of clusters is inappropriate
public String dontReplaceMissingValuesTipText()
public void setDontReplaceMissingValues(boolean r)
Parameters:
r - true if missing values are not to be replaced
public boolean getDontReplaceMissingValues()
public int getNumClusters()
public static String printSingleAssignment(long[] assignments)
public static String printCanopyAssignments(Instances dataPoints, List<long[]> canopyAssignments)
Parameters:
dataPoints - the instances to print
canopyAssignments - the canopy assignments, one assignment array for each instance
public String toString(boolean header)
Parameters:
header - true if the header should be printed
public static void main(String[] args)
Copyright © 2014 University of Waikato, Hamilton, NZ. All Rights Reserved.