Package adams.tools
Class CompareDatasets
- java.lang.Object
-
- adams.core.logging.LoggingObject
-
- adams.core.logging.CustomLoggingLevelObject
-
- adams.core.option.AbstractOptionHandler
-
- adams.tools.AbstractTool
-
- adams.tools.CompareDatasets
-
- All Implemented Interfaces:
adams.core.CleanUpHandler, adams.core.Destroyable, adams.core.GlobalInfoSupporter, adams.core.io.FileWriter, adams.core.logging.LoggingLevelHandler, adams.core.logging.LoggingSupporter, adams.core.option.OptionHandler, adams.core.SizeOfHandler, adams.core.Stoppable, adams.core.StoppableWithFeedback, adams.tools.OutputFileGenerator, Serializable, Comparable
public class CompareDatasets extends adams.tools.AbstractTool implements adams.tools.OutputFileGenerator
Compares two datasets, either row-by-row or using a row attribute listing a unique ID for matching the rows, outputting the correlation coefficient of the numeric attributes found in the ranges defined by the user.
To reduce the number of generated rows, a threshold can be specified. Only rows whose correlation coefficient is below that threshold are output.
Valid options are:
-D <int> (property: debugLevel) The greater the number the more additional info the scheme may output to the console (0 = off). default: 0 minimum: 0
-dataset1 <adams.core.io.PlaceholderFile> (property: dataset1) The first dataset in the comparison. default: .
-range1 <java.lang.String> (property: range1) The range of attributes of the first dataset. default: first-last
-row1 <java.lang.String> (property: rowAttribute1) The index for the attribute used for identifying rows to compare; if not provided, then the comparison is performed row-by-row (first dataset). default:
-dataset2 <adams.core.io.PlaceholderFile> (property: dataset2) The second dataset in the comparison. default: .
-range2 <java.lang.String> (property: range2) The range of attributes of the second dataset. default: first-last
-row2 <java.lang.String> (property: rowAttribute2) The index for the attribute used for identifying rows to compare; if not provided, then the comparison is performed row-by-row (second dataset). default:
-output <adams.core.io.PlaceholderFile> (property: outputFile) The file to save the comparison result in (CSV format). default: output.csv
-missing <adams.core.io.PlaceholderFile> (property: missing) The file to save the information about missing rows to (CSV format). default: missing.csv
-threshold <double> (property: threshold) The threshold for the correlation coefficient; only if the coefficient is below that threshold, it will get output; 0.0 turns the threshold off. default: 0.0 minimum: 0.0 maximum: 1.0
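The threshold rule described above can be illustrated with a minimal sketch. It is not taken from the ADAMS source: the class `ThresholdSketch` and its methods are hypothetical, and it assumes the raw coefficient (not its absolute value) is compared against the threshold.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the -threshold option; not the ADAMS implementation.
class ThresholdSketch {

  /** Returns true if a row with this coefficient would be written to the output. */
  static boolean shouldOutput(double correlation, double threshold) {
    // 0.0 turns the threshold off, so every row is kept.
    if (threshold == 0.0)
      return true;
    // Otherwise only coefficients strictly below the threshold are kept.
    return correlation < threshold;
  }

  /** Keeps only the coefficients that pass the threshold rule. */
  static List<Double> filter(double[] correlations, double threshold) {
    List<Double> kept = new ArrayList<>();
    for (double c : correlations) {
      if (shouldOutput(c, threshold))
        kept.add(c);
    }
    return kept;
  }

  public static void main(String[] args) {
    double[] correlations = {0.99, 0.42, 0.87};
    System.out.println(filter(correlations, 0.9)); // rows below 0.9 only
    System.out.println(filter(correlations, 0.0)); // threshold off, all rows
  }
}
```

With a threshold of 0.9, the nearly identical row (0.99) is dropped, which is how the option trims the output down to rows that differ noticeably.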
- Author:
- fracpete (fracpete at waikato dot ac dot nz)
Field Summary
Fields (modifier and type, field, description):
protected weka.core.Instances m_Data1: the current dataset 1.
protected weka.core.Instances m_Data2: the current dataset 2.
protected adams.core.io.PlaceholderFile m_Dataset1: the first dataset.
protected adams.core.io.PlaceholderFile m_Dataset2: the second dataset.
protected int[] m_Indices1: the indices for the first dataset.
protected int[] m_Indices2: the indices for the second dataset.
protected Hashtable<String,Integer> m_Lookup2: the lookup table of indices for the second dataset.
protected adams.core.io.PlaceholderFile m_Missing: the output file for missing tests (CSV format).
protected adams.core.io.PlaceholderFile m_OutputFile: the output file (CSV format).
protected adams.core.Range m_Range1: the first range of attributes.
protected adams.core.Range m_Range2: the second range of attributes.
protected adams.core.Index m_RowAttribute1: the optional attribute for matching up rows (dataset 1).
protected adams.core.Index m_RowAttribute2: the optional attribute for matching up rows (dataset 2).
protected boolean m_RowAttributeIsString: whether the row attribute is a string/nominal attribute or not.
protected double m_Threshold: the threshold for listing correlations.
protected Boolean m_UseRowAttribute: whether to use the row attribute or not.
-
Constructor Summary
Constructors (constructor, description):
CompareDatasets()
-
Method Summary
Methods (modifier and type, method, description):
void cleanUp(): Cleans up data structures, frees up memory.
String dataset1TipText(): Returns the tip text for this property.
String dataset2TipText(): Returns the tip text for this property.
void defineOptions(): Adds options to the internal list of options.
protected void doRun(): Performs the comparison.
protected double getCorrelation(weka.core.Instance first, weka.core.Instance second): Returns the correlation between the two rows.
adams.core.io.PlaceholderFile getDataset1(): Returns the first dataset for the comparison.
adams.core.io.PlaceholderFile getDataset2(): Returns the second dataset for the comparison.
adams.core.io.PlaceholderFile getMissing(): Returns the file to save the information about missing rows to.
adams.core.io.PlaceholderFile getOutputFile(): Returns the file to save the comparison result in.
adams.core.Range getRange1(): Returns the range of attributes of the first dataset.
adams.core.Range getRange2(): Returns the range of attributes of the second dataset.
String getRowAttribute1(): Returns the index of the attribute used for identifying rows to compare against each other (first dataset).
String getRowAttribute2(): Returns the index of the attribute used for identifying rows to compare against each other (second dataset).
protected String getRowID(int index): Returns the ID for the row: either the row index or the actual row attribute ID for that position.
double getThreshold(): Returns the threshold for the correlation coefficient.
protected boolean getUseRowAttribute(): Returns whether to use the row attribute or the order in the datasets for matching up the rows.
String globalInfo(): Returns a string describing the object.
protected void initialize(): Initializes the members.
protected void initLookup(): Initializes the lookup table of indices for the second dataset, if necessary.
String missingTipText(): Returns the tip text for this property.
protected weka.core.Instance[] next(int index): Returns the next row pair to compare.
protected weka.core.Instance[] nextByIndex(int index): Returns the next pair by simple index.
protected weka.core.Instance[] nextByRowAttribute(int index): Returns the next pair by using the value of the row attribute.
String outputFileTipText(): Returns the tip text for this property.
protected void preRun(): Before the actual run is executed.
String range1TipText(): Returns the tip text for this property.
String range2TipText(): Returns the tip text for this property.
String rowAttribute1TipText(): Returns the tip text for this property.
String rowAttribute2TipText(): Returns the tip text for this property.
void setDataset1(adams.core.io.PlaceholderFile value): Sets the first dataset for the comparison.
void setDataset2(adams.core.io.PlaceholderFile value): Sets the second dataset for the comparison.
void setMissing(adams.core.io.PlaceholderFile value): Sets the file to save the information about missing rows to.
void setOutputFile(adams.core.io.PlaceholderFile value): Sets the file to save the comparison result in.
void setRange1(adams.core.Range value): Sets the range of attributes of the first dataset.
void setRange2(adams.core.Range value): Sets the range of attributes of the second dataset.
void setRowAttribute1(String value): Sets the index of the attribute used for identifying rows to compare against each other (first dataset).
void setRowAttribute2(String value): Sets the index of the attribute used for identifying rows to compare against each other (second dataset).
void setThreshold(double value): Sets the threshold for the correlation coefficient.
String thresholdTipText(): Returns the tip text for this property.
-
Methods inherited from class adams.tools.AbstractTool
compareTo, destroy, equals, forCommandLine, forName, getTools, isStopped, postRun, run, runTool, stopExecution
-
Methods inherited from class adams.core.option.AbstractOptionHandler
cleanUpOptions, finishInit, getDefaultLoggingLevel, getOptionManager, loggingLevelTipText, newOptionManager, reset, setLoggingLevel, toCommandLine, toString
-
Methods inherited from class adams.core.logging.LoggingObject
configureLogger, getLogger, getLoggingLevel, initializeLogging, isLoggingEnabled, sizeOf
-
-
-
-
Field Detail
-
m_Dataset1
protected adams.core.io.PlaceholderFile m_Dataset1
the first dataset.
-
m_Range1
protected adams.core.Range m_Range1
the first range of attributes.
-
m_RowAttribute1
protected adams.core.Index m_RowAttribute1
the optional attribute for matching up rows (dataset 1).
-
m_Dataset2
protected adams.core.io.PlaceholderFile m_Dataset2
the second dataset.
-
m_Range2
protected adams.core.Range m_Range2
the second range of attributes.
-
m_RowAttribute2
protected adams.core.Index m_RowAttribute2
the optional attribute for matching up rows (dataset 2).
-
m_OutputFile
protected adams.core.io.PlaceholderFile m_OutputFile
the output file (CSV format).
-
m_Missing
protected adams.core.io.PlaceholderFile m_Missing
the output file for missing tests (CSV format).
-
m_Data1
protected weka.core.Instances m_Data1
the current dataset 1.
-
m_Data2
protected weka.core.Instances m_Data2
the current dataset 2.
-
m_UseRowAttribute
protected Boolean m_UseRowAttribute
whether to use the row attribute or not.
-
m_RowAttributeIsString
protected boolean m_RowAttributeIsString
whether the row attribute is a string/nominal attribute or not.
-
m_Indices1
protected int[] m_Indices1
the indices for the first dataset.
-
m_Indices2
protected int[] m_Indices2
the indices for the second dataset.
-
m_Lookup2
protected Hashtable<String,Integer> m_Lookup2
the lookup table of indices for the second dataset.
-
m_Threshold
protected double m_Threshold
the threshold for listing correlations.
-
-
Method Detail
-
globalInfo
public String globalInfo()
Returns a string describing the object.
- Specified by:
globalInfo in interface adams.core.GlobalInfoSupporter
- Specified by:
globalInfo in class adams.core.option.AbstractOptionHandler
- Returns:
- a description suitable for displaying in the GUI
-
defineOptions
public void defineOptions()
Adds options to the internal list of options.
- Specified by:
defineOptions in interface adams.core.option.OptionHandler
- Overrides:
defineOptions in class adams.core.option.AbstractOptionHandler
-
initialize
protected void initialize()
Initializes the members.
- Overrides:
initialize in class adams.core.option.AbstractOptionHandler
-
setDataset1
public void setDataset1(adams.core.io.PlaceholderFile value)
Sets the first dataset for the comparison.
- Parameters:
value - the dataset
-
getDataset1
public adams.core.io.PlaceholderFile getDataset1()
Returns the first dataset for the comparison.
- Returns:
- the dataset
-
dataset1TipText
public String dataset1TipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the GUI or for listing the options.
-
setDataset2
public void setDataset2(adams.core.io.PlaceholderFile value)
Sets the second dataset for the comparison.
- Parameters:
value - the dataset
-
getDataset2
public adams.core.io.PlaceholderFile getDataset2()
Returns the second dataset for the comparison.
- Returns:
- the dataset
-
dataset2TipText
public String dataset2TipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the GUI or for listing the options.
-
setRange1
public void setRange1(adams.core.Range value)
Sets the range of attributes of the first dataset.
- Parameters:
value - the range
-
getRange1
public adams.core.Range getRange1()
Returns the range of attributes of the first dataset.
- Returns:
- the range
-
range1TipText
public String range1TipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the GUI or for listing the options.
-
setRange2
public void setRange2(adams.core.Range value)
Sets the range of attributes of the second dataset.
- Parameters:
value - the range
-
getRange2
public adams.core.Range getRange2()
Returns the range of attributes of the second dataset.
- Returns:
- the range
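How a range string such as the default first-last could translate into the attribute indices held in m_Indices1 and m_Indices2 can be sketched as follows. This is a simplified stand-in, not the adams.core.Range implementation: the class `RangeSketch` is hypothetical, it assumes 1-based indices in the string, and it handles only plain numbers, "first", "last", and "from-to" parts separated by commas.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified range resolver; not the adams.core.Range implementation.
class RangeSketch {

  /** Converts one 1-based token ("first", "last" or a number) to a 0-based index. */
  static int parseIndex(String token, int numAttributes) {
    token = token.trim();
    if (token.equals("first"))
      return 0;
    if (token.equals("last"))
      return numAttributes - 1;
    return Integer.parseInt(token) - 1;
  }

  /** Resolves a range string to 0-based attribute indices, ignoring out-of-bounds values. */
  static int[] resolve(String range, int numAttributes) {
    List<Integer> result = new ArrayList<>();
    for (String part : range.split(",")) {
      int dash = part.indexOf('-');
      String fromToken = (dash < 0) ? part : part.substring(0, dash);
      String toToken = (dash < 0) ? part : part.substring(dash + 1);
      int from = parseIndex(fromToken, numAttributes);
      int to = parseIndex(toToken, numAttributes);
      for (int i = from; i <= to; i++) {
        if (i >= 0 && i < numAttributes)
          result.add(i);
      }
    }
    int[] indices = new int[result.size()];
    for (int i = 0; i < indices.length; i++)
      indices[i] = result.get(i);
    return indices;
  }

  public static void main(String[] args) {
    System.out.println(Arrays.toString(resolve("first-last", 4))); // [0, 1, 2, 3]
    System.out.println(Arrays.toString(resolve("1,3-4", 4)));      // [0, 2, 3]
  }
}
```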
-
range2TipText
public String range2TipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the GUI or for listing the options.
-
setRowAttribute1
public void setRowAttribute1(String value)
Sets the index of the attribute used for identifying rows to compare against each other (first dataset).
- Parameters:
value - the index
-
getRowAttribute1
public String getRowAttribute1()
Returns the index of the attribute used for identifying rows to compare against each other (first dataset).
- Returns:
- the index
-
rowAttribute1TipText
public String rowAttribute1TipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the GUI or for listing the options.
-
setRowAttribute2
public void setRowAttribute2(String value)
Sets the index of the attribute used for identifying rows to compare against each other (second dataset).
- Parameters:
value - the index
-
getRowAttribute2
public String getRowAttribute2()
Returns the index of the attribute used for identifying rows to compare against each other (second dataset).
- Returns:
- the index
-
rowAttribute2TipText
public String rowAttribute2TipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the GUI or for listing the options.
-
setOutputFile
public void setOutputFile(adams.core.io.PlaceholderFile value)
Sets the file to save the comparison result in (CSV format).
- Specified by:
setOutputFile in interface adams.core.io.FileWriter
- Parameters:
value - the output file
-
getOutputFile
public adams.core.io.PlaceholderFile getOutputFile()
Returns the file to save the comparison result in (CSV format).
- Specified by:
getOutputFile in interface adams.core.io.FileWriter
- Returns:
- the output file
-
outputFileTipText
public String outputFileTipText()
Returns the tip text for this property.
- Specified by:
outputFileTipText in interface adams.core.io.FileWriter
- Returns:
- tip text for this property suitable for displaying in the GUI or for listing the options.
-
setMissing
public void setMissing(adams.core.io.PlaceholderFile value)
Sets the file to save the information about missing rows to (CSV format).
- Parameters:
value - the file for the missing rows
-
getMissing
public adams.core.io.PlaceholderFile getMissing()
Returns the file to save the information about missing rows to (CSV format).
- Returns:
- the file for the missing rows
-
missingTipText
public String missingTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the GUI or for listing the options.
-
setThreshold
public void setThreshold(double value)
Sets the threshold for the correlation coefficient.
- Parameters:
value - the threshold (0.0 turns it off)
-
getThreshold
public double getThreshold()
Returns the threshold for the correlation coefficient.
- Returns:
- the threshold (0.0 means it is turned off)
-
thresholdTipText
public String thresholdTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the GUI or for listing the options.
-
preRun
protected void preRun()
Before the actual run is executed.
- Overrides:
preRun in class adams.tools.AbstractTool
-
getUseRowAttribute
protected boolean getUseRowAttribute()
Returns whether to use the row attribute or the order in the datasets for matching up the rows.
- Returns:
- true if the row attribute is used for matching
-
getRowID
protected String getRowID(int index)
Returns the ID for the row: either the row index or the actual row attribute ID for that position.
- Parameters:
index - the index to get the ID for
- Returns:
- the ID
-
nextByIndex
protected weka.core.Instance[] nextByIndex(int index)
Returns the next pair by simple index.
- Parameters:
index - the index of the pair to retrieve
- Returns:
- the row pair or null if not available
-
initLookup
protected void initLookup()
Initializes the lookup table of indices for the second dataset, if necessary.
-
nextByRowAttribute
protected weka.core.Instance[] nextByRowAttribute(int index)
Returns the next pair by using the value of the row attribute.
- Parameters:
index - the index of the pair to retrieve
- Returns:
- the row pair or null if not available
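The interplay of a lookup table and row-attribute matching can be sketched in plain Java. This is an assumption-laden illustration rather than the actual ADAMS code: `RowLookupSketch`, `buildLookup` and `findPair` are hypothetical names, row IDs are modelled as plain strings, and a pair is represented by two row indices instead of weka.core.Instance objects.

```java
import java.util.Hashtable;

// Hypothetical sketch of ID-based row matching; not the ADAMS implementation.
class RowLookupSketch {

  /** Maps each row ID of the second dataset to its row index. */
  static Hashtable<String, Integer> buildLookup(String[] ids2) {
    Hashtable<String, Integer> lookup = new Hashtable<>();
    for (int i = 0; i < ids2.length; i++)
      lookup.put(ids2[i], i);
    return lookup;
  }

  /**
   * Returns {row index in dataset 1, row index in dataset 2} for the row at
   * the given position of the first dataset, or null if no pair is available.
   */
  static int[] findPair(int index, String[] ids1, Hashtable<String, Integer> lookup) {
    if (index < 0 || index >= ids1.length)
      return null;
    Integer match = lookup.get(ids1[index]);
    if (match == null)
      return null; // the ID has no counterpart in the second dataset
    return new int[]{index, match};
  }

  public static void main(String[] args) {
    String[] ids1 = {"a", "c"};
    Hashtable<String, Integer> lookup = buildLookup(new String[]{"b", "a"});
    int[] pair = findPair(0, ids1, lookup);
    System.out.println(pair[0] + " <-> " + pair[1]);  // row "a": 0 <-> 1
    System.out.println(findPair(1, ids1, lookup));    // row "c": null
  }
}
```

Building the table once keeps each match a constant-time lookup, and rows without a counterpart are natural candidates for the file given by the -missing option.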
-
next
protected weka.core.Instance[] next(int index)
Returns the next row pair to compare.
- Parameters:
index - the index of the pair to retrieve
- Returns:
- the row pair or null if not available
-
getCorrelation
protected double getCorrelation(weka.core.Instance first, weka.core.Instance second)
Returns the correlation between the two rows.
- Parameters:
first - the first row
second - the second row
- Returns:
- the correlation
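As an illustration of correlating two rows, the following sketch computes the Pearson correlation coefficient over two equally long arrays of numeric values. Whether getCorrelation uses exactly this formula is an assumption, and `CorrelationSketch` is a hypothetical class; the arrays stand in for the numeric attribute values selected by the two ranges.

```java
// Hypothetical sketch; assumes a Pearson coefficient, which the documentation does not confirm.
class CorrelationSketch {

  /** Returns the Pearson correlation of x and y, or NaN if either has zero variance. */
  static double pearson(double[] x, double[] y) {
    if (x.length != y.length || x.length == 0)
      throw new IllegalArgumentException("Rows must be non-empty and of equal length");
    int n = x.length;
    double meanX = 0.0;
    double meanY = 0.0;
    for (int i = 0; i < n; i++) {
      meanX += x[i];
      meanY += y[i];
    }
    meanX /= n;
    meanY /= n;
    double cov = 0.0;
    double varX = 0.0;
    double varY = 0.0;
    for (int i = 0; i < n; i++) {
      double dx = x[i] - meanX;
      double dy = y[i] - meanY;
      cov += dx * dy;
      varX += dx * dx;
      varY += dy * dy;
    }
    double denom = Math.sqrt(varX * varY);
    return (denom == 0.0) ? Double.NaN : cov / denom;
  }

  public static void main(String[] args) {
    double[] first = {1.0, 2.0, 3.0};
    double[] second = {2.0, 4.0, 6.0};
    System.out.println(pearson(first, second)); // perfectly correlated rows
  }
}
```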
-
doRun
protected void doRun()
Performs the comparison.
- Specified by:
doRun in class adams.tools.AbstractTool
-
cleanUp
public void cleanUp()
Cleans up data structures, frees up memory.
- Specified by:
cleanUp in interface adams.core.CleanUpHandler
- Overrides:
cleanUp in class adams.tools.AbstractTool
-
-