adams.tools
Class CompareDatasets

java.lang.Object
  extended by adams.core.ConsoleObject
      extended by adams.core.option.AbstractOptionHandler
          extended by adams.tools.AbstractTool
              extended by adams.tools.CompareDatasets
All Implemented Interfaces:
CleanUpHandler, Debuggable, Destroyable, OptionHandler, SizeOfHandler, OutputFileGenerator, Serializable, Comparable

public class CompareDatasets
extends AbstractTool
implements OutputFileGenerator

Compares two datasets, either row-by-row or using a row attribute listing a unique ID for matching the rows, outputting the correlation coefficient of the numeric attributes found in the ranges defined by the user.
To reduce the number of generated rows, a threshold can be specified. Only rows whose correlation coefficient is below that threshold are output.

Valid options are:

-D <int> (property: debugLevel)
    The greater the number the more additional info the scheme may output to
    the console (0 = off).
    default: 0
    minimum: 0
 
-dataset1 <adams.core.io.PlaceholderFile> (property: dataset1)
    The first dataset in the comparison.
    default: .
 
-range1 <java.lang.String> (property: range1)
    The range of attributes of the first dataset.
    default: first-last
 
-row1 <java.lang.String> (property: rowAttribute1)
    The index for the attribute used for identifying rows to compare; if not
    provided, then the comparison is performed row-by-row (first dataset).
    default:
 
-dataset2 <adams.core.io.PlaceholderFile> (property: dataset2)
    The second dataset in the comparison.
    default: .
 
-range2 <java.lang.String> (property: range2)
    The range of attributes of the second dataset.
    default: first-last
 
-row2 <java.lang.String> (property: rowAttribute2)
    The index for the attribute used for identifying rows to compare; if not
    provided, then the comparison is performed row-by-row (second dataset).
    default:
 
-output <adams.core.io.PlaceholderFile> (property: outputFile)
    The file to save the comparison result in (CSV format).
    default: output.csv
 
-missing <adams.core.io.PlaceholderFile> (property: missing)
    The file to save the information about missing rows to (CSV format).
    default: missing.csv
 
-threshold <double> (property: threshold)
    The threshold for the correlation coefficient; a row is only output if
    its coefficient is below that threshold; 0.0 turns the threshold off.
    default: 0.0
    minimum: 0.0
    maximum: 1.0
 

Version:
$Revision: 5563 $
Author:
fracpete (fracpete at waikato dot ac dot nz)
See Also:
Serialized Form

Field Summary
protected  weka.core.Instances m_Data1
          the current dataset 1.
protected  weka.core.Instances m_Data2
          the current dataset 2.
protected  PlaceholderFile m_Dataset1
          the first dataset.
protected  PlaceholderFile m_Dataset2
          the second dataset.
protected  int[] m_Indices1
          the indices for the first dataset.
protected  int[] m_Indices2
          the indices for the second dataset.
protected  Hashtable<String,Integer> m_Lookup2
          the lookup table of indices for the second dataset.
protected  PlaceholderFile m_Missing
          the output file for missing rows (CSV format).
protected  PlaceholderFile m_OutputFile
          the output file (CSV format).
protected  Range m_Range1
          the first range of attributes.
protected  Range m_Range2
          the second range of attributes.
protected  Index m_RowAttribute1
          the optional attribute for matching up rows (dataset 1).
protected  Index m_RowAttribute2
          the optional attribute for matching up rows (dataset 2).
protected  boolean m_RowAttributeIsString
          whether the row attribute is a string/nominal attribute or not.
protected  double m_Threshold
          the threshold for listing correlations.
protected  Boolean m_UseRowAttribute
          whether to use the row attribute or not.
 
Fields inherited from class adams.core.option.AbstractOptionHandler
m_DebugLevel, m_OptionManager
 
Constructor Summary
CompareDatasets()
           
 
Method Summary
 void cleanUp()
          Cleans up data structures, frees up memory.
 String dataset1TipText()
          Returns the tip text for this property.
 String dataset2TipText()
          Returns the tip text for this property.
 void defineOptions()
          Adds options to the internal list of options.
protected  void doRun()
          Performs the comparison.
protected  double getCorrelation(weka.core.Instance first, weka.core.Instance second)
          Returns the correlation between the two rows.
 PlaceholderFile getDataset1()
          Returns the first dataset for the comparison.
 PlaceholderFile getDataset2()
          Returns the second dataset for the comparison.
 PlaceholderFile getMissing()
          Returns the file to save the information about missing rows to.
 PlaceholderFile getOutputFile()
          Returns the file to save the comparison result in.
 Range getRange1()
          Returns the range of attributes of the first dataset.
 Range getRange2()
          Returns the range of attributes of the second dataset.
 String getRowAttribute1()
          Returns the index of the attribute used for identifying rows to compare against each other (first dataset).
 String getRowAttribute2()
          Returns the index of the attribute used for identifying rows to compare against each other (second dataset).
protected  String getRowID(int index)
          Returns the ID for the row: either the row index or the actual row attribute ID for that position.
 double getThreshold()
          Returns the threshold for the correlation coefficient.
protected  boolean getUseRowAttribute()
          Returns whether to use the row attribute or the order in the datasets for matching up the rows.
 String globalInfo()
          Returns a string describing the object.
protected  void initialize()
          Initializes the members.
protected  void initLookup()
          Initializes the lookup table of indices for the second dataset, if necessary.
 String missingTipText()
          Returns the tip text for this property.
protected  weka.core.Instance[] next(int index)
          Returns the next row pair to compare.
protected  weka.core.Instance[] nextByIndex(int index)
          Returns the next pair by simple index.
protected  weka.core.Instance[] nextByRowAttribute(int index)
          Returns the next pair by using the value of the row attribute.
 String outputFileTipText()
          Returns the tip text for this property.
protected  void preRun()
          Gets executed before the actual run.
 String range1TipText()
          Returns the tip text for this property.
 String range2TipText()
          Returns the tip text for this property.
 String rowAttribute1TipText()
          Returns the tip text for this property.
 String rowAttribute2TipText()
          Returns the tip text for this property.
 void setDataset1(PlaceholderFile value)
          Sets the first dataset for the comparison.
 void setDataset2(PlaceholderFile value)
          Sets the second dataset for the comparison.
 void setMissing(PlaceholderFile value)
          Sets the file to save the information about missing rows to.
 void setOutputFile(PlaceholderFile value)
          Sets the file to save the comparison result in.
 void setRange1(Range value)
          Sets the range of attributes of the first dataset.
 void setRange2(Range value)
          Sets the range of attributes of the second dataset.
 void setRowAttribute1(String value)
          Sets the index of the attribute used for identifying rows to compare against each other (first dataset).
 void setRowAttribute2(String value)
          Sets the index of the attribute used for identifying rows to compare against each other (second dataset).
 void setThreshold(double value)
          Sets the threshold for the correlation coefficient.
 String thresholdTipText()
          Returns the tip text for this property.
 
Methods inherited from class adams.tools.AbstractTool
compareTo, destroy, equals, forCommandLine, forName, getTools, postRun, run
 
Methods inherited from class adams.core.option.AbstractOptionHandler
cleanUpOptions, debug, debug, debugLevelTipText, finishInit, getDebugLevel, getOptionManager, isDebugOn, newOptionManager, reset, setDebugLevel, toCommandLine, toString
 
Methods inherited from class adams.core.ConsoleObject
getDebugging, getSystemErr, getSystemOut, sizeOf
 
Methods inherited from class java.lang.Object
clone, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
 

Field Detail

m_Dataset1

protected PlaceholderFile m_Dataset1
the first dataset.


m_Range1

protected Range m_Range1
the first range of attributes.


m_RowAttribute1

protected Index m_RowAttribute1
the optional attribute for matching up rows (dataset 1).


m_Dataset2

protected PlaceholderFile m_Dataset2
the second dataset.


m_Range2

protected Range m_Range2
the second range of attributes.


m_RowAttribute2

protected Index m_RowAttribute2
the optional attribute for matching up rows (dataset 2).


m_OutputFile

protected PlaceholderFile m_OutputFile
the output file (CSV format).


m_Missing

protected PlaceholderFile m_Missing
the output file for missing rows (CSV format).


m_Data1

protected weka.core.Instances m_Data1
the current dataset 1.


m_Data2

protected weka.core.Instances m_Data2
the current dataset 2.


m_UseRowAttribute

protected Boolean m_UseRowAttribute
whether to use the row attribute or not.


m_RowAttributeIsString

protected boolean m_RowAttributeIsString
whether the row attribute is a string/nominal attribute or not.


m_Indices1

protected int[] m_Indices1
the indices for the first dataset.


m_Indices2

protected int[] m_Indices2
the indices for the second dataset.


m_Lookup2

protected Hashtable<String,Integer> m_Lookup2
the lookup table of indices for the second dataset.


m_Threshold

protected double m_Threshold
the threshold for listing correlations.

Constructor Detail

CompareDatasets

public CompareDatasets()
Method Detail

globalInfo

public String globalInfo()
Returns a string describing the object.

Specified by:
globalInfo in class AbstractOptionHandler
Returns:
a description suitable for displaying in the GUI

defineOptions

public void defineOptions()
Adds options to the internal list of options.

Specified by:
defineOptions in interface OptionHandler
Overrides:
defineOptions in class AbstractOptionHandler

initialize

protected void initialize()
Initializes the members.

Overrides:
initialize in class AbstractOptionHandler

setDataset1

public void setDataset1(PlaceholderFile value)
Sets the first dataset for the comparison.

Parameters:
value - the dataset

getDataset1

public PlaceholderFile getDataset1()
Returns the first dataset for the comparison.

Returns:
the dataset

dataset1TipText

public String dataset1TipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the GUI or for listing the options.

setDataset2

public void setDataset2(PlaceholderFile value)
Sets the second dataset for the comparison.

Parameters:
value - the dataset

getDataset2

public PlaceholderFile getDataset2()
Returns the second dataset for the comparison.

Returns:
the dataset

dataset2TipText

public String dataset2TipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the GUI or for listing the options.

setRange1

public void setRange1(Range value)
Sets the range of attributes of the first dataset.

Parameters:
value - the range

getRange1

public Range getRange1()
Returns the range of attributes of the first dataset.

Returns:
the range
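
The default value first-last suggests human-readable, 1-based ranges. As a rough illustration only, the sketch below expands such a string into 0-based attribute indices. The class RangeSketch and its toIndices method are hypothetical and not part of ADAMS; the actual Range class may accept a richer syntax and behave differently on malformed input.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not the ADAMS Range class.
final class RangeSketch {

  // Converts "first", "last" or a 1-based number into a 0-based index.
  private static int parseIndex(String s, int max) {
    s = s.trim();
    if (s.equals("first"))
      return 0;
    if (s.equals("last"))
      return max - 1;
    return Integer.parseInt(s) - 1;
  }

  // Expands a comma-separated list of single indices and "from-to"
  // sub-ranges into 0-based indices; max is the number of attributes.
  static int[] toIndices(String range, int max) {
    List<Integer> result = new ArrayList<>();
    for (String part : range.split(",")) {
      String[] bounds = part.split("-");
      int from = parseIndex(bounds[0], max);
      int to = parseIndex(bounds[bounds.length - 1], max);
      // indices outside 0..max-1 are silently skipped (an assumption)
      for (int i = from; i <= to && i < max; i++) {
        if (i >= 0)
          result.add(i);
      }
    }
    int[] indices = new int[result.size()];
    for (int i = 0; i < indices.length; i++)
      indices[i] = result.get(i);
    return indices;
  }
}
```

Under these assumptions, RangeSketch.toIndices("first-last", 4) yields the indices 0, 1, 2, 3, and RangeSketch.toIndices("1,3-4", 5) yields 0, 2, 3.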

range1TipText

public String range1TipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the GUI or for listing the options.

setRange2

public void setRange2(Range value)
Sets the range of attributes of the second dataset.

Parameters:
value - the range

getRange2

public Range getRange2()
Returns the range of attributes of the second dataset.

Returns:
the range

range2TipText

public String range2TipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the GUI or for listing the options.

setRowAttribute1

public void setRowAttribute1(String value)
Sets the index of the attribute used for identifying rows to compare against each other (first dataset).

Parameters:
value - the index

getRowAttribute1

public String getRowAttribute1()
Returns the index of the attribute used for identifying rows to compare against each other (first dataset).

Returns:
the index

rowAttribute1TipText

public String rowAttribute1TipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the GUI or for listing the options.

setRowAttribute2

public void setRowAttribute2(String value)
Sets the index of the attribute used for identifying rows to compare against each other (second dataset).

Parameters:
value - the index

getRowAttribute2

public String getRowAttribute2()
Returns the index of the attribute used for identifying rows to compare against each other (second dataset).

Returns:
the index

rowAttribute2TipText

public String rowAttribute2TipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the GUI or for listing the options.

setOutputFile

public void setOutputFile(PlaceholderFile value)
Sets the file to save the comparison result in (CSV format).

Specified by:
setOutputFile in interface OutputFileGenerator
Parameters:
value - the output file

getOutputFile

public PlaceholderFile getOutputFile()
Returns the file to save the comparison result in (CSV format).

Specified by:
getOutputFile in interface OutputFileGenerator
Returns:
the output file

outputFileTipText

public String outputFileTipText()
Returns the tip text for this property.

Specified by:
outputFileTipText in interface OutputFileGenerator
Returns:
tip text for this property suitable for displaying in the GUI or for listing the options.

setMissing

public void setMissing(PlaceholderFile value)
Sets the file to save the information about missing rows to (CSV format).

Parameters:
value - the file for the missing rows

getMissing

public PlaceholderFile getMissing()
Returns the file to save the information about missing rows to (CSV format).

Returns:
the file for the missing rows

missingTipText

public String missingTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the GUI or for listing the options.

setThreshold

public void setThreshold(double value)
Sets the threshold for the correlation coefficient.

Parameters:
value - the threshold (0.0 turns it off)

getThreshold

public double getThreshold()
Returns the threshold for the correlation coefficient.

Returns:
the threshold (0.0 means it is turned off)
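
The threshold semantics described for this property can be sketched as a small predicate. This is a minimal illustration, assuming a plain "below the threshold" comparison; the class ThresholdSketch and the method isOutput are hypothetical names, and the tool may apply the check differently internally.

```java
// Hypothetical helper illustrating the documented threshold rule.
final class ThresholdSketch {

  // Returns true if a row with the given correlation would be output.
  static boolean isOutput(double correlation, double threshold) {
    // 0.0 turns the threshold off, so every row is output
    if (threshold == 0.0)
      return true;
    // otherwise only rows below the threshold are output
    return correlation < threshold;
  }
}
```

With a threshold of 0.9, a row with a correlation of 0.5 would be listed, whereas a row with 0.95 would be suppressed.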

thresholdTipText

public String thresholdTipText()
Returns the tip text for this property.

Returns:
tip text for this property suitable for displaying in the GUI or for listing the options.

preRun

protected void preRun()
Gets executed before the actual run.

Overrides:
preRun in class AbstractTool

getUseRowAttribute

protected boolean getUseRowAttribute()
Returns whether to use the row attribute or the order in the datasets for matching up the rows.

Returns:
true if the row attribute is used for matching

getRowID

protected String getRowID(int index)
Returns the ID for the row: either the row index or the actual row attribute ID for that position.

Parameters:
index - the index to get the ID for
Returns:
the ID
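
The two kinds of ID mentioned above can be illustrated as follows. This is only a sketch: the class RowIdSketch is hypothetical, the IDs are passed in as a plain string array rather than read from a dataset, and reporting the row index as a 1-based number is an assumption not confirmed by this documentation.

```java
// Hypothetical helper, not the tool's actual implementation.
final class RowIdSketch {

  // ids holds the row attribute values of the first dataset;
  // it is ignored when rows are matched by position.
  static String rowId(int index, String[] ids, boolean useRowAttribute) {
    if (useRowAttribute)
      return ids[index];
    // assumption: positional IDs are reported 1-based
    return Integer.toString(index + 1);
  }
}
```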

nextByIndex

protected weka.core.Instance[] nextByIndex(int index)
Returns the next pair by simple index.

Parameters:
index - the index of the pair to retrieve
Returns:
the row pair or null if not available
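
Row-by-row matching can be pictured as below. The sketch uses plain double arrays instead of weka.core.Instance objects, and the class NextByIndexSketch is a hypothetical stand-in; it assumes that a pair is unavailable as soon as the index runs past the end of either dataset.

```java
// Hypothetical helper; rows are modelled as double[] for simplicity.
final class NextByIndexSketch {

  // Returns {first[index], second[index]}, or null if either dataset
  // has no row at that position.
  static double[][] nextByIndex(double[][] first, double[][] second, int index) {
    if (index < 0 || index >= first.length || index >= second.length)
      return null;
    return new double[][]{first[index], second[index]};
  }
}
```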

initLookup

protected void initLookup()
Initializes the lookup table of indices for the second dataset, if necessary.
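
A lookup table of this kind could be built roughly as shown here, using the same Hashtable<String,Integer> type as the m_Lookup2 field. The class LookupSketch is hypothetical and takes the row IDs as a string array; how the tool treats duplicate IDs is not documented, so the "last occurrence wins" behaviour below is merely a consequence of this sketch.

```java
import java.util.Hashtable;

// Hypothetical helper, not the tool's actual implementation.
final class LookupSketch {

  // Maps each row ID of the second dataset to its 0-based row index.
  static Hashtable<String, Integer> buildLookup(String[] ids) {
    Hashtable<String, Integer> lookup = new Hashtable<>();
    for (int i = 0; i < ids.length; i++)
      lookup.put(ids[i], i);  // a duplicate ID overwrites the earlier index
    return lookup;
  }
}
```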


nextByRowAttribute

protected weka.core.Instance[] nextByRowAttribute(int index)
Returns the next pair by using the value of the row attribute.

Parameters:
index - the index of the pair to retrieve
Returns:
the row pair or null if not available
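
Matching by row attribute can be sketched as a lookup of the first dataset's ID in a table built for the second dataset. The class NextByRowAttributeSketch is hypothetical and returns row indices rather than weka.core.Instance objects; it assumes that a pair is unavailable whenever the ID has no counterpart in the second dataset, which is presumably the kind of row recorded in the file for missing rows.

```java
import java.util.Hashtable;

// Hypothetical helper, not the tool's actual implementation.
final class NextByRowAttributeSketch {

  // ids1: row IDs of the first dataset; lookup2: ID -> row index in the
  // second dataset. Returns {index in dataset 1, index in dataset 2},
  // or null if there is no matching row.
  static int[] nextByRowAttribute(String[] ids1, Hashtable<String, Integer> lookup2, int index) {
    if (index < 0 || index >= ids1.length)
      return null;
    Integer match = lookup2.get(ids1[index]);
    if (match == null)
      return null;  // ID not present in the second dataset
    return new int[]{index, match};
  }
}
```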

next

protected weka.core.Instance[] next(int index)
Returns the next row pair to compare.

Parameters:
index - the index of the pair to retrieve
Returns:
the row pair or null if not available

getCorrelation

protected double getCorrelation(weka.core.Instance first,
                                weka.core.Instance second)
Returns the correlation between the two rows.

Parameters:
first - the first row
second - the second row
Returns:
the correlation
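
This documentation does not state which correlation coefficient is computed. Purely as an illustration, the sketch below calculates Pearson's correlation coefficient over the numeric values of two rows, assuming the values selected by the two ranges have already been extracted into equal-length arrays. The class RowCorrelationSketch and the method pearson are hypothetical names.

```java
// Hypothetical helper; Pearson's coefficient is an assumption.
final class RowCorrelationSketch {

  // Returns Pearson's correlation of x and y, or NaN if either array
  // has zero variance.
  static double pearson(double[] x, double[] y) {
    if (x.length != y.length || x.length == 0)
      throw new IllegalArgumentException("Arrays must be non-empty and of equal length");
    int n = x.length;
    double meanX = 0.0;
    double meanY = 0.0;
    for (int i = 0; i < n; i++) {
      meanX += x[i];
      meanY += y[i];
    }
    meanX /= n;
    meanY /= n;
    double sxy = 0.0;
    double sxx = 0.0;
    double syy = 0.0;
    for (int i = 0; i < n; i++) {
      double dx = x[i] - meanX;
      double dy = y[i] - meanY;
      sxy += dx * dy;
      sxx += dx * dx;
      syy += dy * dy;
    }
    if (sxx == 0.0 || syy == 0.0)
      return Double.NaN;
    return sxy / Math.sqrt(sxx * syy);
  }
}
```

For the rows {1, 2, 3} and {2, 4, 6} this sketch returns 1.0, and for {1, 2, 3} and {3, 2, 1} it returns -1.0.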

doRun

protected void doRun()
Performs the comparison.

Specified by:
doRun in class AbstractTool

cleanUp

public void cleanUp()
Cleans up data structures, frees up memory.

Specified by:
cleanUp in interface CleanUpHandler
Overrides:
cleanUp in class AbstractTool


Copyright © 2012 University of Waikato, Hamilton, NZ. All Rights Reserved.