Package adams.tools

Class CompareDatasets

  • All Implemented Interfaces:
    adams.core.CleanUpHandler, adams.core.Destroyable, adams.core.GlobalInfoSupporter, adams.core.io.FileWriter, adams.core.logging.LoggingLevelHandler, adams.core.logging.LoggingSupporter, adams.core.option.OptionHandler, adams.core.SizeOfHandler, adams.core.Stoppable, adams.core.StoppableWithFeedback, adams.tools.OutputFileGenerator, Serializable, Comparable

    public class CompareDatasets
    extends adams.tools.AbstractTool
    implements adams.tools.OutputFileGenerator
    Compares two datasets, either row-by-row or using a row attribute listing a unique ID for matching the rows, outputting the correlation coefficient of the numeric attributes found in the ranges defined by the user.
    In order to trim down the number of generated rows, a threshold can be specified. Only rows whose correlation coefficient is below that threshold are output.

    Valid options are:

    -D <int> (property: debugLevel)
        The greater the number the more additional info the scheme may output to
        the console (0 = off).
        default: 0
        minimum: 0
     
    -dataset1 <adams.core.io.PlaceholderFile> (property: dataset1)
        The first dataset in the comparison.
        default: .
     
    -range1 <java.lang.String> (property: range1)
        The range of attributes of the first dataset.
        default: first-last
     
    -row1 <java.lang.String> (property: rowAttribute1)
        The index for the attribute used for identifying rows to compare; if not
        provided, then the comparison is performed row-by-row (first dataset).
        default:
     
    -dataset2 <adams.core.io.PlaceholderFile> (property: dataset2)
        The second dataset in the comparison.
        default: .
     
    -range2 <java.lang.String> (property: range2)
        The range of attributes of the second dataset.
        default: first-last
     
    -row2 <java.lang.String> (property: rowAttribute2)
        The index for the attribute used for identifying rows to compare; if not
        provided, then the comparison is performed row-by-row (second dataset).
        default:
     
    -output <adams.core.io.PlaceholderFile> (property: outputFile)
        The file to save the comparison result in (CSV format).
        default: output.csv
     
    -missing <adams.core.io.PlaceholderFile> (property: missing)
        The file to save the information about missing rows to (CSV format).
        default: missing.csv
     
    -threshold <double> (property: threshold)
        The threshold for the correlation coefficient; a row is only output if its
        coefficient is below that threshold; 0.0 turns the threshold off.
        default: 0.0
        minimum: 0.0
        maximum: 1.0
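
The threshold rule described for -threshold can be sketched as below. This is a minimal, self-contained illustration and not the ADAMS implementation: the class ThresholdFilterSketch and the method shouldOutput are hypothetical names, and treating "below" as a strict comparison is an assumption.

```java
class ThresholdFilterSketch {

  /**
   * Decides whether a row with the given correlation coefficient is reported.
   * Hypothetical helper; the strict "<" comparison is an assumption.
   */
  static boolean shouldOutput(double coefficient, double threshold) {
    // 0.0 turns the threshold off, so every row is reported
    if (threshold == 0.0)
      return true;
    // otherwise only coefficients below the threshold are reported
    return coefficient < threshold;
  }

  public static void main(String[] args) {
    double[] coefficients = {0.99, 0.42, 0.87};
    double threshold = 0.9;
    for (int i = 0; i < coefficients.length; i++) {
      if (shouldOutput(coefficients[i], threshold))
        System.out.println("row " + i + ": " + coefficients[i]);
    }
  }
}
```

With a threshold of 0.9, only the rows with coefficients 0.42 and 0.87 are listed; with the default of 0.0, all three rows would be.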
     
    Version:
    $Revision$
    Author:
    fracpete (fracpete at waikato dot ac dot nz)
    See Also:
    Serialized Form
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected weka.core.Instances m_Data1
      the current dataset 1.
      protected weka.core.Instances m_Data2
      the current dataset 2.
      protected adams.core.io.PlaceholderFile m_Dataset1
      the first dataset.
      protected adams.core.io.PlaceholderFile m_Dataset2
      the second dataset.
      protected int[] m_Indices1
      the indices for the first dataset.
      protected int[] m_Indices2
      the indices for the second dataset.
      protected Hashtable<String,Integer> m_Lookup2
      the lookup table of indices for the second dataset.
      protected adams.core.io.PlaceholderFile m_Missing
      the output file for missing rows (CSV format).
      protected adams.core.io.PlaceholderFile m_OutputFile
      the output file (CSV format).
      protected adams.core.Range m_Range1
      the first range of attributes.
      protected adams.core.Range m_Range2
      the second range of attributes.
      protected adams.core.Index m_RowAttribute1
      the optional attribute for matching up rows (dataset 1).
      protected adams.core.Index m_RowAttribute2
      the optional attribute for matching up rows (dataset 2).
      protected boolean m_RowAttributeIsString
      whether the row attribute is a string/nominal attribute or not.
      protected double m_Threshold
      the threshold for listing correlations.
      protected Boolean m_UseRowAttribute
      whether to use the row attribute or not.
      • Fields inherited from class adams.tools.AbstractTool

        m_Stopped
      • Fields inherited from class adams.core.option.AbstractOptionHandler

        m_OptionManager
      • Fields inherited from class adams.core.logging.LoggingObject

        m_Logger, m_LoggingIsEnabled, m_LoggingLevel
    • Constructor Summary

      Constructors 
      Constructor Description
      CompareDatasets()  
    • Method Summary

      Modifier and Type Method Description
      void cleanUp()
      Cleans up data structures, frees up memory.
      String dataset1TipText()
      Returns the tip text for this property.
      String dataset2TipText()
      Returns the tip text for this property.
      void defineOptions()
      Adds options to the internal list of options.
      protected void doRun()
      Performs the comparison.
      protected double getCorrelation​(weka.core.Instance first, weka.core.Instance second)
      Returns the correlation between the two rows.
      adams.core.io.PlaceholderFile getDataset1()
      Returns the first dataset for the comparison.
      adams.core.io.PlaceholderFile getDataset2()
      Returns the second dataset for the comparison.
      adams.core.io.PlaceholderFile getMissing()
      Returns the file to save the information about missing rows to.
      adams.core.io.PlaceholderFile getOutputFile()
      Returns the file to save the comparison result in.
      adams.core.Range getRange1()
      Returns the range of attributes of the first dataset.
      adams.core.Range getRange2()
      Returns the range of attributes of the second dataset.
      String getRowAttribute1()
      Returns the index of the attribute used for identifying rows to compare against each other (first dataset).
      String getRowAttribute2()
      Returns the index of the attribute used for identifying rows to compare against each other (second dataset).
      protected String getRowID(int index)
      Returns the ID for the row: either the row index or the actual row attribute ID for that position.
      double getThreshold()
      Returns the threshold for the correlation coefficient.
      protected boolean getUseRowAttribute()
      Returns whether to use the row attribute or the order in the datasets for matching up the rows.
      String globalInfo()
      Returns a string describing the object.
      protected void initialize()
      Initializes the members.
      protected void initLookup()
      Initializes the lookup table of indices for the second dataset, if necessary.
      String missingTipText()
      Returns the tip text for this property.
      protected weka.core.Instance[] next​(int index)
      Returns the next row pair to compare.
      protected weka.core.Instance[] nextByIndex​(int index)
      Returns the next pair by simple index.
      protected weka.core.Instance[] nextByRowAttribute​(int index)
      Returns the next pair by using the value of the row attribute.
      String outputFileTipText()
      Returns the tip text for this property.
      protected void preRun()
      Before the actual run is executed.
      String range1TipText()
      Returns the tip text for this property.
      String range2TipText()
      Returns the tip text for this property.
      String rowAttribute1TipText()
      Returns the tip text for this property.
      String rowAttribute2TipText()
      Returns the tip text for this property.
      void setDataset1​(adams.core.io.PlaceholderFile value)
      Sets the first dataset for the comparison.
      void setDataset2​(adams.core.io.PlaceholderFile value)
      Sets the second dataset for the comparison.
      void setMissing(adams.core.io.PlaceholderFile value)
      Sets the file to save the information about missing rows to.
      void setOutputFile(adams.core.io.PlaceholderFile value)
      Sets the file to save the comparison result in.
      void setRange1​(adams.core.Range value)
      Sets the range of attributes of the first dataset.
      void setRange2​(adams.core.Range value)
      Sets the range of attributes of the second dataset.
      void setRowAttribute1​(String value)
      Sets the index of the attribute used for identifying rows to compare against each other (first dataset).
      void setRowAttribute2​(String value)
      Sets the index of the attribute used for identifying rows to compare against each other (second dataset).
      void setThreshold​(double value)
      Sets the threshold for the correlation coefficient.
      String thresholdTipText()
      Returns the tip text for this property.
      • Methods inherited from class adams.tools.AbstractTool

        compareTo, destroy, equals, forCommandLine, forName, getTools, isStopped, postRun, run, runTool, stopExecution
      • Methods inherited from class adams.core.option.AbstractOptionHandler

        cleanUpOptions, finishInit, getDefaultLoggingLevel, getOptionManager, loggingLevelTipText, newOptionManager, reset, setLoggingLevel, toCommandLine, toString
      • Methods inherited from class adams.core.logging.LoggingObject

        configureLogger, getLogger, getLoggingLevel, initializeLogging, isLoggingEnabled, sizeOf
      • Methods inherited from interface adams.core.logging.LoggingLevelHandler

        getLoggingLevel
    • Field Detail

      • m_Dataset1

        protected adams.core.io.PlaceholderFile m_Dataset1
        the first dataset.
      • m_Range1

        protected adams.core.Range m_Range1
        the first range of attributes.
      • m_RowAttribute1

        protected adams.core.Index m_RowAttribute1
        the optional attribute for matching up rows (dataset 1).
      • m_Dataset2

        protected adams.core.io.PlaceholderFile m_Dataset2
        the second dataset.
      • m_Range2

        protected adams.core.Range m_Range2
        the second range of attributes.
      • m_RowAttribute2

        protected adams.core.Index m_RowAttribute2
        the optional attribute for matching up rows (dataset 2).
      • m_OutputFile

        protected adams.core.io.PlaceholderFile m_OutputFile
        the output file (CSV format).
      • m_Missing

        protected adams.core.io.PlaceholderFile m_Missing
        the output file for missing rows (CSV format).
      • m_Data1

        protected weka.core.Instances m_Data1
        the current dataset 1.
      • m_Data2

        protected weka.core.Instances m_Data2
        the current dataset 2.
      • m_UseRowAttribute

        protected Boolean m_UseRowAttribute
        whether to use the row attribute or not.
      • m_RowAttributeIsString

        protected boolean m_RowAttributeIsString
        whether the row attribute is a string/nominal attribute or not.
      • m_Indices1

        protected int[] m_Indices1
        the indices for the first dataset.
      • m_Indices2

        protected int[] m_Indices2
        the indices for the second dataset.
      • m_Lookup2

        protected Hashtable<String,Integer> m_Lookup2
        the lookup table of indices for the second dataset.
      • m_Threshold

        protected double m_Threshold
        the threshold for listing correlations.
    • Constructor Detail

      • CompareDatasets

        public CompareDatasets()
    • Method Detail

      • globalInfo

        public String globalInfo()
        Returns a string describing the object.
        Specified by:
        globalInfo in interface adams.core.GlobalInfoSupporter
        Specified by:
        globalInfo in class adams.core.option.AbstractOptionHandler
        Returns:
        a description suitable for displaying in the gui
      • defineOptions

        public void defineOptions()
        Adds options to the internal list of options.
        Specified by:
        defineOptions in interface adams.core.option.OptionHandler
        Overrides:
        defineOptions in class adams.core.option.AbstractOptionHandler
      • initialize

        protected void initialize()
        Initializes the members.
        Overrides:
        initialize in class adams.core.option.AbstractOptionHandler
      • setDataset1

        public void setDataset1​(adams.core.io.PlaceholderFile value)
        Sets the first dataset for the comparison.
        Parameters:
        value - the dataset
      • getDataset1

        public adams.core.io.PlaceholderFile getDataset1()
        Returns the first dataset for the comparison.
        Returns:
        the dataset
      • dataset1TipText

        public String dataset1TipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setDataset2

        public void setDataset2​(adams.core.io.PlaceholderFile value)
        Sets the second dataset for the comparison.
        Parameters:
        value - the dataset
      • getDataset2

        public adams.core.io.PlaceholderFile getDataset2()
        Returns the second dataset for the comparison.
        Returns:
        the dataset
      • dataset2TipText

        public String dataset2TipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setRange1

        public void setRange1​(adams.core.Range value)
        Sets the range of attributes of the first dataset.
        Parameters:
        value - the range
      • getRange1

        public adams.core.Range getRange1()
        Returns the range of attributes of the first dataset.
        Returns:
        the range
      • range1TipText

        public String range1TipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setRange2

        public void setRange2​(adams.core.Range value)
        Sets the range of attributes of the second dataset.
        Parameters:
        value - the range
      • getRange2

        public adams.core.Range getRange2()
        Returns the range of attributes of the second dataset.
        Returns:
        the range
      • range2TipText

        public String range2TipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setRowAttribute1

        public void setRowAttribute1​(String value)
        Sets the index of the attribute used for identifying rows to compare against each other (first dataset).
        Parameters:
        value - the index
      • getRowAttribute1

        public String getRowAttribute1()
        Returns the index of the attribute used for identifying rows to compare against each other (first dataset).
        Returns:
        the index
      • rowAttribute1TipText

        public String rowAttribute1TipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setRowAttribute2

        public void setRowAttribute2​(String value)
        Sets the index of the attribute used for identifying rows to compare against each other (second dataset).
        Parameters:
        value - the index
      • getRowAttribute2

        public String getRowAttribute2()
        Returns the index of the attribute used for identifying rows to compare against each other (second dataset).
        Returns:
        the index
      • rowAttribute2TipText

        public String rowAttribute2TipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setOutputFile

        public void setOutputFile(adams.core.io.PlaceholderFile value)
        Sets the file to save the comparison result in (CSV format).
        Specified by:
        setOutputFile in interface adams.core.io.FileWriter
        Parameters:
        value - the output file
      • getOutputFile

        public adams.core.io.PlaceholderFile getOutputFile()
        Returns the file to save the comparison result in (CSV format).
        Specified by:
        getOutputFile in interface adams.core.io.FileWriter
        Returns:
        the output file
      • outputFileTipText

        public String outputFileTipText()
        Returns the tip text for this property.
        Specified by:
        outputFileTipText in interface adams.core.io.FileWriter
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setMissing

        public void setMissing(adams.core.io.PlaceholderFile value)
        Sets the file to save the information about missing rows to (CSV format).
        Parameters:
        value - the file for the missing rows
      • getMissing

        public adams.core.io.PlaceholderFile getMissing()
        Returns the file to save the information about missing rows to (CSV format).
        Returns:
        the file for the missing rows
      • missingTipText

        public String missingTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setThreshold

        public void setThreshold​(double value)
        Sets the threshold for the correlation coefficient.
        Parameters:
        value - the threshold (0.0 turns it off)
      • getThreshold

        public double getThreshold()
        Returns the threshold for the correlation coefficient.
        Returns:
        the threshold (0.0 means it is turned off)
      • thresholdTipText

        public String thresholdTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • preRun

        protected void preRun()
        Before the actual run is executed.
        Overrides:
        preRun in class adams.tools.AbstractTool
      • getUseRowAttribute

        protected boolean getUseRowAttribute()
        Returns whether to use the row attribute or the order in the datasets for matching up the rows.
        Returns:
        true if the row attribute is used for matching
      • getRowID

        protected String getRowID(int index)
        Returns the ID for the row: either the row index or the actual row attribute ID for that position.
        Parameters:
        index - the index to get the ID for
        Returns:
        the ID
      • nextByIndex

        protected weka.core.Instance[] nextByIndex​(int index)
        Returns the next pair by simple index.
        Parameters:
        index - the index of the pair to retrieve
        Returns:
        the row pair or null if not available
      • initLookup

        protected void initLookup()
        Initializes the lookup table of indices for the second dataset, if necessary.
      • nextByRowAttribute

        protected weka.core.Instance[] nextByRowAttribute​(int index)
        Returns the next pair by using the value of the row attribute.
        Parameters:
        index - the index of the pair to retrieve
        Returns:
        the row pair or null if not available
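
The lookup-based matching described for initLookup and nextByRowAttribute can be sketched roughly as follows. This is an illustration under assumptions, not the actual ADAMS code: rows are reduced to their ID strings, a HashMap stands in for the Hashtable field m_Lookup2, and RowLookupSketch, buildLookup and nextPair are hypothetical names.

```java
import java.util.HashMap;
import java.util.Map;

class RowLookupSketch {

  /**
   * Maps each row ID of the second dataset to its row index.
   * If an ID occurs more than once, the last occurrence wins (an assumption).
   */
  static Map<String, Integer> buildLookup(String[] ids2) {
    Map<String, Integer> lookup = new HashMap<>();
    for (int i = 0; i < ids2.length; i++)
      lookup.put(ids2[i], i);
    return lookup;
  }

  /**
   * Returns {row index in dataset 1, row index in dataset 2} for the given
   * position in the first dataset, or null if the position is out of bounds
   * or the second dataset has no row with the same ID.
   */
  static int[] nextPair(int index, String[] ids1, Map<String, Integer> lookup) {
    if (index < 0 || index >= ids1.length)
      return null;
    Integer other = lookup.get(ids1[index]);
    return (other == null) ? null : new int[]{index, other};
  }

  public static void main(String[] args) {
    String[] ids1 = {"a", "b", "c"};
    String[] ids2 = {"c", "a", "d"};
    Map<String, Integer> lookup = buildLookup(ids2);
    for (int i = 0; i < ids1.length; i++) {
      int[] pair = nextPair(i, ids1, lookup);
      if (pair == null)
        System.out.println(ids1[i] + ": missing in second dataset");
      else
        System.out.println(ids1[i] + ": row " + pair[0] + " <-> row " + pair[1]);
    }
  }
}
```

Building the table once keeps each lookup at constant expected time. Rows that produce no pair are the kind of information the -missing output file is described as recording.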
      • next

        protected weka.core.Instance[] next​(int index)
        Returns the next row pair to compare.
        Parameters:
        index - the index of the pair to retrieve
        Returns:
        the row pair or null if not available
      • getCorrelation

        protected double getCorrelation​(weka.core.Instance first,
                                        weka.core.Instance second)
        Returns the correlation between the two rows.
        Parameters:
        first - the first row
        second - the second row
        Returns:
        the correlation
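
The page does not state which correlation measure getCorrelation uses. As a hedged illustration, the sketch below assumes the Pearson correlation coefficient and works on plain double arrays instead of weka.core.Instance objects. RowCorrelationSketch and correlation are hypothetical names.

```java
class RowCorrelationSketch {

  /**
   * Computes the Pearson correlation coefficient between two rows of numeric
   * values. Returns NaN if the rows are empty or either row is constant.
   */
  static double correlation(double[] x, double[] y) {
    if (x.length != y.length)
      throw new IllegalArgumentException("Rows must have the same number of values");
    int n = x.length;

    // means of both rows
    double meanX = 0.0;
    double meanY = 0.0;
    for (int i = 0; i < n; i++) {
      meanX += x[i];
      meanY += y[i];
    }
    meanX /= n;
    meanY /= n;

    // sums of cross products and squared deviations
    double cov = 0.0;
    double varX = 0.0;
    double varY = 0.0;
    for (int i = 0; i < n; i++) {
      double dx = x[i] - meanX;
      double dy = y[i] - meanY;
      cov += dx * dy;
      varX += dx * dx;
      varY += dy * dy;
    }
    return cov / Math.sqrt(varX * varY);
  }

  public static void main(String[] args) {
    double[] first = {1.0, 2.0, 3.0};
    double[] second = {2.0, 4.0, 6.0};
    System.out.println(correlation(first, second));
  }
}
```

In the tool itself, the values compared would be the numeric attributes selected by range1 and range2; here they are passed in directly.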
      • doRun

        protected void doRun()
        Performs the comparison.
        Specified by:
        doRun in class adams.tools.AbstractTool
      • cleanUp

        public void cleanUp()
        Cleans up data structures, frees up memory.
        Specified by:
        cleanUp in interface adams.core.CleanUpHandler
        Overrides:
        cleanUp in class adams.tools.AbstractTool