Package adams.tools

Class CompareDatasets

  • All Implemented Interfaces:
    CleanUpHandler, Destroyable, GlobalInfoSupporter, FileWriter, LoggingLevelHandler, LoggingSupporter, OptionHandler, SizeOfHandler, Stoppable, StoppableWithFeedback, OutputFileGenerator, Serializable, Comparable

    public class CompareDatasets
    extends AbstractTool
    implements OutputFileGenerator
    Compares two datasets, either row by row or by matching rows on a row attribute that holds a unique ID, and outputs the correlation coefficient of the numeric attributes found in the user-defined ranges.
    To reduce the number of generated rows, a threshold can be specified; only rows whose correlation coefficient is below that threshold are output.

    Valid options are:

    -D <int> (property: debugLevel)
        The greater the number the more additional info the scheme may output to
        the console (0 = off).
        default: 0
        minimum: 0
     
    -dataset1 <adams.core.io.PlaceholderFile> (property: dataset1)
        The first dataset in the comparison.
        default: .
     
    -range1 <java.lang.String> (property: range1)
        The range of attributes of the first dataset.
        default: first-last
     
    -row1 <java.lang.String> (property: rowAttribute1)
        The index for the attribute used for identifying rows to compare; if not
        provided, then the comparison is performed row-by-row (first dataset).
        default:
     
    -dataset2 <adams.core.io.PlaceholderFile> (property: dataset2)
        The second dataset in the comparison.
        default: .
     
    -range2 <java.lang.String> (property: range2)
        The range of attributes of the second dataset.
        default: first-last
     
    -row2 <java.lang.String> (property: rowAttribute2)
        The index for the attribute used for identifying rows to compare; if not
        provided, then the comparison is performed row-by-row (second dataset).
        default:
     
    -output <adams.core.io.PlaceholderFile> (property: outputFile)
        The file to save the comparison result in (CSV format).
        default: output.csv
     
    -missing <adams.core.io.PlaceholderFile> (property: missing)
        The file to save the information about missing rows to (CSV format).
        default: missing.csv
     
    -threshold <double> (property: threshold)
        The threshold for the correlation coefficient; a row is only output if
        its coefficient is below that threshold; 0.0 turns the threshold off.
        default: 0.0
        minimum: 0.0
        maximum: 1.0
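
The threshold rule described in the option text can be sketched as a small predicate. This is an illustration inferred from the description only, not the tool's actual code; the class and method names (ThresholdSketch, shouldOutput) are hypothetical, and how the real implementation treats undefined coefficients is not documented here.

```java
// Hypothetical sketch of the -threshold rule; not part of ADAMS.
final class ThresholdSketch {

    /**
     * Decides whether a row pair would be written to the output file.
     * Assumption: a threshold of exactly 0.0 disables filtering,
     * otherwise only coefficients strictly below the threshold pass.
     */
    static boolean shouldOutput(double correlation, double threshold) {
        if (threshold == 0.0) {
            return true; // threshold turned off: every row pair is output
        }
        return correlation < threshold;
    }

    public static void main(String[] args) {
        System.out.println(shouldOutput(0.95, 0.0));
        System.out.println(shouldOutput(0.80, 0.9));
        System.out.println(shouldOutput(0.95, 0.9));
    }
}
```

With a threshold of 0.9, for instance, only row pairs that correlate poorly (below 0.9) would remain in the result, which is what keeps the output small.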
     
    Version:
    $Revision$
    Author:
    fracpete (fracpete at waikato dot ac dot nz)
    See Also:
    Serialized Form
    • Field Detail

      • m_Range1

        protected Range m_Range1
        the first range of attributes.
      • m_RowAttribute1

        protected Index m_RowAttribute1
        the optional attribute for matching up rows (dataset 1).
      • m_Range2

        protected Range m_Range2
        the second range of attributes.
      • m_RowAttribute2

        protected Index m_RowAttribute2
        the optional attribute for matching up rows (dataset 2).
      • m_OutputFile

        protected PlaceholderFile m_OutputFile
        the output file (CSV format).
      • m_Missing

        protected PlaceholderFile m_Missing
        the output file for information about missing rows (CSV format).
      • m_Data1

        protected weka.core.Instances m_Data1
        the current dataset 1.
      • m_Data2

        protected weka.core.Instances m_Data2
        the current dataset 2.
      • m_UseRowAttribute

        protected Boolean m_UseRowAttribute
        whether to use the row attribute or not.
      • m_RowAttributeIsString

        protected boolean m_RowAttributeIsString
        whether the row attribute is a string/nominal attribute or not.
      • m_Indices1

        protected int[] m_Indices1
        the indices for the first dataset.
      • m_Indices2

        protected int[] m_Indices2
        the indices for the second dataset.
      • m_Lookup2

        protected Hashtable<String,Integer> m_Lookup2
        the lookup table of indices for the second dataset.
      • m_Threshold

        protected double m_Threshold
        the threshold for listing correlations.
    • Constructor Detail

      • CompareDatasets

        public CompareDatasets()
    • Method Detail

      • setDataset1

        public void setDataset1​(PlaceholderFile value)
        Sets the first dataset for the comparison.
        Parameters:
        value - the dataset
      • getDataset1

        public PlaceholderFile getDataset1()
        Returns the first dataset for the comparison.
        Returns:
        the dataset
      • dataset1TipText

        public String dataset1TipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setDataset2

        public void setDataset2​(PlaceholderFile value)
        Sets the second dataset for the comparison.
        Parameters:
        value - the dataset
      • getDataset2

        public PlaceholderFile getDataset2()
        Returns the second dataset for the comparison.
        Returns:
        the dataset
      • dataset2TipText

        public String dataset2TipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setRange1

        public void setRange1​(Range value)
        Sets the range of attributes of the first dataset.
        Parameters:
        value - the range
      • getRange1

        public Range getRange1()
        Returns the range of attributes of the first dataset.
        Returns:
        the range
      • range1TipText

        public String range1TipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setRange2

        public void setRange2​(Range value)
        Sets the range of attributes of the second dataset.
        Parameters:
        value - the range
      • getRange2

        public Range getRange2()
        Returns the range of attributes of the second dataset.
        Returns:
        the range
      • range2TipText

        public String range2TipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setRowAttribute1

        public void setRowAttribute1​(String value)
        Sets the index of the attribute used for identifying rows to compare against each other (first dataset).
        Parameters:
        value - the index
      • getRowAttribute1

        public String getRowAttribute1()
        Returns the index of the attribute used for identifying rows to compare against each other (first dataset).
        Returns:
        the index
      • rowAttribute1TipText

        public String rowAttribute1TipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setRowAttribute2

        public void setRowAttribute2​(String value)
        Sets the index of the attribute used for identifying rows to compare against each other (second dataset).
        Parameters:
        value - the index
      • getRowAttribute2

        public String getRowAttribute2()
        Returns the index of the attribute used for identifying rows to compare against each other (second dataset).
        Returns:
        the index
      • rowAttribute2TipText

        public String rowAttribute2TipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setOutputFile

        public void setOutputFile​(PlaceholderFile value)
        Sets the file to save the comparison result in.
        Specified by:
        setOutputFile in interface FileWriter
        Parameters:
        value - the output file
      • outputFileTipText

        public String outputFileTipText()
        Returns the tip text for this property.
        Specified by:
        outputFileTipText in interface FileWriter
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setMissing

        public void setMissing​(PlaceholderFile value)
        Sets the file to save the information about missing rows to.
        Parameters:
        value - the file
      • getMissing

        public PlaceholderFile getMissing()
        Returns the file to save the information about missing rows to.
        Returns:
        the file
      • missingTipText

        public String missingTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setThreshold

        public void setThreshold​(double value)
        Sets the threshold for the correlation coefficient.
        Parameters:
        value - the threshold (0.0 turns it off)
      • getThreshold

        public double getThreshold()
        Returns the threshold for the correlation coefficient.
        Returns:
        the threshold (0.0 means it is turned off)
      • thresholdTipText

        public String thresholdTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • preRun

        protected void preRun()
        Gets executed before the actual run.
        Overrides:
        preRun in class AbstractTool
      • getUseRowAttribute

        protected boolean getUseRowAttribute()
        Returns whether to use the row attribute or the order in the datasets for matching up the rows.
        Returns:
        true if the row attribute is used for matching
      • getRowID

        protected String getRowID​(int index)
        Returns the ID for the row: either the row index or the actual row attribute ID for that position.
        Parameters:
        index - the index to get the ID for
        Returns:
        the ID
      • nextByIndex

        protected weka.core.Instance[] nextByIndex​(int index)
        Returns the next pair by simple index.
        Parameters:
        index - the index of the pair to retrieve
        Returns:
        the row pair or null if not available
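
Position-based pairing, as described for nextByIndex, can be sketched with plain Java collections. This is a simplified stand-in, assuming rows are represented as double[] instead of weka.core.Instance; the class name IndexPairingSketch and the list parameters are hypothetical and not part of the documented signature.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch of row-by-row pairing; not part of ADAMS.
final class IndexPairingSketch {

    /**
     * Returns the rows at the same position in both datasets,
     * or null if either dataset has no row at that position.
     */
    static double[][] nextByIndex(List<double[]> data1, List<double[]> data2, int index) {
        if (index < 0 || index >= data1.size() || index >= data2.size()) {
            return null; // no pair available at this position
        }
        return new double[][] {data1.get(index), data2.get(index)};
    }

    public static void main(String[] args) {
        List<double[]> first = Arrays.asList(new double[] {1.0, 2.0}, new double[] {3.0, 4.0});
        List<double[]> second = Arrays.asList(new double[] {1.5, 2.5});
        System.out.println(nextByIndex(first, second, 0) != null);
        System.out.println(nextByIndex(first, second, 1) != null);
    }
}
```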
      • initLookup

        protected void initLookup()
        Initializes the lookup table of indices for the second dataset, if necessary.
      • nextByRowAttribute

        protected weka.core.Instance[] nextByRowAttribute​(int index)
        Returns the next pair by using the value of the row attribute.
        Parameters:
        index - the index of the pair to retrieve
        Returns:
        the row pair or null if not available
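
ID-based matching, as done by initLookup and nextByRowAttribute, can be sketched as follows. The sketch assumes the row IDs have already been extracted as strings and returns index pairs instead of weka.core.Instance pairs; the class and method names (RowLookupSketch, buildLookup, matchByID) are hypothetical. The Hashtable<String,Integer> mirrors the declared type of m_Lookup2, but the actual population logic is not shown in this documentation.

```java
import java.util.Arrays;
import java.util.Hashtable;

// Hypothetical sketch of matching rows through a lookup table; not part of ADAMS.
final class RowLookupSketch {

    /** Maps each row ID of the second dataset to its row index (IDs assumed unique and non-null). */
    static Hashtable<String, Integer> buildLookup(String[] ids2) {
        Hashtable<String, Integer> lookup = new Hashtable<>();
        for (int i = 0; i < ids2.length; i++) {
            lookup.put(ids2[i], i);
        }
        return lookup;
    }

    /**
     * Returns {index in dataset 1, index in dataset 2} for the row at the
     * given position of the first dataset, or null if the position is out
     * of bounds or the second dataset has no row with the same ID.
     */
    static int[] matchByID(String[] ids1, Hashtable<String, Integer> lookup2, int index) {
        if (index < 0 || index >= ids1.length) {
            return null;
        }
        Integer partner = lookup2.get(ids1[index]);
        if (partner == null) {
            return null; // row is missing in the second dataset
        }
        return new int[] {index, partner};
    }

    public static void main(String[] args) {
        String[] ids1 = {"a", "b", "c"};
        Hashtable<String, Integer> lookup2 = buildLookup(new String[] {"c", "a"});
        for (int i = 0; i < ids1.length; i++) {
            System.out.println(ids1[i] + " -> " + Arrays.toString(matchByID(ids1, lookup2, i)));
        }
    }
}
```

Building the table once makes each match a constant-time lookup instead of a scan over the second dataset; unmatched IDs are the kind of rows one would expect to end up in the file for missing rows.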
      • next

        protected weka.core.Instance[] next​(int index)
        Returns the next row pair to compare.
        Parameters:
        index - the index of the pair to retrieve
        Returns:
        the row pair or null if not available
      • getCorrelation

        protected double getCorrelation​(weka.core.Instance first,
                                        weka.core.Instance second)
        Returns the correlation between the two rows.
        Parameters:
        first - the first row
        second - the second row
        Returns:
        the correlation
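
The documentation does not state which correlation coefficient is computed. As a minimal sketch, assuming Pearson's correlation over the numeric values selected by the two ranges, the calculation for one row pair could look like this. The class name CorrelationSketch, the use of double[] in place of weka.core.Instance, and the choice to return NaN for rows without variance are all assumptions.

```java
// Hypothetical sketch of a Pearson correlation between two rows; not part of ADAMS.
final class CorrelationSketch {

    /**
     * Computes Pearson's correlation coefficient of two equally long rows.
     * Returns NaN if either row has zero variance (coefficient undefined).
     */
    static double pearson(double[] first, double[] second) {
        if (first.length != second.length || first.length == 0) {
            throw new IllegalArgumentException("Rows must be non-empty and of equal length");
        }
        int n = first.length;
        double mean1 = 0.0;
        double mean2 = 0.0;
        for (int i = 0; i < n; i++) {
            mean1 += first[i];
            mean2 += second[i];
        }
        mean1 /= n;
        mean2 /= n;

        double cov = 0.0;
        double var1 = 0.0;
        double var2 = 0.0;
        for (int i = 0; i < n; i++) {
            double d1 = first[i] - mean1;
            double d2 = second[i] - mean2;
            cov += d1 * d2;
            var1 += d1 * d1;
            var2 += d2 * d2;
        }
        double denominator = Math.sqrt(var1 * var2);
        return denominator == 0.0 ? Double.NaN : cov / denominator;
    }

    public static void main(String[] args) {
        System.out.println(pearson(new double[] {1, 2, 3}, new double[] {2, 4, 6}));
        System.out.println(pearson(new double[] {1, 2, 3}, new double[] {3, 2, 1}));
    }
}
```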
      • doRun

        protected void doRun()
        Performs the comparison.
        Specified by:
        doRun in class AbstractTool