Class InterquartileRangeSamp

  • All Implemented Interfaces:
    Serializable, weka.core.CapabilitiesHandler, weka.core.CapabilitiesIgnorer, weka.core.CommandlineRunnable, weka.core.OptionHandler, weka.core.RevisionHandler, weka.core.WeightedAttributesHandler

    public class InterquartileRangeSamp
    extends weka.filters.unsupervised.attribute.InterquartileRange
    A sampling filter for detecting outliers and extreme values based on interquartile ranges. The filter skips the class attribute.

    Outliers:
    Q3 + OF*IQR < x <= Q3 + EVF*IQR
    or
    Q1 - EVF*IQR <= x < Q1 - OF*IQR

    Extreme values:
    x > Q3 + EVF*IQR
    or
    x < Q1 - EVF*IQR

    Key:
    Q1 = 25% quartile
    Q3 = 75% quartile
    IQR = Interquartile Range, the difference Q3 - Q1
    OF = Outlier Factor
    EVF = Extreme Value Factor
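
The inequalities above can be sketched as a small stand-alone classifier. This is an illustration only, not the filter's actual code: the class name IqrRuleSketch, the Label enum and the classify method are hypothetical, and the quartiles are passed in rather than computed from sampled data.

```java
/** Sketch of the outlier / extreme value rules listed above (hypothetical helper, not part of the filter). */
final class IqrRuleSketch {

    enum Label { NORMAL, OUTLIER, EXTREME }

    /**
     * Classifies x from the two quartiles and the two factors.
     * of = Outlier Factor, evf = Extreme Value Factor.
     */
    static Label classify(double x, double q1, double q3, double of, double evf) {
        double iqr = q3 - q1;
        // Extreme values lie strictly beyond the EVF fences.
        if (x > q3 + evf * iqr || x < q1 - evf * iqr) {
            return Label.EXTREME;
        }
        // Outliers lie strictly beyond the OF fences, up to and including the EVF fences.
        if (x > q3 + of * iqr || x < q1 - of * iqr) {
            return Label.OUTLIER;
        }
        return Label.NORMAL;
    }

    public static void main(String[] args) {
        double q1 = 10.0, q3 = 20.0;  // IQR = 10
        double of = 3.0, evf = 6.0;   // documented defaults: OF = 3, EVF = 2 * OF
        System.out.println(classify(15.0, q1, q3, of, evf)); // inside the fences
        System.out.println(classify(55.0, q1, q3, of, evf)); // between the OF and EVF fences
        System.out.println(classify(85.0, q1, q3, of, evf)); // beyond the EVF fence
    }
}
```

The sketch returns a single label per value; the -E-as-O option described below additionally tags extreme values as outliers, which this helper does not model.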

    Valid options are:

     -sample-size <value>
      The sample size to use.
      (default: 150)
     -min-samples <value>
      The minimum number of samples that are required for calculating IQR stats.
      (default: 5)
     -ignored-attributes <value>
      The regular expression for attributes to ignore/skip.
      (default: ^.*_id$)
     -R <col1,col2-col4,...>
      Specifies the list of columns on which to base outlier/extreme value
      detection. If an instance is considered an outlier/extreme value in
      at least one of those attributes, it is tagged accordingly.
      'first' and 'last' are valid indexes.
      (default none)
     -O <num>
      The factor for outlier detection.
      (default: 3)
     -E <num>
      The factor for extreme values detection.
      (default: 2*Outlier Factor)
     -E-as-O
      Tags extreme values also as outliers.
      (default: off)
     -P
      Generates an Outlier/ExtremeValue pair for each numeric attribute in
      the range, not just a single indicator pair for all the attributes.
      (default: off)
     -M
      Generates an additional attribute 'Offset' per Outlier/ExtremeValue
      pair that contains the multiplier by which the value is offset from
      the median:
         value = median + 'multiplier' * IQR
      Note: implicitly sets '-P'.
      (default: off)
     -output-debug-info
      If set, filter is run in debug mode and
      may output additional info to the console
     -do-not-check-capabilities
      If set, filter capabilities are not checked before filter is built
      (use with caution).
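
The relation given for the -M option, value = median + 'multiplier' * IQR, can be rearranged to obtain the multiplier. The following is a minimal sketch of that arithmetic only; the class and method names are hypothetical, and rejecting a zero IQR is an assumption of this sketch, not documented behaviour of the filter.

```java
/** Sketch of the 'Offset' multiplier arithmetic for -M (hypothetical helper, not part of the filter). */
final class OffsetMultiplierSketch {

    /** Solves value = median + multiplier * iqr for the multiplier. */
    static double multiplier(double value, double median, double iqr) {
        // Assumption of this sketch: a zero IQR is rejected rather than producing NaN or Infinity.
        if (iqr == 0.0) {
            throw new IllegalArgumentException("IQR must be non-zero");
        }
        return (value - median) / iqr;
    }

    public static void main(String[] args) {
        // With median = 15 and IQR = 10, the value 45 lies three IQRs above the median.
        System.out.println(multiplier(45.0, 15.0, 10.0));
        // The value 5 lies one IQR below the median, giving a negative multiplier.
        System.out.println(multiplier(5.0, 15.0, 10.0));
    }
}
```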
    Thanks to Dale for a few brainstorming sessions.
    Version:
    $Revision$
    Author:
    Dale Fletcher (dale at cs dot waikato dot ac dot nz), fracpete (fracpete at waikato dot ac dot nz)
    See Also:
    Serialized Form
    • Nested Class Summary

      Nested Classes 
      Modifier and Type Class Description
      static class  InterquartileRangeSamp.IQRs
      Container class for the IQR values.
      • Nested classes/interfaces inherited from class weka.filters.unsupervised.attribute.InterquartileRange

        weka.filters.unsupervised.attribute.InterquartileRange.ValueType
    • Field Summary

      Fields 
      Modifier and Type Field Description
      static String IGNORED_ATTRIBUTES  
      protected Hashtable<Integer,​gnu.trove.list.array.TDoubleArrayList> m_AttValues  
      protected BaseRegExp m_IgnoredAttributes
      the regular expression for attributes to skip.
      protected Hashtable<Integer,​List<InterquartileRangeSamp.IQRs>> m_IQRs  
      protected int m_MinSamples
      the minimum number of samples.
      protected int m_SampleSize
      the sample size to use.
      static String MIN_SAMPLES  
      static String SAMPLE_SIZE  
      protected static long serialVersionUID
      for serialization
      • Fields inherited from class weka.filters.unsupervised.attribute.InterquartileRange

        m_AttributeIndices, m_Attributes, m_DetectionPerAttribute, m_ExtremeValuesAsOutliers, m_ExtremeValuesFactor, m_IQR, m_LowerExtremeValue, m_LowerOutlier, m_Median, m_OutlierAttributePosition, m_OutlierFactor, m_OutputOffsetMultiplier, m_UpperExtremeValue, m_UpperOutlier, NON_NUMERIC
      • Fields inherited from class weka.filters.Filter

        m_Debug, m_DoNotCheckCapabilities, m_FirstBatchDone, m_InputRelAtts, m_InputStringAtts, m_NewBatch, m_OutputRelAtts, m_OutputStringAtts
    • Method Summary

      Modifier and Type Method Description
      protected void addIQR​(Integer key, gnu.trove.list.array.TDoubleArrayList v)
      Calculates and adds the IQR stats for this key.
      protected void clearRemainder()  
      protected void computeThresholds​(weka.core.Instances instances)
      Computes the thresholds for outliers and extreme values.
      protected BaseRegExp getDefaultIgnoredAttributes()
      Returns the default regular expression for ignored/skipped attributes.
      protected int getDefaultMinSamples()
      Returns the default minimum number of samples.
      protected int getDefaultSampleSize()
      Returns the default sample size.
      BaseRegExp getIgnoredAttributes()
      Returns the regular expression for ignored/skipped attributes.
      int getMinSamples()
      Returns the minimum number of samples that are required for calculating IQR stats.
      String[] getOptions()
      Gets the current option settings for the OptionHandler.
      int getSampleSize()
      Returns the sample size to use.
      String globalInfo()
      Returns a string describing this filter.
      String ignoredAttributesTipText()
      Returns the tip text for this property.
      Enumeration listOptions()
      Returns an enumeration describing the available options.
      static void main​(String[] args)
      Main method for testing this class.
      String minSamplesTipText()
      Returns the tip text for this property.
      String sampleSizeTipText()
      Returns the tip text for this property.
      void setIgnoredAttributes​(BaseRegExp value)
      Sets the regular expression for ignored/skipped attributes.
      void setMinSamples​(int value)
      Sets the minimum number of samples that are required for calculating IQR stats.
      void setOptions​(String[] options)
      Sets the OptionHandler's options using the given list.
      void setSampleSize​(int value)
      Sets the sample size to use.
      protected double valueAtPct​(double[] sorted_arr, double pct)
      Calculates the value at the specified percentage.
      • Methods inherited from class weka.filters.unsupervised.attribute.InterquartileRange

        attributeIndicesTipText, calculateMultiplier, detectionPerAttributeTipText, determineOutputFormat, extremeValuesAsOutliersTipText, extremeValuesFactorTipText, getAttributeIndices, getCapabilities, getDetectionPerAttribute, getExtremeValuesAsOutliers, getExtremeValuesFactor, getOutlierFactor, getOutputOffsetMultiplier, getRevision, getValues, isExtremeValue, isExtremeValue, isOutlier, isOutlier, outlierFactorTipText, outputOffsetMultiplierTipText, process, setAttributeIndices, setAttributeIndicesArray, setDetectionPerAttribute, setExtremeValuesAsOutliers, setExtremeValuesFactor, setOutlierFactor, setOutputOffsetMultiplier
      • Methods inherited from class weka.filters.SimpleBatchFilter

        allowAccessToFullInputFormat, batchFinished, hasImmediateOutputFormat, input, input
      • Methods inherited from class weka.filters.SimpleFilter

        reset, setInputFormat
      • Methods inherited from class weka.filters.Filter

        batchFilterFile, bufferInput, copyValues, copyValues, debugTipText, doNotCheckCapabilitiesTipText, filterFile, flushInput, getCapabilities, getCopyOfInputFormat, getDebug, getDoNotCheckCapabilities, getInputFormat, getOutputFormat, initInputLocators, initOutputLocators, inputFormatPeek, isFirstBatchDone, isNewBatch, isOutputFormatDefined, makeCopies, makeCopy, mayRemoveInstanceAfterFirstBatchDone, numPendingOutput, output, outputFormatPeek, outputPeek, postExecution, preExecution, push, push, resetQueue, run, runFilter, setDebug, setDoNotCheckCapabilities, setOutputFormat, testInputFormat, toString, useFilter, wekaStaticWrapper
    • Constructor Detail

      • InterquartileRangeSamp

        public InterquartileRangeSamp()
    • Method Detail

      • globalInfo

        public String globalInfo()
        Returns a string describing this filter.
        Overrides:
        globalInfo in class weka.filters.unsupervised.attribute.InterquartileRange
        Returns:
        a description of the filter suitable for displaying in the explorer/experimenter gui
      • getDefaultSampleSize

        protected int getDefaultSampleSize()
        Returns the default sample size.
        Returns:
        the default sample size
      • setSampleSize

        public void setSampleSize​(int value)
        Sets the sample size to use.
        Parameters:
        value - the sample size
      • getSampleSize

        public int getSampleSize()
        Returns the sample size to use.
        Returns:
        the sample size
      • sampleSizeTipText

        public String sampleSizeTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the gui
      • getDefaultMinSamples

        protected int getDefaultMinSamples()
        Returns the default minimum number of samples.
        Returns:
        the default minimum number of samples
      • setMinSamples

        public void setMinSamples​(int value)
        Sets the minimum number of samples that are required for calculating IQR stats.
        Parameters:
        value - the minimum number of samples
      • getMinSamples

        public int getMinSamples()
        Returns the minimum number of samples that are required for calculating IQR stats.
        Returns:
        the minimum number of samples
      • minSamplesTipText

        public String minSamplesTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the gui
      • getDefaultIgnoredAttributes

        protected BaseRegExp getDefaultIgnoredAttributes()
        Returns the default regular expression for ignored/skipped attributes.
        Returns:
        the default regular expression
      • setIgnoredAttributes

        public void setIgnoredAttributes​(BaseRegExp value)
        Sets the regular expression for ignored/skipped attributes.
        Parameters:
        value - the regexp
      • getIgnoredAttributes

        public BaseRegExp getIgnoredAttributes()
        Returns the regular expression for ignored/skipped attributes.
        Returns:
        the regexp
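
To illustrate what the documented default expression ^.*_id$ skips, here is a minimal sketch using plain java.util.regex instead of BaseRegExp. The class name, the keep method and the attribute names are hypothetical; applying the expression to the whole attribute name is an assumption about how the filter matches.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

/** Sketch of skipping attributes by name (hypothetical helper, not part of the filter). */
final class IgnoredAttributesSketch {

    /** Returns the names that do NOT match the regular expression, in their original order. */
    static List<String> keep(List<String> names, String regex) {
        Pattern pattern = Pattern.compile(regex);
        List<String> kept = new ArrayList<>();
        for (String name : names) {
            // Assumption: the expression must match the whole attribute name.
            if (!pattern.matcher(name).matches()) {
                kept.add(name);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> names = Arrays.asList("sample_id", "width", "id_width", "batch_id");
        // Names ending in "_id" are skipped; "id_width" is kept because it does not end in "_id".
        System.out.println(keep(names, "^.*_id$"));
    }
}
```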
      • ignoredAttributesTipText

        public String ignoredAttributesTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the gui
      • listOptions

        public Enumeration listOptions()
        Returns an enumeration describing the available options.
        Specified by:
        listOptions in interface weka.core.OptionHandler
        Overrides:
        listOptions in class weka.filters.unsupervised.attribute.InterquartileRange
        Returns:
        an enumeration of all the available options.
      • setOptions

        public void setOptions​(String[] options)
                        throws Exception
        Sets the OptionHandler's options using the given list. All options will be set (or reset) during this call (i.e. incremental setting of options is not possible).
        Specified by:
        setOptions in interface weka.core.OptionHandler
        Overrides:
        setOptions in class weka.filters.unsupervised.attribute.InterquartileRange
        Parameters:
        options - the list of options as an array of strings
        Throws:
        Exception - if an option is not supported
      • getOptions

        public String[] getOptions()
        Gets the current option settings for the OptionHandler.
        Specified by:
        getOptions in interface weka.core.OptionHandler
        Overrides:
        getOptions in class weka.filters.unsupervised.attribute.InterquartileRange
        Returns:
        the list of current option settings as an array of strings
      • addIQR

        protected void addIQR​(Integer key,
                              gnu.trove.list.array.TDoubleArrayList v)
        Calculates and adds the IQR stats for this key.
        Parameters:
        key - the key for the stats
        v - the values
      • valueAtPct

        protected double valueAtPct​(double[] sorted_arr,
                                    double pct)
        Calculates the value at the specified percentage.
        Parameters:
        sorted_arr - the sorted array to use
        pct - the percent
        Returns:
        the value
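
The page does not state how valueAtPct derives the value, so the following is only one possible reading. The sketch assumes that pct is a fraction between 0 and 1 and that values between two array positions are linearly interpolated; both are assumptions, and the actual method may use a different convention.

```java
/** Sketch of a percentile lookup on a sorted array (assumed semantics, not the filter's code). */
final class ValueAtPctSketch {

    /**
     * Returns the value at the given fraction of a sorted array.
     * Assumptions: pct is in [0, 1]; neighbouring values are linearly interpolated.
     */
    static double valueAtPct(double[] sorted, double pct) {
        if (sorted.length == 0) {
            throw new IllegalArgumentException("array must not be empty");
        }
        if (pct < 0.0 || pct > 1.0) {
            throw new IllegalArgumentException("pct must be in [0, 1]");
        }
        double pos = pct * (sorted.length - 1);  // fractional index into the array
        int lo = (int) Math.floor(pos);
        int hi = (int) Math.ceil(pos);
        double frac = pos - lo;
        return sorted[lo] + frac * (sorted[hi] - sorted[lo]);
    }

    public static void main(String[] args) {
        double[] data = {1.0, 2.0, 3.0, 4.0, 5.0};
        double q1 = valueAtPct(data, 0.25);
        double q3 = valueAtPct(data, 0.75);
        System.out.println("Q1=" + q1 + " Q3=" + q3 + " IQR=" + (q3 - q1));
    }
}
```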
      • clearRemainder

        protected void clearRemainder()
      • computeThresholds

        protected void computeThresholds​(weka.core.Instances instances)
        Computes the thresholds for outliers and extreme values.
        Overrides:
        computeThresholds in class weka.filters.unsupervised.attribute.InterquartileRange
        Parameters:
        instances - the data to work on
      • main

        public static void main​(String[] args)
        Main method for testing this class.
        Parameters:
        args - should contain arguments to the filter: use -h for help