Class InterquartileRangeSamp
- java.lang.Object
-
- weka.filters.Filter
-
- weka.filters.SimpleFilter
-
- weka.filters.SimpleBatchFilter
-
- weka.filters.unsupervised.attribute.InterquartileRange
-
- weka.filters.unsupervised.attribute.InterquartileRangeSamp
-
- All Implemented Interfaces:
Serializable, weka.core.CapabilitiesHandler, weka.core.CapabilitiesIgnorer, weka.core.CommandlineRunnable, weka.core.OptionHandler, weka.core.RevisionHandler, weka.core.WeightedAttributesHandler
public class InterquartileRangeSamp extends weka.filters.unsupervised.attribute.InterquartileRange
A sampling filter for detecting outliers and extreme values based on interquartile ranges. The filter skips the class attribute.
Outliers:
Q3 + OF*IQR < x <= Q3 + EVF*IQR
or
Q1 - EVF*IQR <= x < Q1 - OF*IQR
Extreme values:
x > Q3 + EVF*IQR
or
x < Q1 - EVF*IQR
Key:
Q1 = 25% quartile
Q3 = 75% quartile
IQR = Interquartile Range, difference between Q1 and Q3
OF = Outlier Factor
EVF = Extreme Value Factor
Valid options are:
-sample-size <value> The sample size to use. (default: 150)
-min-samples <value> The minimum number of samples that are required for calculating IQR stats. (default: 5)
-ignored-attributes <value> The regular expression for attributes to ignore/skip. (default: ^.*_id$)
-R <col1,col2-col4,...> Specifies the list of columns to base outlier/extreme value detection on. If an instance is considered an outlier/extreme value in at least one of those attributes, it is tagged accordingly. 'first' and 'last' are valid indexes. (default: none)
-O <num> The factor for outlier detection. (default: 3)
-E <num> The factor for extreme values detection. (default: 2*Outlier Factor)
-E-as-O Tags extreme values also as outliers. (default: off)
-P Generates Outlier/ExtremeValue pair for each numeric attribute in the range, not just a single indicator pair for all the attributes. (default: off)
-M Generates an additional attribute 'Offset' per Outlier/ExtremeValue pair that contains the multiplier by which the value is off the median: value = median + 'multiplier' * IQR. Note: implicitly sets '-P'. (default: off)
-output-debug-info If set, filter is run in debug mode and may output additional info to the console
-do-not-check-capabilities If set, filter capabilities are not checked before filter is built (use with caution).
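The outlier and extreme-value rules above, together with the offset formula from the -M option, can be sketched in plain Java. This is a minimal illustration under stated assumptions, not the filter's actual code: the class `IqrTagSketch`, the `Tag` enum, and both method names are hypothetical, and the -E-as-O option is not modelled.

```java
/** Hypothetical sketch of the tagging rules from the class description. */
final class IqrTagSketch {

    enum Tag { NORMAL, OUTLIER, EXTREME }

    /** Classifies x; of = Outlier Factor, evf = Extreme Value Factor. */
    static Tag classify(double x, double q1, double q3, double of, double evf) {
        double iqr = q3 - q1;
        // Extreme values: x > Q3 + EVF*IQR  or  x < Q1 - EVF*IQR
        if (x > q3 + evf * iqr || x < q1 - evf * iqr) {
            return Tag.EXTREME;
        }
        // Outliers: Q3 + OF*IQR < x <= Q3 + EVF*IQR  or  Q1 - EVF*IQR <= x < Q1 - OF*IQR
        if (x > q3 + of * iqr || x < q1 - of * iqr) {
            return Tag.OUTLIER;
        }
        return Tag.NORMAL;
    }

    /** Solves value = median + multiplier * IQR for the multiplier (IQR must be non-zero). */
    static double offsetMultiplier(double value, double median, double iqr) {
        return (value - median) / iqr;
    }

    public static void main(String[] args) {
        double q1 = 10, q3 = 20;      // IQR = 10
        double of = 3, evf = 2 * of;  // documented defaults for -O and -E
        System.out.println(classify(25, q1, q3, of, evf));   // NORMAL
        System.out.println(classify(60, q1, q3, of, evf));   // OUTLIER (between 50 and 80)
        System.out.println(classify(-55, q1, q3, of, evf));  // EXTREME (below -50)
        System.out.println(offsetMultiplier(35, 15, 10));    // 2.0
    }
}
```

With the default factors, a value must lie more than 3 IQRs beyond a quartile to count as an outlier and more than 6 IQRs beyond it to count as an extreme value.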
Thanks to Dale for a few brainstorming sessions.
- Author:
- Dale Fletcher (dale at cs dot waikato dot ac dot nz), fracpete (fracpete at waikato dot ac dot nz)
- See Also:
- Serialized Form
-
-
Nested Class Summary
static class InterquartileRangeSamp.IQRs
Container class for the IQR values.
-
Field Summary
static String IGNORED_ATTRIBUTES
protected Hashtable<Integer,gnu.trove.list.array.TDoubleArrayList> m_AttValues
protected BaseRegExp m_IgnoredAttributes
the regular expression for attributes to skip.
protected Hashtable<Integer,List<InterquartileRangeSamp.IQRs>> m_IQRs
protected int m_MinSamples
the minimum number of samples.
protected int m_SampleSize
the sample size to use.
static String MIN_SAMPLES
static String SAMPLE_SIZE
-
Fields inherited from class weka.filters.unsupervised.attribute.InterquartileRange
m_AttributeIndices, m_Attributes, m_DetectionPerAttribute, m_ExtremeValuesAsOutliers, m_ExtremeValuesFactor, m_IQR, m_LowerExtremeValue, m_LowerOutlier, m_Median, m_OutlierAttributePosition, m_OutlierFactor, m_OutputOffsetMultiplier, m_UpperExtremeValue, m_UpperOutlier, NON_NUMERIC
-
-
Constructor Summary
InterquartileRangeSamp()
-
Method Summary
protected void addIQR(Integer key, gnu.trove.list.array.TDoubleArrayList v)
Calculates and adds the IQR stats for this key.
protected void clearRemainder()
protected void computeThresholds(weka.core.Instances instances)
Computes the thresholds for outliers and extreme values.
protected BaseRegExp getDefaultIgnoredAttributes()
Returns the default regular expression for ignored/skipped attributes.
protected int getDefaultMinSamples()
Returns the default minimum number of samples.
protected int getDefaultSampleSize()
Returns the default sample size.
BaseRegExp getIgnoredAttributes()
Returns the regular expression for ignored/skipped attributes.
int getMinSamples()
Returns the minimum number of samples that are required for calculating IQR stats.
String[] getOptions()
Gets the current option settings for the OptionHandler.
int getSampleSize()
Returns the sample size to use.
String globalInfo()
Returns a string describing this filter.
String ignoredAttributesTipText()
Returns the tip text for this property.
Enumeration listOptions()
Returns an enumeration describing the available options.
static void main(String[] args)
Main method for testing this class.
String minSamplesTipText()
Returns the tip text for this property.
String sampleSizeTipText()
Returns the tip text for this property.
void setIgnoredAttributes(BaseRegExp value)
Sets the regular expression for ignored/skipped attributes.
void setMinSamples(int value)
Sets the minimum number of samples that are required for calculating IQR stats.
void setOptions(String[] options)
Sets the OptionHandler's options using the given list.
void setSampleSize(int value)
Sets the sample size to use.
protected double valueAtPct(double[] sorted_arr, double pct)
Calculates the value at the specified percentage.
-
Methods inherited from class weka.filters.unsupervised.attribute.InterquartileRange
attributeIndicesTipText, calculateMultiplier, detectionPerAttributeTipText, determineOutputFormat, extremeValuesAsOutliersTipText, extremeValuesFactorTipText, getAttributeIndices, getCapabilities, getDetectionPerAttribute, getExtremeValuesAsOutliers, getExtremeValuesFactor, getOutlierFactor, getOutputOffsetMultiplier, getRevision, getValues, isExtremeValue, isExtremeValue, isOutlier, isOutlier, outlierFactorTipText, outputOffsetMultiplierTipText, process, setAttributeIndices, setAttributeIndicesArray, setDetectionPerAttribute, setExtremeValuesAsOutliers, setExtremeValuesFactor, setOutlierFactor, setOutputOffsetMultiplier
-
Methods inherited from class weka.filters.SimpleBatchFilter
allowAccessToFullInputFormat, batchFinished, hasImmediateOutputFormat, input, input
-
Methods inherited from class weka.filters.Filter
batchFilterFile, bufferInput, copyValues, copyValues, debugTipText, doNotCheckCapabilitiesTipText, filterFile, flushInput, getCapabilities, getCopyOfInputFormat, getDebug, getDoNotCheckCapabilities, getInputFormat, getOutputFormat, initInputLocators, initOutputLocators, inputFormatPeek, isFirstBatchDone, isNewBatch, isOutputFormatDefined, makeCopies, makeCopy, mayRemoveInstanceAfterFirstBatchDone, numPendingOutput, output, outputFormatPeek, outputPeek, postExecution, preExecution, push, push, resetQueue, run, runFilter, setDebug, setDoNotCheckCapabilities, setOutputFormat, testInputFormat, toString, useFilter, wekaStaticWrapper
-
-
-
-
Field Detail
-
SAMPLE_SIZE
public static final String SAMPLE_SIZE
- See Also:
- Constant Field Values
-
MIN_SAMPLES
public static final String MIN_SAMPLES
- See Also:
- Constant Field Values
-
IGNORED_ATTRIBUTES
public static final String IGNORED_ATTRIBUTES
- See Also:
- Constant Field Values
-
m_IQRs
protected Hashtable<Integer,List<InterquartileRangeSamp.IQRs>> m_IQRs
-
m_SampleSize
protected int m_SampleSize
the sample size to use.
-
m_MinSamples
protected int m_MinSamples
the minimum number of samples.
-
m_IgnoredAttributes
protected BaseRegExp m_IgnoredAttributes
the regular expression for attributes to skip.
-
-
Method Detail
-
globalInfo
public String globalInfo()
Returns a string describing this filter.
- Overrides:
globalInfo
in class weka.filters.unsupervised.attribute.InterquartileRange
- Returns:
- a description of the filter suitable for displaying in the explorer/experimenter gui
-
getDefaultSampleSize
protected int getDefaultSampleSize()
Returns the default sample size.
- Returns:
- the default
-
setSampleSize
public void setSampleSize(int value)
Sets the sample size to use.
- Parameters:
value
- the size
-
getSampleSize
public int getSampleSize()
Returns the sample size to use.
- Returns:
- the sample size
-
sampleSizeTipText
public String sampleSizeTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the gui
-
getDefaultMinSamples
protected int getDefaultMinSamples()
Returns the default minimum number of samples.
- Returns:
- the default
-
setMinSamples
public void setMinSamples(int value)
Sets the minimum number of samples that are required for calculating IQR stats.
- Parameters:
value
- the samples
-
getMinSamples
public int getMinSamples()
Returns the minimum number of samples that are required for calculating IQR stats.
- Returns:
- the samples
-
minSamplesTipText
public String minSamplesTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the gui
-
getDefaultIgnoredAttributes
protected BaseRegExp getDefaultIgnoredAttributes()
Returns the default regular expression for ignored/skipped attributes.
- Returns:
- the default
-
setIgnoredAttributes
public void setIgnoredAttributes(BaseRegExp value)
Sets the regular expression for ignored/skipped attributes.
- Parameters:
value
- the regexp
-
getIgnoredAttributes
public BaseRegExp getIgnoredAttributes()
Returns the regular expression for ignored/skipped attributes.
- Returns:
- the regexp
-
ignoredAttributesTipText
public String ignoredAttributesTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the gui
-
listOptions
public Enumeration listOptions()
Returns an enumeration describing the available options.
- Specified by:
listOptions
in interface weka.core.OptionHandler
- Overrides:
listOptions
in class weka.filters.unsupervised.attribute.InterquartileRange
- Returns:
- an enumeration of all the available options.
-
setOptions
public void setOptions(String[] options) throws Exception
Sets the OptionHandler's options using the given list. All options will be set (or reset) during this call (i.e. incremental setting of options is not possible).
- Specified by:
setOptions
in interface weka.core.OptionHandler
- Overrides:
setOptions
in class weka.filters.unsupervised.attribute.InterquartileRange
- Parameters:
options
- the list of options as an array of strings
- Throws:
Exception
- if an option is not supported
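The "set (or reset)" behaviour described for setOptions can be illustrated with a small stand-alone model. The class `OptionResetSketch` below is hypothetical and handles only two of the documented flags; it is not the filter's parsing code, merely a sketch of why options omitted from a call fall back to their defaults.

```java
/** Hypothetical, simplified model of non-incremental option setting. */
final class OptionResetSketch {
    int sampleSize = 150; // documented default of -sample-size
    int minSamples = 5;   // documented default of -min-samples

    /** Every call starts from the defaults, then applies the given flag/value pairs. */
    void setOptions(String[] options) {
        sampleSize = 150;
        minSamples = 5;
        for (int i = 0; i + 1 < options.length; i += 2) {
            switch (options[i]) {
                case "-sample-size":
                    sampleSize = Integer.parseInt(options[i + 1]);
                    break;
                case "-min-samples":
                    minSamples = Integer.parseInt(options[i + 1]);
                    break;
                default:
                    throw new IllegalArgumentException("Unsupported option: " + options[i]);
            }
        }
    }

    public static void main(String[] args) {
        OptionResetSketch f = new OptionResetSketch();
        f.setOptions(new String[] {"-sample-size", "200"});
        f.setOptions(new String[] {"-min-samples", "10"});
        // The second call did not mention -sample-size, so it is back at its default.
        System.out.println(f.sampleSize + " " + f.minSamples); // 150 10
    }
}
```

To change one option while keeping the others, pass the complete option list in a single call, for example by starting from the array returned by getOptions().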
-
getOptions
public String[] getOptions()
Gets the current option settings for the OptionHandler.
- Specified by:
getOptions
in interface weka.core.OptionHandler
- Overrides:
getOptions
in class weka.filters.unsupervised.attribute.InterquartileRange
- Returns:
- the list of current option settings as an array of strings
-
addIQR
protected void addIQR(Integer key, gnu.trove.list.array.TDoubleArrayList v)
Calculates and adds the IQR stats for this key.
- Parameters:
key
- the key for the stats
v
- the values
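How the sample size and the minimum number of samples might interact when IQR stats are collected can be sketched as follows. This is a guess based only on the option descriptions, not the filter's actual logic: the class `SampledIqrSketch`, its `Stats` container (a stand-in for InterquartileRangeSamp.IQRs), the chunking strategy, and the nearest-rank quartile lookup are all assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: one set of IQR stats per chunk of sampleSize values. */
final class SampledIqrSketch {

    /** Stand-in for a container of IQR values. */
    static final class Stats {
        final double q1, median, q3, iqr;
        Stats(double q1, double median, double q3) {
            this.q1 = q1;
            this.median = median;
            this.q3 = q3;
            this.iqr = q3 - q1;
        }
    }

    /** Nearest-rank lookup on a sorted array, pct in [0, 1] (an assumption). */
    static double at(double[] sorted, double pct) {
        int idx = (int) Math.round(pct * (sorted.length - 1));
        return sorted[idx];
    }

    /** Splits values into chunks of sampleSize; a chunk smaller than minSamples is skipped. */
    static List<Stats> collect(double[] values, int sampleSize, int minSamples) {
        List<Stats> result = new ArrayList<>();
        for (int start = 0; start < values.length; start += sampleSize) {
            int end = Math.min(start + sampleSize, values.length);
            if (end - start < minSamples) {
                break; // too few values left for meaningful stats
            }
            double[] chunk = Arrays.copyOfRange(values, start, end);
            Arrays.sort(chunk);
            result.add(new Stats(at(chunk, 0.25), at(chunk, 0.5), at(chunk, 0.75)));
        }
        return result;
    }

    public static void main(String[] args) {
        double[] values = {0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11};
        List<Stats> stats = collect(values, 5, 3);
        System.out.println(stats.size());     // 2 (the last two values are skipped)
        System.out.println(stats.get(0).iqr); // 2.0
    }
}
```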
-
valueAtPct
protected double valueAtPct(double[] sorted_arr, double pct)
Calculates the value at the specified percentage.
- Parameters:
sorted_arr
- the sorted array to use
pct
- the percent
- Returns:
- the value
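One common way to read a value at a given percentage from a sorted array is linear interpolation between the two nearest ranks. The sketch below assumes that approach and assumes pct is a fraction in [0, 1]; the documentation does not state either, so the filter's own valueAtPct may behave differently. The class name `ValueAtPctSketch` is hypothetical.

```java
/** Hypothetical percentile lookup with linear interpolation. */
final class ValueAtPctSketch {

    /** sortedArr must be ascending and non-empty; pct is assumed to lie in [0, 1]. */
    static double valueAtPct(double[] sortedArr, double pct) {
        if (sortedArr.length == 0) {
            throw new IllegalArgumentException("empty array");
        }
        double pos = pct * (sortedArr.length - 1); // fractional index
        int lower = (int) Math.floor(pos);
        int upper = (int) Math.ceil(pos);
        double frac = pos - lower;
        // Interpolate between the neighbouring elements.
        return sortedArr[lower] + frac * (sortedArr[upper] - sortedArr[lower]);
    }

    public static void main(String[] args) {
        double[] data = {1, 2, 3, 4};
        System.out.println(valueAtPct(data, 0.5));  // 2.5
        System.out.println(valueAtPct(data, 0.75)); // 3.25
    }
}
```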
-
clearRemainder
protected void clearRemainder()
-
computeThresholds
protected void computeThresholds(weka.core.Instances instances)
Computes the thresholds for outliers and extreme values.
- Overrides:
computeThresholds
in class weka.filters.unsupervised.attribute.InterquartileRange
- Parameters:
instances
- the data to work on
-
main
public static void main(String[] args)
Main method for testing this class.
- Parameters:
args
- should contain arguments to the filter: use -h for help
-
-