Package weka.core.converters
Class ExcelLoader
- java.lang.Object
-
- weka.core.converters.AbstractLoader
-
- weka.core.converters.AbstractFileLoader
-
- weka.core.converters.ExcelLoader
-
- All Implemented Interfaces:
Serializable, weka.core.converters.BatchConverter, weka.core.converters.FileSourcedConverter, weka.core.converters.Loader, weka.core.EnvironmentHandler, weka.core.OptionHandler, weka.core.RevisionHandler
public class ExcelLoader extends weka.core.converters.AbstractFileLoader implements weka.core.converters.BatchConverter, weka.core.OptionHandler
Loads MS Excel spreadsheet files.
Valid options are:
-D Enables debug output. (default: off)
-sheet-index <1-based index> The index of the worksheet to load. (default: 1)
-auto-extend-header Enables automatically extending the header. (default: off)
-text-columns <range> The range of columns to treat as text. (default: none)
-no-header If enabled, the spreadsheet is presumed to have no header row. (default: off)
-custom-column-headers <comma-separated list> The headers to use instead (comma-separated list). (default: none)
-first-row <index> The first row in the spreadsheet (starts at 1). (default: 1)
-num-rows <count> The number of rows to read, read all if <1. (default: 0)
-missing-value <regexp> The regular expression for identifying missing values. (default: ^(\?|)$)
-max-labels <int> The maximum number of labels for nominal attributes before they get converted to string. (default: 25)
- Author:
- fracpete (fracpete at waikato dot ac dot nz)
- See Also:
Loader, Serialized Form
-
-
Field Summary
Fields (Modifier and Type, Field: Description)
protected static adams.core.base.BaseRegExp DEFAULT_MISSING_VALUE
protected boolean m_AutoExtendHeader: whether to automatically extend the header if rows have more cells than header.
protected String m_CustomColumnHeaders: the comma-separated list of column header names.
protected weka.core.Instances m_Data: the actual data.
protected boolean m_Debug: whether to print some debug information
protected int m_FirstRow: the first row to retrieve (1-based).
protected int m_MaxLabels: the maximum number of labels for nominal attributes.
protected adams.core.base.BaseRegExp m_MissingValue: The placeholder for missing values.
protected boolean m_NoHeader: whether the file has a header or not.
protected int m_NumRows: the number of rows to retrieve (less than 1 = unlimited).
protected adams.core.Index m_SheetIndex: the sheet to read.
protected File m_sourceFile: Holds the source of the data set.
protected weka.core.Instances m_structure: Holds the determined structure (header) of the data set.
protected adams.core.Range m_TextColumns: the range of columns to force to be text.
-
Constructor Summary
Constructors (Constructor: Description)
ExcelLoader(): default constructor
-
Method Summary
Methods (Modifier and Type, Method: Description)
String autoExtendHeaderTipText(): The tip text for this property.
protected weka.core.Instances convert(ArrayList<weka.core.Attribute> atts, List<List<Object>> data): Converts the header/data to instances.
String customColumnHeadersTipText(): The tip text for this property.
String debugTipText(): The tip text for this property.
protected ArrayList<weka.core.Attribute> determineAttributes(List<String> header, List<List<Object>> data): Fixes the column types, if necessary.
String firstRowTipText(): The tip text for this property.
protected void fixHeader(List<String> header): Fixes the header, if necessary, by adding a dummy column name.
protected void fixRows(int numColumns, List<List<Object>> data): Fixes the number of cells in the rows, if necessary, by adding null values.
boolean getAutoExtendHeader(): Returns whether to automatically extend the header if there are more columns present.
String getCustomColumnHeaders(): Returns the custom headers to use.
weka.core.Instances getDataSet(): Returns the full data set.
boolean getDebug(): Gets whether additional debug information is printed.
String getFileDescription(): Returns a description of the file type.
String getFileExtension(): Gets the file extension used for this type of file.
String[] getFileExtensions(): Gets all the file extensions used for this type of file.
int getFirstRow(): Returns the first row in the worksheet to read.
int getMaxLabels(): Returns the maximum number of labels for nominal attributes before they get converted to string.
adams.core.base.BaseRegExp getMissingValue(): Returns the regular expression for identifying missing values.
weka.core.Instance getNextInstance(weka.core.Instances structure): ExcelLoader is unable to process a data set incrementally.
boolean getNoHeader(): Returns whether there is no header row in the worksheet.
int getNumRows(): Returns the number of rows to read.
String[] getOptions(): Gets the current settings.
String getRevision(): Returns the revision string.
adams.core.Index getSheetIndex(): Returns the index of the sheet to load.
weka.core.Instances getStructure(): Determines and returns (if possible) the structure (internally the header) of the data set as an empty set of instances.
adams.core.Range getTextColumns(): Returns the range of columns to treat as text/string.
String globalInfo(): Returns a string describing this loader.
Enumeration listOptions(): Lists the available options.
static void main(String[] args): Main method.
String maxLabelsTipText(): The tip text for this property.
String missingValueTipText(): The tip text for this property.
String noHeaderTipText(): The tip text for this property.
protected String numericToString(org.apache.poi.ss.usermodel.Cell cell): Turns a numeric cell into a string.
String numRowsTipText(): The tip text for this property.
protected weka.core.Instances readWorksheet(): Reads the worksheet.
void reset(): Resets the loader ready to read a new data set.
void setAutoExtendHeader(boolean value): Sets whether to automatically extend the header if there are more columns present.
void setCustomColumnHeaders(String value): Sets the custom headers to use.
void setDebug(boolean value): Sets whether to print some debug information.
void setFirstRow(int value): Sets the first row in the worksheet to read.
void setMaxLabels(int value): Sets the maximum number of labels for nominal attributes before they get converted to string.
void setMissingValue(adams.core.base.BaseRegExp value): Sets the regular expression for identifying missing values.
void setNoHeader(boolean value): Sets whether there is no header row in the worksheet.
void setNumRows(int value): Sets the number of rows to read.
void setOptions(String[] options): Parses a given list of options.
void setSheetIndex(adams.core.Index value): Sets the index of the sheet to load.
void setSource(File file): Resets the Loader object and sets the source of the data set to be the supplied File object.
void setTextColumns(adams.core.Range value): Sets the range of columns to treat as text/string.
String sheetIndexTipText(): The tip text for this property.
String textColumnsTipText(): The tip text for this property.
-
Methods inherited from class weka.core.converters.AbstractFileLoader
getUseRelativePath, makeOptionStr, retrieveFile, runFileLoader, setEnvironment, setFile, setUseRelativePath, useRelativePathTipText
-
-
-
-
Field Detail
-
m_structure
protected weka.core.Instances m_structure
Holds the determined structure (header) of the data set.
-
m_Data
protected weka.core.Instances m_Data
the actual data.
-
m_sourceFile
protected File m_sourceFile
Holds the source of the data set.
-
m_Debug
protected boolean m_Debug
whether to print some debug information
-
m_SheetIndex
protected adams.core.Index m_SheetIndex
the sheet to read.
-
m_AutoExtendHeader
protected boolean m_AutoExtendHeader
whether to automatically extend the header if rows have more cells than header.
-
m_TextColumns
protected adams.core.Range m_TextColumns
the range of columns to force to be text.
-
m_NoHeader
protected boolean m_NoHeader
whether the file has a header or not.
-
m_CustomColumnHeaders
protected String m_CustomColumnHeaders
the comma-separated list of column header names.
-
m_FirstRow
protected int m_FirstRow
the first row to retrieve (1-based).
-
m_NumRows
protected int m_NumRows
the number of rows to retrieve (less than 1 = unlimited).
-
DEFAULT_MISSING_VALUE
protected static final adams.core.base.BaseRegExp DEFAULT_MISSING_VALUE
-
m_MissingValue
protected adams.core.base.BaseRegExp m_MissingValue
The placeholder for missing values.
-
m_MaxLabels
protected int m_MaxLabels
the maximum number of labels for nominal attributes.
-
-
Method Detail
-
globalInfo
public String globalInfo()
Returns a string describing this loader.
- Returns:
- a description of the loader suitable for displaying in the explorer/experimenter GUI
-
listOptions
public Enumeration listOptions()
Lists the available options.
- Specified by:
listOptions
in interface weka.core.OptionHandler
- Returns:
- an enumeration of the available options
-
setOptions
public void setOptions(String[] options) throws Exception
Parses a given list of options.
- Specified by:
setOptions
in interface weka.core.OptionHandler
- Parameters:
options
- the options
- Throws:
Exception
- if options cannot be set
-
getOptions
public String[] getOptions()
Gets the current settings.
- Specified by:
getOptions
in interface weka.core.OptionHandler
- Returns:
- the current settings
-
setDebug
public void setDebug(boolean value)
Sets whether to print some debug information.
- Parameters:
value
- if true additional debug information will be printed.
-
getDebug
public boolean getDebug()
Gets whether additional debug information is printed.
- Returns:
- true if additional debug information is printed
-
debugTipText
public String debugTipText()
The tip text for this property.
- Returns:
- the tip text
-
setSheetIndex
public void setSheetIndex(adams.core.Index value)
Sets the index of the sheet to load.
- Parameters:
value
- the index
-
getSheetIndex
public adams.core.Index getSheetIndex()
Returns the index of the sheet to load.
- Returns:
- the index
-
sheetIndexTipText
public String sheetIndexTipText()
The tip text for this property.
- Returns:
- the tip text
-
setAutoExtendHeader
public void setAutoExtendHeader(boolean value)
Sets whether to automatically extend the header if there are more columns present.
- Parameters:
value
- true if to extend
-
getAutoExtendHeader
public boolean getAutoExtendHeader()
Returns whether to automatically extend the header if there are more columns present.
- Returns:
- true if the header is extended automatically
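The auto-extend behaviour described above (extending the header when rows have more cells than the header) can be sketched with plain collections. This is only an illustration, not the loader's actual code: the class name, the method name extendHeader and the placeholder naming scheme "column-N" are all hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ExtendHeaderSketch {
    // Appends placeholder names until the header is as wide as the widest row.
    // The "column-N" naming scheme is an assumption made for this sketch.
    static void extendHeader(List<String> header, List<List<Object>> data) {
        int widest = header.size();
        for (List<Object> row : data) {
            widest = Math.max(widest, row.size());
        }
        while (header.size() < widest) {
            header.add("column-" + (header.size() + 1));
        }
    }

    public static void main(String[] args) {
        List<String> header = new ArrayList<>(Arrays.asList("id", "name"));
        List<List<Object>> data = new ArrayList<>();
        data.add(new ArrayList<Object>(Arrays.asList(1, "a", 3.5)));
        extendHeader(header, data);
        System.out.println(header);
    }
}
```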
-
autoExtendHeaderTipText
public String autoExtendHeaderTipText()
The tip text for this property.
- Returns:
- the tip text
-
setTextColumns
public void setTextColumns(adams.core.Range value)
Sets the range of columns to treat as text/string.
- Parameters:
value
- the range
-
getTextColumns
public adams.core.Range getTextColumns()
Returns the range of columns to treat as text/string.
- Returns:
- the range
-
textColumnsTipText
public String textColumnsTipText()
The tip text for this property.
- Returns:
- the tip text
-
setNoHeader
public void setNoHeader(boolean value)
Sets whether there is no header row in the worksheet.
- Parameters:
value
- true if no header row
-
getNoHeader
public boolean getNoHeader()
Returns whether there is no header row in the worksheet.
- Returns:
- true if no header row
-
noHeaderTipText
public String noHeaderTipText()
The tip text for this property.
- Returns:
- the tip text
-
setCustomColumnHeaders
public void setCustomColumnHeaders(String value)
Sets the custom headers to use.
- Parameters:
value
- the headers (comma-separated list)
-
getCustomColumnHeaders
public String getCustomColumnHeaders()
Returns the custom headers to use.
- Returns:
- the headers (comma-separated list)
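Since the custom headers are passed around as a single comma-separated String, a caller typically has to split that value into individual names. A minimal sketch of doing so follows; the class and method names are hypothetical, and trimming whitespace around each name is an assumption rather than documented behaviour of the loader.

```java
import java.util.ArrayList;
import java.util.List;

final class CustomHeadersSketch {
    // Splits a comma-separated header list into individual column names.
    // An empty or null value yields no custom headers.
    static List<String> parseHeaders(String value) {
        List<String> result = new ArrayList<>();
        if (value == null || value.trim().isEmpty()) {
            return result;
        }
        for (String part : value.split(",")) {
            result.add(part.trim()); // trimming is an assumption of this sketch
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parseHeaders("id, name,score"));
    }
}
```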
-
customColumnHeadersTipText
public String customColumnHeadersTipText()
The tip text for this property.
- Returns:
- the tip text
-
setFirstRow
public void setFirstRow(int value)
Sets the first row in the worksheet to read.
- Parameters:
value
- the row (1-based)
-
getFirstRow
public int getFirstRow()
Returns the first row in the worksheet to read.
- Returns:
- the row (1-based)
-
firstRowTipText
public String firstRowTipText()
The tip text for this property.
- Returns:
- the tip text
-
setNumRows
public void setNumRows(int value)
Sets the number of rows to read.
- Parameters:
value
- the number of rows, <1 for all
-
getNumRows
public int getNumRows()
Returns the number of rows to read.
- Returns:
- the number of rows, <1 for all
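The combination of a 1-based first row and a row count where values below 1 mean "all" amounts to simple index arithmetic. The sketch below shows that arithmetic on a plain list; the class and method names are hypothetical, and how the real loader treats the header row within this window is not covered here.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RowWindowSketch {
    // Returns the rows selected by a 1-based firstRow and a numRows count,
    // where numRows < 1 means "read everything from firstRow onwards".
    static <T> List<T> window(List<T> rows, int firstRow, int numRows) {
        int from = Math.max(firstRow, 1) - 1; // 1-based to 0-based
        if (from >= rows.size()) {
            return new ArrayList<>();
        }
        long end = (long) from + numRows;     // long avoids int overflow
        int to = (numRows < 1 || end > rows.size()) ? rows.size() : (int) end;
        return new ArrayList<>(rows.subList(from, to));
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("r1", "r2", "r3", "r4");
        System.out.println(window(rows, 2, 2));
        System.out.println(window(rows, 3, 0));
    }
}
```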
-
numRowsTipText
public String numRowsTipText()
The tip text for this property.
- Returns:
- the tip text
-
setMissingValue
public void setMissingValue(adams.core.base.BaseRegExp value)
Sets the regular expression for identifying missing values.
- Parameters:
value
- the regexp
-
getMissingValue
public adams.core.base.BaseRegExp getMissingValue()
Returns the regular expression for identifying missing values.
- Returns:
- the regexp
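To see what the default missing-value expression ^(\?|)$ from the option list accepts, it can be tried with java.util.regex directly. This sketch uses the standard Pattern class rather than adams.core.base.BaseRegExp, and the class and method names are hypothetical; whether the loader applies the expression as a full match is an assumption, although with the ^ and $ anchors the distinction does not matter for the default.

```java
import java.util.regex.Pattern;

final class MissingValueSketch {
    // Default expression quoted in the option list: a lone "?" or an empty cell.
    static final Pattern DEFAULT_MISSING = Pattern.compile("^(\\?|)$");

    // Returns true if the cell text counts as a missing value.
    static boolean isMissing(String cell, Pattern missing) {
        return missing.matcher(cell).matches();
    }

    public static void main(String[] args) {
        for (String cell : new String[] {"?", "", "42", "n/a"}) {
            System.out.println("'" + cell + "' -> " + isMissing(cell, DEFAULT_MISSING));
        }
    }
}
```

A custom expression such as ^(\?|NA|)$ could be compiled the same way to treat additional placeholders as missing.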
-
missingValueTipText
public String missingValueTipText()
The tip text for this property.
- Returns:
- the tip text
-
setMaxLabels
public void setMaxLabels(int value)
Sets the maximum number of labels for nominal attributes before they get converted to string.
- Parameters:
value
- the maximum
-
getMaxLabels
public int getMaxLabels()
Returns the maximum number of labels for nominal attributes before they get converted to string.
- Returns:
- the maximum
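The maximum-labels rule can be pictured as counting the distinct values in a column and falling back to a string attribute once the count exceeds the limit. The following sketch illustrates that idea only; the class, enum and method names are hypothetical, and the exact boundary (strictly greater than the maximum) is an assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

final class MaxLabelsSketch {
    enum ColumnType { NOMINAL, STRING }

    // Chooses a column type from the number of distinct non-null labels.
    // Assumption: more than maxLabels distinct values means STRING.
    static ColumnType typeFor(List<String> cells, int maxLabels) {
        Set<String> labels = new LinkedHashSet<>();
        for (String cell : cells) {
            if (cell != null) {
                labels.add(cell);
            }
        }
        return labels.size() > maxLabels ? ColumnType.STRING : ColumnType.NOMINAL;
    }

    public static void main(String[] args) {
        System.out.println(typeFor(Arrays.asList("red", "green", "red"), 25));
    }
}
```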
-
maxLabelsTipText
public String maxLabelsTipText()
The tip text for this property.
- Returns:
- the tip text
-
getFileDescription
public String getFileDescription()
Returns a description of the file type.
- Specified by:
getFileDescription
in interface weka.core.converters.FileSourcedConverter
- Returns:
- a short file description
-
getFileExtension
public String getFileExtension()
Gets the file extension used for this type of file.
- Specified by:
getFileExtension
in interface weka.core.converters.FileSourcedConverter
- Returns:
- the file extension
-
getFileExtensions
public String[] getFileExtensions()
Gets all the file extensions used for this type of file.
- Specified by:
getFileExtensions
in interface weka.core.converters.FileSourcedConverter
- Returns:
- the file extensions
-
reset
public void reset() throws IOException
Resets the loader ready to read a new data set.
- Specified by:
reset
in interface weka.core.converters.Loader
- Overrides:
reset
in class weka.core.converters.AbstractFileLoader
- Throws:
IOException
-
setSource
public void setSource(File file) throws IOException
Resets the Loader object and sets the source of the data set to be the supplied File object.
- Specified by:
setSource
in interface weka.core.converters.Loader
- Overrides:
setSource
in class weka.core.converters.AbstractFileLoader
- Parameters:
file
- the source file.
- Throws:
IOException
- if an error occurs
-
numericToString
protected String numericToString(org.apache.poi.ss.usermodel.Cell cell)
Turns a numeric cell into a string. Tries to use "long" representation if possible.
- Parameters:
cell
- the cell to process
- Returns:
- the string representation
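The "long representation if possible" idea can be sketched on a plain double, such as the value a POI cell returns from getNumericCellValue(). This is an illustrative approximation and not the loader's code: the class name is hypothetical, and the range check used to decide when a long is safe is an assumption.

```java
final class NumericToStringSketch {
    // Formats whole numbers without a fractional part ("3" instead of "3.0")
    // and falls back to the double representation otherwise.
    static String numericToString(double value) {
        boolean whole = !Double.isNaN(value)
                && !Double.isInfinite(value)
                && value == Math.rint(value)
                && Math.abs(value) < 9.0e18; // stays inside the long range
        return whole ? Long.toString((long) value) : Double.toString(value);
    }

    public static void main(String[] args) {
        System.out.println(numericToString(3.0));
        System.out.println(numericToString(2.5));
    }
}
```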
-
fixHeader
protected void fixHeader(List<String> header)
Fixes the header, if necessary, by adding a dummy column name.
- Parameters:
header
- the header to fix
-
fixRows
protected void fixRows(int numColumns, List<List<Object>> data)
Fixes the number of cells in the rows, if necessary, by adding null values.
- Parameters:
numColumns
- the number of columns in the dataset
data
- the data to fix
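Padding short rows with null values, as described for fixRows, might look like the sketch below. It mirrors the documented signature on plain collections, but the body is an assumption about how such padding could be done rather than the class's actual implementation; the class name is hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class FixRowsSketch {
    // Appends nulls to every row that has fewer than numColumns cells.
    // Rows that are already wide enough are left untouched.
    static void fixRows(int numColumns, List<List<Object>> data) {
        for (List<Object> row : data) {
            while (row.size() < numColumns) {
                row.add(null);
            }
        }
    }

    public static void main(String[] args) {
        List<List<Object>> data = new ArrayList<>();
        data.add(new ArrayList<Object>(Arrays.asList("a", 1.0, "x")));
        data.add(new ArrayList<Object>(Arrays.asList("b")));
        fixRows(3, data);
        System.out.println(data);
    }
}
```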
-
determineAttributes
protected ArrayList<weka.core.Attribute> determineAttributes(List<String> header, List<List<Object>> data)
Fixes the column types, if necessary.
- Parameters:
header
- the column names
data
- the data to infer the types from
- Returns:
- the attributes
-
convert
protected weka.core.Instances convert(ArrayList<weka.core.Attribute> atts, List<List<Object>> data)
Converts the header/data to instances.
- Parameters:
atts
- the attributes
data
- the data
- Returns:
- the generated data
-
readWorksheet
protected weka.core.Instances readWorksheet()
Reads the worksheet.
- Returns:
- the worksheet data
-
getStructure
public weka.core.Instances getStructure() throws IOException
Determines and returns (if possible) the structure (internally the header) of the data set as an empty set of instances.
- Specified by:
getStructure
in interface weka.core.converters.Loader
- Specified by:
getStructure
in class weka.core.converters.AbstractLoader
- Returns:
- the structure of the data set as an empty set of Instances
- Throws:
IOException
- if an error occurs
-
getDataSet
public weka.core.Instances getDataSet() throws IOException
Returns the full data set. If the structure hasn't yet been determined by a call to getStructure, then this method should do so before processing the rest of the data set.
- Specified by:
getDataSet
in interface weka.core.converters.Loader
- Specified by:
getDataSet
in class weka.core.converters.AbstractLoader
- Returns:
- the full data set as a set of Instances
- Throws:
IOException
- if there is no source or parsing fails
-
getNextInstance
public weka.core.Instance getNextInstance(weka.core.Instances structure) throws IOException
ExcelLoader is unable to process a data set incrementally.
- Specified by:
getNextInstance
in interface weka.core.converters.Loader
- Specified by:
getNextInstance
in class weka.core.converters.AbstractLoader
- Parameters:
structure
- ignored
- Returns:
- never returns without throwing an exception
- Throws:
IOException
- always. ExcelLoader is unable to process a data set incrementally.
-
getRevision
public String getRevision()
Returns the revision string.
- Specified by:
getRevision
in interface weka.core.RevisionHandler
- Returns:
- the revision
-
main
public static void main(String[] args)
Main method.
- Parameters:
args
- should contain the name of an input file.
-
-