Package weka.core.converters
Class ExcelLoader
- java.lang.Object
-
- weka.core.converters.AbstractLoader
-
- weka.core.converters.AbstractFileLoader
-
- weka.core.converters.ExcelLoader
-
- All Implemented Interfaces:
Serializable, weka.core.converters.BatchConverter, weka.core.converters.FileSourcedConverter, weka.core.converters.Loader, weka.core.EnvironmentHandler, weka.core.OptionHandler, weka.core.RevisionHandler
public class ExcelLoader extends weka.core.converters.AbstractFileLoader implements weka.core.converters.BatchConverter, weka.core.OptionHandler
Loads MS Excel spreadsheet files.
Valid options are:
-D Enables debug output. (default: off)
-sheet-index <1-based index> The index of the worksheet to load. (default: 1)
-auto-extend-header Enables automatically extending the header. (default: off)
-text-columns <range> The range of columns to treat as text. (default: none)
-no-header If enabled, the spreadsheet is presumed to have no header row. (default: off)
-custom-column-headers <comma-separated list> The headers to use instead (comma-separated list). (default: none)
-first-row <index> The first row in the spreadsheet (starts at 1). (default: 1)
-num-rows <count> The number of rows to read, read all if <1. (default: 0)
-missing-value <regexp> The regular expression for identifying missing values. (default: ^(\?|)$)
-max-labels <int> The maximum number of labels for nominal attributes before they get converted to string. (default: 25)
- Author:
- fracpete (fracpete at waikato dot ac dot nz)
- See Also:
Loader, Serialized Form
-
-
Field Summary
Fields (modifier and type, field, description):
protected static BaseRegExp DEFAULT_MISSING_VALUE
protected boolean m_AutoExtendHeader - whether to automatically extend the header if rows have more cells than header.
protected String m_CustomColumnHeaders - the comma-separated list of column header names.
protected weka.core.Instances m_Data - the actual data.
protected boolean m_Debug - whether to print some debug information
protected int m_FirstRow - the first row to retrieve (1-based).
protected int m_MaxLabels - the maximum number of labels for nominal attributes.
protected BaseRegExp m_MissingValue - The placeholder for missing values.
protected boolean m_NoHeader - whether the file has a header or not.
protected int m_NumRows - the number of rows to retrieve (less than 1 = unlimited).
protected Index m_SheetIndex - the sheet to read.
protected File m_sourceFile - Holds the source of the data set.
protected weka.core.Instances m_structure - Holds the determined structure (header) of the data set.
protected Range m_TextColumns - the range of columns to force to be text.
-
Constructor Summary
Constructors (constructor, description):
ExcelLoader() - default constructor
-
Method Summary
Methods (modifier and type, method, description):
String autoExtendHeaderTipText() - The tip text for this property.
protected weka.core.Instances convert(ArrayList<weka.core.Attribute> atts, List<List<Object>> data) - Converts the header/data to instances.
String customColumnHeadersTipText() - The tip text for this property.
String debugTipText() - The tip text for this property.
protected ArrayList<weka.core.Attribute> determineAttributes(List<String> header, List<List<Object>> data) - Fixes the column types, if necessary.
String firstRowTipText() - The tip text for this property.
protected void fixHeader(List<String> header) - Fixes the header, if necessary, by adding a dummy column name.
protected void fixRows(int numColumns, List<List<Object>> data) - Fixes the number of cells in the rows, if necessary, by adding null values.
boolean getAutoExtendHeader() - Returns whether to automatically extend the header if there are more columns present.
String getCustomColumnHeaders() - Returns the custom headers to use.
weka.core.Instances getDataSet() - Returns the full data set.
boolean getDebug() - Gets whether additional debug information is printed.
String getFileDescription() - Returns a description of the file type.
String getFileExtension() - Gets the file extension used for this type of file.
String[] getFileExtensions() - Gets all the file extensions used for this type of file.
int getFirstRow() - Returns the first row in the worksheet to read.
int getMaxLabels() - Returns the maximum number of labels for nominal attributes before they get converted to string.
BaseRegExp getMissingValue() - Returns the regular expression for identifying missing values.
weka.core.Instance getNextInstance(weka.core.Instances structure) - ExcelLoader is unable to process a data set incrementally.
boolean getNoHeader() - Returns whether there is no header row in the worksheet.
int getNumRows() - Returns the number of rows to read.
String[] getOptions() - Gets the current settings.
String getRevision() - Returns the revision string.
Index getSheetIndex() - Returns the index of the sheet to load.
weka.core.Instances getStructure() - Determines and returns (if possible) the structure (internally the header) of the data set as an empty set of instances.
Range getTextColumns() - Returns the range of columns to treat as text/string.
String globalInfo() - Returns a string describing this loader.
Enumeration listOptions() - Lists the available options.
static void main(String[] args) - Main method.
String maxLabelsTipText() - The tip text for this property.
String missingValueTipText() - The tip text for this property.
String noHeaderTipText() - The tip text for this property.
protected String numericToString(org.apache.poi.ss.usermodel.Cell cell) - Turns a numeric cell into a string.
String numRowsTipText() - The tip text for this property.
protected weka.core.Instances readWorksheet() - Reads the worksheet.
void reset() - Resets the loader ready to read a new data set.
void setAutoExtendHeader(boolean value) - Sets whether to automatically extend the header if there are more columns present.
void setCustomColumnHeaders(String value) - Sets the custom headers to use.
void setDebug(boolean value) - Sets whether to print some debug information.
void setFirstRow(int value) - Sets the first row in the worksheet to read.
void setMaxLabels(int value) - Sets the maximum number of labels for nominal attributes before they get converted to string.
void setMissingValue(BaseRegExp value) - Sets the regular expression for identifying missing values.
void setNoHeader(boolean value) - Sets whether there is no header row in the worksheet.
void setNumRows(int value) - Sets the number of rows to read.
void setOptions(String[] options) - Parses a given list of options.
void setSheetIndex(Index value) - Sets the index of the sheet to load.
void setSource(File file) - Resets the Loader object and sets the source of the data set to be the supplied File object.
void setTextColumns(Range value) - Sets the range of columns to treat as text/string.
String sheetIndexTipText() - The tip text for this property.
String textColumnsTipText() - The tip text for this property.
-
Methods inherited from class weka.core.converters.AbstractFileLoader
getUseRelativePath, makeOptionStr, retrieveFile, runFileLoader, setEnvironment, setFile, setUseRelativePath, useRelativePathTipText
-
-
-
-
Field Detail
-
m_structure
protected weka.core.Instances m_structure
Holds the determined structure (header) of the data set.
-
m_Data
protected weka.core.Instances m_Data
the actual data.
-
m_sourceFile
protected File m_sourceFile
Holds the source of the data set.
-
m_Debug
protected boolean m_Debug
whether to print some debug information
-
m_SheetIndex
protected Index m_SheetIndex
the sheet to read.
-
m_AutoExtendHeader
protected boolean m_AutoExtendHeader
whether to automatically extend the header if rows have more cells than header.
-
m_TextColumns
protected Range m_TextColumns
the range of columns to force to be text.
-
m_NoHeader
protected boolean m_NoHeader
whether the file has a header or not.
-
m_CustomColumnHeaders
protected String m_CustomColumnHeaders
the comma-separated list of column header names.
-
m_FirstRow
protected int m_FirstRow
the first row to retrieve (1-based).
-
m_NumRows
protected int m_NumRows
the number of rows to retrieve (less than 1 = unlimited).
-
DEFAULT_MISSING_VALUE
protected static final BaseRegExp DEFAULT_MISSING_VALUE
-
m_MissingValue
protected BaseRegExp m_MissingValue
The placeholder for missing values.
-
m_MaxLabels
protected int m_MaxLabels
the maximum number of labels for nominal attributes.
-
-
Method Detail
-
globalInfo
public String globalInfo()
Returns a string describing this loader.
- Returns:
- a description of the loader suitable for displaying in the explorer/experimenter GUI
-
listOptions
public Enumeration listOptions()
Lists the available options.
- Specified by:
listOptions in interface weka.core.OptionHandler
- Returns:
- an enumeration of the available options
-
setOptions
public void setOptions(String[] options) throws Exception
Parses a given list of options.
- Specified by:
setOptions in interface weka.core.OptionHandler
- Parameters:
options - the options
- Throws:
Exception - if options cannot be set
-
getOptions
public String[] getOptions()
Gets the current settings.
- Specified by:
getOptions in interface weka.core.OptionHandler
- Returns:
- the current settings
-
setDebug
public void setDebug(boolean value)
Sets whether to print some debug information.
- Parameters:
value - if true, additional debug information will be printed
-
getDebug
public boolean getDebug()
Gets whether additional debug information is printed.
- Returns:
- true if additional debug information is printed
-
debugTipText
public String debugTipText()
The tip text for this property.
- Returns:
- the tip text
-
setSheetIndex
public void setSheetIndex(Index value)
Sets the index of the sheet to load.
- Parameters:
value - the index
-
getSheetIndex
public Index getSheetIndex()
Returns the index of the sheet to load.
- Returns:
- the index
-
sheetIndexTipText
public String sheetIndexTipText()
The tip text for this property.
- Returns:
- the tip text
-
setAutoExtendHeader
public void setAutoExtendHeader(boolean value)
Sets whether to automatically extend the header if there are more columns present.
- Parameters:
value - true to extend the header
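As a rough illustration of what extending the header could mean, the sketch below pads a header list until it is as wide as the widest data row. It is not the ExcelLoader implementation: the class AutoExtendHeaderSketch, the method extendHeader and the generated "column-N" names are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of ExcelLoader. */
final class AutoExtendHeaderSketch {

    /** Appends placeholder names until the header covers the widest row. */
    static void extendHeader(List<String> header, List<List<Object>> data) {
        int widest = header.size();
        for (List<Object> row : data) {
            widest = Math.max(widest, row.size());
        }
        // The "column-N" naming scheme is an assumption of this sketch.
        while (header.size() < widest) {
            header.add("column-" + (header.size() + 1));
        }
    }

    public static void main(String[] args) {
        List<String> header = new ArrayList<>(Arrays.asList("id", "name"));
        List<List<Object>> data = new ArrayList<>();
        data.add(Arrays.<Object>asList(1, "a", 3.5));
        extendHeader(header, data);
        System.out.println(header);
    }
}
```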
-
getAutoExtendHeader
public boolean getAutoExtendHeader()
Returns whether to automatically extend the header if there are more columns present.
- Returns:
- true if the header is extended automatically
-
autoExtendHeaderTipText
public String autoExtendHeaderTipText()
The tip text for this property.
- Returns:
- the tip text
-
setTextColumns
public void setTextColumns(Range value)
Sets the range of columns to treat as text/string.
- Parameters:
value - the range
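This page does not describe the syntax that the Range type accepts. The sketch below therefore assumes a simple form, comma-separated 1-based column numbers with inclusive spans such as "1-3,5", and shows how such a string could be turned into 0-based column indices. TextColumnsSketch and parse are hypothetical names, and the real Range class may accept other forms.

```java
import java.util.SortedSet;
import java.util.TreeSet;

/** Hypothetical helper for illustration; not the Range class used by ExcelLoader. */
final class TextColumnsSketch {

    /** Parses an assumed "1-3,5" style range into sorted 0-based column indices. */
    static SortedSet<Integer> parse(String range, int numColumns) {
        SortedSet<Integer> result = new TreeSet<>();
        if (range == null || range.trim().isEmpty()) {
            return result; // default "none": no column is forced to text
        }
        for (String part : range.split(",")) {
            String[] bounds = part.trim().split("-");
            int from = Integer.parseInt(bounds[0].trim());
            int to = bounds.length > 1 ? Integer.parseInt(bounds[1].trim()) : from;
            for (int col = from; col <= to; col++) {
                // Columns outside the sheet are silently skipped in this sketch.
                if (col >= 1 && col <= numColumns) {
                    result.add(col - 1);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("1-3,5", 6));
    }
}
```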
-
getTextColumns
public Range getTextColumns()
Returns the range of columns to treat as text/string.
- Returns:
- the range
-
textColumnsTipText
public String textColumnsTipText()
The tip text for this property.
- Returns:
- the tip text
-
setNoHeader
public void setNoHeader(boolean value)
Sets whether there is no header row in the worksheet.
- Parameters:
value - true if no header row
-
getNoHeader
public boolean getNoHeader()
Returns whether there is no header row in the worksheet.
- Returns:
- true if no header row
-
noHeaderTipText
public String noHeaderTipText()
The tip text for this property.
- Returns:
- the tip text
-
setCustomColumnHeaders
public void setCustomColumnHeaders(String value)
Sets the custom headers to use.
- Parameters:
value - the headers (comma-separated list)
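The custom headers are passed as a single comma-separated string. A minimal sketch of splitting such a string into individual names is given below. CustomHeadersSketch and parse are hypothetical names, and trimming whitespace around each name is an assumption; the page does not say how ExcelLoader treats surrounding spaces.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not part of ExcelLoader. */
final class CustomHeadersSketch {

    /** Splits a comma-separated header list; an empty input yields no headers. */
    static List<String> parse(String value) {
        List<String> result = new ArrayList<>();
        if (value == null || value.trim().isEmpty()) {
            return result; // default "none"
        }
        for (String part : value.split(",")) {
            result.add(part.trim()); // trimming is an assumption of this sketch
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("id, name,score"));
    }
}
```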
-
getCustomColumnHeaders
public String getCustomColumnHeaders()
Returns the custom headers to use.
- Returns:
- the headers (comma-separated list)
-
customColumnHeadersTipText
public String customColumnHeadersTipText()
The tip text for this property.
- Returns:
- the tip text
-
setFirstRow
public void setFirstRow(int value)
Sets the first row in the worksheet to read.
- Parameters:
value - the row (1-based)
-
getFirstRow
public int getFirstRow()
Returns the first row in the worksheet to read.
- Returns:
- the row (1-based)
-
firstRowTipText
public String firstRowTipText()
The tip text for this property.
- Returns:
- the tip text
-
setNumRows
public void setNumRows(int value)
Sets the number of rows to read.
- Parameters:
value - the number of rows, <1 for all
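The first-row (1-based) and number-of-rows (<1 for all) settings together describe a window over the worksheet rows. The sketch below applies that window to a plain list. RowWindowSketch and select are hypothetical names, and whether ExcelLoader counts the header row as part of this window is not stated on this page, so the sketch makes no claim about it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of ExcelLoader. */
final class RowWindowSketch {

    /** Selects rows starting at the 1-based firstRow; numRows < 1 means all remaining rows. */
    static <T> List<T> select(List<T> rows, int firstRow, int numRows) {
        int start = Math.max(firstRow, 1) - 1; // convert 1-based to 0-based
        if (start >= rows.size()) {
            return new ArrayList<>();
        }
        // Use long arithmetic so a very large numRows cannot overflow.
        long limit = numRows < 1 ? rows.size() : (long) start + numRows;
        int end = (int) Math.min((long) rows.size(), limit);
        return new ArrayList<>(rows.subList(start, end));
    }

    public static void main(String[] args) {
        List<String> rows = Arrays.asList("r1", "r2", "r3", "r4");
        System.out.println(select(rows, 2, 2)); // rows 2 and 3
        System.out.println(select(rows, 3, 0)); // row 3 to the end
    }
}
```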
-
getNumRows
public int getNumRows()
Returns the number of rows to read.
- Returns:
- the number of rows, <1 for all
-
numRowsTipText
public String numRowsTipText()
The tip text for this property.
- Returns:
- the tip text
-
setMissingValue
public void setMissingValue(BaseRegExp value)
Sets the regular expression for identifying missing values.
- Parameters:
value - the regexp
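The default expression listed for -missing-value, ^(\?|)$, matches a cell that is either a lone question mark or empty. The sketch below checks a few strings against that pattern with java.util.regex. Using Pattern in place of BaseRegExp, treating a null cell as missing, and the names MissingValueSketch and isMissing are assumptions of this example.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not part of ExcelLoader. */
final class MissingValueSketch {

    // Default expression as listed for the -missing-value option: ^(\?|)$
    static final Pattern DEFAULT_MISSING = Pattern.compile("^(\\?|)$");

    /** Returns true if the cell text matches the missing-value expression. */
    static boolean isMissing(String cellText, Pattern missing) {
        // Treating null as missing is an assumption of this sketch.
        return cellText == null || missing.matcher(cellText).matches();
    }

    public static void main(String[] args) {
        for (String s : new String[] {"?", "", "??", "n/a"}) {
            System.out.println("'" + s + "' -> " + isMissing(s, DEFAULT_MISSING));
        }
    }
}
```

A custom expression such as `^(\?|NA|)$` could be passed in the same way to treat additional placeholders as missing.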
-
getMissingValue
public BaseRegExp getMissingValue()
Returns the regular expression for identifying missing values.
- Returns:
- the regexp
-
missingValueTipText
public String missingValueTipText()
The tip text for this property.
- Returns:
- the tip text
-
setMaxLabels
public void setMaxLabels(int value)
Sets the maximum number of labels for nominal attributes before they get converted to string.
- Parameters:
value - the maximum
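To illustrate the effect of this limit, the sketch below counts the distinct labels in a text column and reports whether the column would stay nominal or become a string attribute. MaxLabelsSketch and textColumnType are hypothetical names, and the exact boundary (strictly more than the maximum triggers the conversion) is an assumption; the page only says that the conversion happens once the maximum is passed.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

/** Hypothetical helper for illustration; not part of ExcelLoader. */
final class MaxLabelsSketch {

    /** Returns "nominal" or "string" depending on the number of distinct labels. */
    static String textColumnType(List<String> values, int maxLabels) {
        Set<String> labels = new LinkedHashSet<>();
        for (String v : values) {
            if (v != null) {
                labels.add(v); // null cells do not contribute a label
            }
        }
        // Assumption: strictly more labels than the limit switches to string.
        return labels.size() > maxLabels ? "string" : "nominal";
    }

    public static void main(String[] args) {
        System.out.println(textColumnType(Arrays.asList("red", "green", "red"), 25));
    }
}
```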
-
getMaxLabels
public int getMaxLabels()
Returns the maximum number of labels for nominal attributes before they get converted to string.
- Returns:
- the maximum
-
maxLabelsTipText
public String maxLabelsTipText()
The tip text for this property.
- Returns:
- the tip text
-
getFileDescription
public String getFileDescription()
Returns a description of the file type.
- Specified by:
getFileDescription in interface weka.core.converters.FileSourcedConverter
- Returns:
- a short file description
-
getFileExtension
public String getFileExtension()
Gets the file extension used for this type of file.
- Specified by:
getFileExtension in interface weka.core.converters.FileSourcedConverter
- Returns:
- the file extension
-
getFileExtensions
public String[] getFileExtensions()
Gets all the file extensions used for this type of file.
- Specified by:
getFileExtensions in interface weka.core.converters.FileSourcedConverter
- Returns:
- the file extensions
-
reset
public void reset() throws IOException
Resets the loader ready to read a new data set.
- Specified by:
reset in interface weka.core.converters.Loader
- Overrides:
reset in class weka.core.converters.AbstractFileLoader
- Throws:
IOException
-
setSource
public void setSource(File file) throws IOException
Resets the Loader object and sets the source of the data set to be the supplied File object.
- Specified by:
setSource in interface weka.core.converters.Loader
- Overrides:
setSource in class weka.core.converters.AbstractFileLoader
- Parameters:
file - the source file
- Throws:
IOException - if an error occurs
-
numericToString
protected String numericToString(org.apache.poi.ss.usermodel.Cell cell)
Turns a numeric cell into a string. Tries to use "long" representation if possible.
- Parameters:
cell - the cell to process
- Returns:
- the string representation
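One way to read "tries to use long representation if possible" is that whole numbers are printed without a fractional part. The sketch below shows that idea on a plain double rather than an Apache POI Cell, so that it stays self-contained. NumericToStringSketch and its range check are assumptions, not the ExcelLoader code.

```java
/** Hypothetical helper for illustration; not part of ExcelLoader. */
final class NumericToStringSketch {

    /** Formats whole numbers as long values and everything else as double. */
    static String numericToString(double value) {
        boolean finite = !Double.isNaN(value) && !Double.isInfinite(value);
        // Only whole numbers that safely fit into a long are printed as long.
        if (finite && value == Math.rint(value) && Math.abs(value) < 9.0e18) {
            return Long.toString((long) value);
        }
        return Double.toString(value);
    }

    public static void main(String[] args) {
        System.out.println(numericToString(42.0)); // prints "42"
        System.out.println(numericToString(2.5));  // prints "2.5"
    }
}
```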
-
fixHeader
protected void fixHeader(List<String> header)
Fixes the header, if necessary, by adding a dummy column name.
- Parameters:
header - the header to fix
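The page does not say which header cells count as needing a fix or what the dummy name looks like. Assuming that empty or missing header cells are the ones replaced, a minimal sketch could look as follows; FixHeaderSketch and the "column-N" naming are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not the ExcelLoader implementation. */
final class FixHeaderSketch {

    /** Replaces null or blank header cells in place with a placeholder name. */
    static void fixHeader(List<String> header) {
        for (int i = 0; i < header.size(); i++) {
            String name = header.get(i);
            if (name == null || name.trim().isEmpty()) {
                // The naming scheme is an assumption of this sketch.
                header.set(i, "column-" + (i + 1));
            }
        }
    }

    public static void main(String[] args) {
        List<String> header = new ArrayList<>(Arrays.asList("id", "", "score"));
        fixHeader(header);
        System.out.println(header);
    }
}
```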
-
fixRows
protected void fixRows(int numColumns, List<List<Object>> data)
Fixes the number of cells in the rows, if necessary, by adding null values.
- Parameters:
numColumns - the number of columns in the dataset
data - the data to fix
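Padding short rows with null values, as described above, can be sketched in a few lines. FixRowsSketch is a hypothetical stand-in that mirrors the documented signature; it assumes the rows are mutable lists and leaves rows that are already wide enough untouched.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper for illustration; not the ExcelLoader implementation. */
final class FixRowsSketch {

    /** Appends null cells to every row that has fewer than numColumns cells. */
    static void fixRows(int numColumns, List<List<Object>> data) {
        for (List<Object> row : data) {
            while (row.size() < numColumns) {
                row.add(null);
            }
        }
    }

    public static void main(String[] args) {
        List<List<Object>> data = new ArrayList<>();
        List<Object> shortRow = new ArrayList<>();
        shortRow.add("a");
        data.add(shortRow);
        fixRows(3, data);
        System.out.println(data); // the row now has three cells, two of them null
    }
}
```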
-
determineAttributes
protected ArrayList<weka.core.Attribute> determineAttributes(List<String> header, List<List<Object>> data)
Fixes the column types, if necessary.
- Parameters:
header - the column names
data - the data to infer the types from
- Returns:
- the attributes
-
convert
protected weka.core.Instances convert(ArrayList<weka.core.Attribute> atts, List<List<Object>> data)
Converts the header/data to instances.
- Parameters:
atts - the attributes
data - the data
- Returns:
- the generated data
-
readWorksheet
protected weka.core.Instances readWorksheet()
Reads the worksheet.
- Returns:
- the worksheet data
-
getStructure
public weka.core.Instances getStructure() throws IOException
Determines and returns (if possible) the structure (internally the header) of the data set as an empty set of instances.
- Specified by:
getStructure in interface weka.core.converters.Loader
- Specified by:
getStructure in class weka.core.converters.AbstractLoader
- Returns:
- the structure of the data set as an empty set of Instances
- Throws:
IOException - if an error occurs
-
getDataSet
public weka.core.Instances getDataSet() throws IOException
Returns the full data set. If the structure hasn't yet been determined by a call to getStructure, then this method should do so before processing the rest of the data set.
- Specified by:
getDataSet in interface weka.core.converters.Loader
- Specified by:
getDataSet in class weka.core.converters.AbstractLoader
- Returns:
- the full data set as a set of Instances
- Throws:
IOException - if there is no source or parsing fails
-
getNextInstance
public weka.core.Instance getNextInstance(weka.core.Instances structure) throws IOException
ExcelLoader is unable to process a data set incrementally.
- Specified by:
getNextInstance in interface weka.core.converters.Loader
- Specified by:
getNextInstance in class weka.core.converters.AbstractLoader
- Parameters:
structure - ignored
- Returns:
- never returns without throwing an exception
- Throws:
IOException - always. ExcelLoader is unable to process a data set incrementally.
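Because incremental reading is not supported, a caller can expect this method to fail on every call and should use getDataSet for batch loading instead. The stand-in below only mimics that contract so the behaviour can be shown without the Weka classes. BatchOnlySketch, its Object-typed signature and the exception message are assumptions.

```java
import java.io.IOException;

/** Hypothetical stand-in that mimics a batch-only loader; not ExcelLoader itself. */
final class BatchOnlySketch {

    /** Always throws, mirroring the documented contract of getNextInstance. */
    static Object getNextInstance(Object structure) throws IOException {
        throw new IOException("Unable to process a data set incrementally.");
    }

    public static void main(String[] args) {
        try {
            getNextInstance(null);
        } catch (IOException e) {
            // A caller would fall back to batch loading at this point.
            System.out.println("Incremental mode not available: " + e.getMessage());
        }
    }
}
```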
-
getRevision
public String getRevision()
Returns the revision string.
- Specified by:
getRevision in interface weka.core.RevisionHandler
- Returns:
- the revision
-
main
public static void main(String[] args)
Main method.
- Parameters:
args - should contain the name of an input file.
-
-