Class ExcelLoader

  • All Implemented Interfaces:
    Serializable, weka.core.converters.BatchConverter, weka.core.converters.FileSourcedConverter, weka.core.converters.Loader, weka.core.EnvironmentHandler, weka.core.OptionHandler, weka.core.RevisionHandler

    public class ExcelLoader
    extends weka.core.converters.AbstractFileLoader
    implements weka.core.converters.BatchConverter, weka.core.OptionHandler
    Loads MS Excel spreadsheet files.

    Valid options are:

     -D
      Enables debug output.
      (default: off)
     -sheet-index <1-based index>
      The index of the worksheet to load. (default: 1)
     -auto-extend-header
      Enables automatically extending the header.
      (default: off)
     -text-columns <range>
      The range of columns to treat as text. (default: none)
     -no-header
      If enabled, the spreadsheet is presumed to have no header row.
      (default: off)
     -custom-column-headers <comma-separated list>
      The headers to use instead (comma-separated list). (default: none)
     -first-row <index>
      The first row in the spreadsheet (starts at 1). (default: 1)
     -num-rows <count>
      The number of rows to read, read all if <1. (default: 0)
     -missing-value <regexp>
      The regular expression for identifying missing values. (default: ^(\?|)$)
     -max-labels <int>
      The maximum number of labels for nominal attributes before they get converted to string. (default: 25)
    Author:
    fracpete (fracpete at waikato dot ac dot nz)
    See Also:
    Loader, Serialized Form
    • Nested Class Summary

      • Nested classes/interfaces inherited from interface weka.core.converters.Loader

        weka.core.converters.Loader.StructureNotReadyException
    • Field Summary

      Fields 
      Modifier and Type Field Description
      protected static adams.core.base.BaseRegExp DEFAULT_MISSING_VALUE  
      protected boolean m_AutoExtendHeader
      whether to automatically extend the header if rows have more cells than header.
      protected String m_CustomColumnHeaders
      the comma-separated list of column header names.
      protected weka.core.Instances m_Data
      the actual data.
      protected boolean m_Debug
      whether to print some debug information
      protected int m_FirstRow
      the first row to retrieve (1-based).
      protected int m_MaxLabels
      the maximum number of labels for nominal attributes.
      protected adams.core.base.BaseRegExp m_MissingValue
      the regular expression for identifying missing values.
      protected boolean m_NoHeader
      whether the worksheet has no header row.
      protected int m_NumRows
      the number of rows to retrieve (less than 1 = unlimited).
      protected adams.core.Index m_SheetIndex
      the sheet to read.
      protected File m_sourceFile
      Holds the source of the data set.
      protected weka.core.Instances m_structure
      Holds the determined structure (header) of the data set.
      protected adams.core.Range m_TextColumns
      the range of columns to force to be text.
      • Fields inherited from class weka.core.converters.AbstractFileLoader

        FILE_EXTENSION_COMPRESSED, m_env, m_File, m_useRelativePath
      • Fields inherited from class weka.core.converters.AbstractLoader

        m_retrieval
      • Fields inherited from interface weka.core.converters.Loader

        BATCH, INCREMENTAL, NONE
    • Constructor Summary

      Constructors 
      Constructor Description
      ExcelLoader()
      default constructor
    • Field Detail

      • m_structure

        protected weka.core.Instances m_structure
        Holds the determined structure (header) of the data set.
      • m_Data

        protected weka.core.Instances m_Data
        the actual data.
      • m_sourceFile

        protected File m_sourceFile
        Holds the source of the data set.
      • m_Debug

        protected boolean m_Debug
        whether to print some debug information
      • m_SheetIndex

        protected adams.core.Index m_SheetIndex
        the sheet to read.
      • m_AutoExtendHeader

        protected boolean m_AutoExtendHeader
        whether to automatically extend the header if rows have more cells than header.
      • m_TextColumns

        protected adams.core.Range m_TextColumns
        the range of columns to force to be text.
      • m_NoHeader

        protected boolean m_NoHeader
        whether the worksheet has no header row.
      • m_CustomColumnHeaders

        protected String m_CustomColumnHeaders
        the comma-separated list of column header names.
      • m_FirstRow

        protected int m_FirstRow
        the first row to retrieve (1-based).
      • m_NumRows

        protected int m_NumRows
        the number of rows to retrieve (less than 1 = unlimited).
      • DEFAULT_MISSING_VALUE

        protected static final adams.core.base.BaseRegExp DEFAULT_MISSING_VALUE
      • m_MissingValue

        protected adams.core.base.BaseRegExp m_MissingValue
        the regular expression for identifying missing values.
      • m_MaxLabels

        protected int m_MaxLabels
        the maximum number of labels for nominal attributes.
    • Constructor Detail

      • ExcelLoader

        public ExcelLoader()
        default constructor
    • Method Detail

      • globalInfo

        public String globalInfo()
        Returns a string describing this loader
        Returns:
        a description of the loader suitable for displaying in the explorer/experimenter GUI
      • listOptions

        public Enumeration listOptions()
        Lists the available options
        Specified by:
        listOptions in interface weka.core.OptionHandler
        Returns:
        an enumeration of the available options
      • setOptions

        public void setOptions​(String[] options)
                        throws Exception
        Parses a given list of options.
        Specified by:
        setOptions in interface weka.core.OptionHandler
        Parameters:
        options - the options
        Throws:
        Exception - if options cannot be set
      • getOptions

        public String[] getOptions()
        Gets the current settings.
        Specified by:
        getOptions in interface weka.core.OptionHandler
        Returns:
        the current settings
      • setDebug

        public void setDebug​(boolean value)
        Sets whether to print some debug information.
        Parameters:
        value - if true additional debug information will be printed.
      • getDebug

        public boolean getDebug()
        Gets whether additional debug information is printed.
        Returns:
        true if additional debug information is printed
      • debugTipText

        public String debugTipText()
        The tip text for this property.
        Returns:
        the tip text
      • setSheetIndex

        public void setSheetIndex​(adams.core.Index value)
        Sets the index of the sheet to load.
        Parameters:
        value - the index
      • getSheetIndex

        public adams.core.Index getSheetIndex()
        Returns the index of the sheet to load.
        Returns:
        the index
      • sheetIndexTipText

        public String sheetIndexTipText()
        The tip text for this property.
        Returns:
        the tip text
      • setAutoExtendHeader

        public void setAutoExtendHeader​(boolean value)
        Sets whether to automatically extend the header if there are more columns present.
        Parameters:
        value - true if to extend
      • getAutoExtendHeader

        public boolean getAutoExtendHeader()
        Returns whether to automatically extend the header if there are more columns present.
        Returns:
        true if the header gets extended automatically
      • autoExtendHeaderTipText

        public String autoExtendHeaderTipText()
        The tip text for this property.
        Returns:
        the tip text
      • setTextColumns

        public void setTextColumns​(adams.core.Range value)
        Sets the range of columns to treat as text/string.
        Parameters:
        value - the range
      • getTextColumns

        public adams.core.Range getTextColumns()
        Returns the range of columns to treat as text/string.
        Returns:
        the range
      • textColumnsTipText

        public String textColumnsTipText()
        The tip text for this property.
        Returns:
        the tip text
      • setNoHeader

        public void setNoHeader​(boolean value)
        Sets whether there is no header row in the worksheet.
        Parameters:
        value - true if no header row
      • getNoHeader

        public boolean getNoHeader()
        Returns whether there is no header row in the worksheet.
        Returns:
        true if no header row
      • noHeaderTipText

        public String noHeaderTipText()
        The tip text for this property.
        Returns:
        the tip text
      • setCustomColumnHeaders

        public void setCustomColumnHeaders​(String value)
        Sets the custom headers to use.
        Parameters:
        value - the headers (comma-separated list)
      • getCustomColumnHeaders

        public String getCustomColumnHeaders()
        Returns the custom headers to use.
        Returns:
        the headers (comma-separated list)
      • customColumnHeadersTipText

        public String customColumnHeadersTipText()
        The tip text for this property.
        Returns:
        the tip text
      • setFirstRow

        public void setFirstRow​(int value)
        Sets the first row in the worksheet to read.
        Parameters:
        value - the row (1-based)
      • getFirstRow

        public int getFirstRow()
        Returns the first row in the worksheet to read.
        Returns:
        the row (1-based)
      • firstRowTipText

        public String firstRowTipText()
        The tip text for this property.
        Returns:
        the tip text
      • setNumRows

        public void setNumRows​(int value)
        Sets the number of rows to read.
        Parameters:
        value - the number of rows, <1 for all
      • getNumRows

        public int getNumRows()
        Returns the number of rows to read.
        Returns:
        the number of rows, <1 for all
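The first-row and num-rows settings together select a window of rows. The following is a minimal sketch of that selection, assuming a 1-based first row and that any count below 1 means "read everything". The names RowWindowSketch and rowWindow are hypothetical and not part of ExcelLoader, and the sketch ignores any header handling.

```java
class RowWindowSketch {

    /**
     * Hypothetical helper: returns {from, to} as a zero-based, end-exclusive
     * range of rows for a sheet that holds totalRows rows (totalRows >= 0).
     */
    static int[] rowWindow(int firstRow, int numRows, int totalRows) {
        // firstRow is 1-based; clamp it into [0, totalRows] as a zero-based index
        int from = Math.min(Math.max(firstRow, 1) - 1, totalRows);
        // numRows < 1 (or more rows than are left) means: read to the end
        int to = (numRows < 1 || numRows > totalRows - from) ? totalRows : from + numRows;
        return new int[]{from, to};
    }

    public static void main(String[] args) {
        int[] all = rowWindow(1, 0, 10);   // defaults: first row 1, all rows
        int[] part = rowWindow(3, 4, 10);  // four rows, starting at row 3
        System.out.println(all[0] + ".." + all[1]);
        System.out.println(part[0] + ".." + part[1]);
    }
}
```

With the documented defaults (first row 1, number of rows 0) the window covers the whole sheet.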
      • numRowsTipText

        public String numRowsTipText()
        The tip text for this property.
        Returns:
        the tip text
      • setMissingValue

        public void setMissingValue​(adams.core.base.BaseRegExp value)
        Sets the regular expression for identifying missing values.
        Parameters:
        value - the regexp
      • getMissingValue

        public adams.core.base.BaseRegExp getMissingValue()
        Returns the regular expression for identifying missing values.
        Returns:
        the regexp
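To illustrate what the default expression ^(\?|)$ accepts, here is a small sketch using java.util.regex directly. It assumes the expression is matched against the whole cell content; the class MissingValueSketch and the method isMissing are hypothetical and only stand in for whatever check the loader performs internally.

```java
import java.util.regex.Pattern;

class MissingValueSketch {

    // the default expression as listed for the -missing-value option
    static final Pattern DEFAULT_MISSING = Pattern.compile("^(\\?|)$");

    /** Hypothetical check: does the cell content count as a missing value? */
    static boolean isMissing(String cell) {
        return DEFAULT_MISSING.matcher(cell).matches();
    }

    public static void main(String[] args) {
        System.out.println(isMissing("?"));   // a single question mark
        System.out.println(isMissing(""));    // an empty cell
        System.out.println(isMissing("NA"));  // any other text
    }
}
```

Under this assumption, only a lone question mark or an empty string is treated as missing; a value such as "NA" would need a custom expression.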
      • missingValueTipText

        public String missingValueTipText()
        The tip text for this property.
        Returns:
        the tip text
      • setMaxLabels

        public void setMaxLabels​(int value)
        Sets the maximum number of labels for nominal attributes before they get converted to string.
        Parameters:
        value - the maximum
      • getMaxLabels

        public int getMaxLabels()
        Returns the maximum number of labels for nominal attributes before they get converted to string.
        Returns:
        the maximum
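The effect of the maximum number of labels can be pictured as a simple count of distinct values per column. The sketch below is an assumption about that rule, not the loader's code: it treats a column as nominal while its distinct values do not exceed the maximum, and as string otherwise. MaxLabelsSketch and staysNominal are hypothetical names, and missing values are assumed to have been removed beforehand.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MaxLabelsSketch {

    /**
     * Hypothetical rule: a column stays nominal as long as the number of
     * distinct values does not exceed maxLabels.
     */
    static boolean staysNominal(List<String> values, int maxLabels) {
        Set<String> labels = new HashSet<>(values);
        return labels.size() <= maxLabels;
    }

    public static void main(String[] args) {
        List<String> column = Arrays.asList("red", "green", "red", "blue");
        System.out.println(staysNominal(column, 25)); // three labels, default maximum
        System.out.println(staysNominal(column, 2));  // three labels, maximum of two
    }
}
```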
      • maxLabelsTipText

        public String maxLabelsTipText()
        The tip text for this property.
        Returns:
        the tip text
      • getFileDescription

        public String getFileDescription()
        Returns a description of the file type.
        Specified by:
        getFileDescription in interface weka.core.converters.FileSourcedConverter
        Returns:
        a short file description
      • getFileExtension

        public String getFileExtension()
        Get the file extension used for this type of file
        Specified by:
        getFileExtension in interface weka.core.converters.FileSourcedConverter
        Returns:
        the file extension
      • getFileExtensions

        public String[] getFileExtensions()
        Gets all the file extensions used for this type of file
        Specified by:
        getFileExtensions in interface weka.core.converters.FileSourcedConverter
        Returns:
        the file extensions
      • reset

        public void reset()
                   throws IOException
        Resets the loader ready to read a new data set
        Specified by:
        reset in interface weka.core.converters.Loader
        Overrides:
        reset in class weka.core.converters.AbstractFileLoader
        Throws:
        IOException
      • setSource

        public void setSource​(File file)
                       throws IOException
        Resets the Loader object and sets the source of the data set to be the supplied File object.
        Specified by:
        setSource in interface weka.core.converters.Loader
        Overrides:
        setSource in class weka.core.converters.AbstractFileLoader
        Parameters:
        file - the source file.
        Throws:
        IOException - if an error occurs
      • numericToString

        protected String numericToString​(org.apache.poi.ss.usermodel.Cell cell)
        Turns a numeric cell into a string. Tries to use "long" representation if possible.
        Parameters:
        cell - the cell to process
        Returns:
        the string representation
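The idea of preferring a "long" representation can be sketched without Apache POI. The version below takes a plain double instead of a Cell, which is an assumption made to keep the example self-contained; NumericToStringSketch is a hypothetical class and the exact formatting used by the loader may differ.

```java
class NumericToStringSketch {

    /** Hypothetical stand-in that works on a double rather than a POI Cell. */
    static String numericToString(double value) {
        // whole numbers that fit into a long are printed without a fraction
        if (value == Math.rint(value) && Math.abs(value) < Long.MAX_VALUE)
            return Long.toString((long) value);
        // everything else (fractions, NaN, infinities, huge values) keeps the double form
        return Double.toString(value);
    }

    public static void main(String[] args) {
        System.out.println(numericToString(3.0));
        System.out.println(numericToString(2.5));
    }
}
```

This avoids values such as 3.0 showing up in text columns when the spreadsheet cell visibly contains 3.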
      • fixHeader

        protected void fixHeader​(List<String> header)
        Fixes the header, if necessary, by adding a dummy column name.
        Parameters:
        header - the header to fix
      • fixRows

        protected void fixRows​(int numColumns,
                               List<List<Object>> data)
        Fixes the number of cells in the rows, if necessary, by adding null values.
        Parameters:
        numColumns - the number of columns in the dataset
        data - the data to fix
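Padding short rows with null values, as described for fixRows, might look like the following. This is a minimal sketch under the assumption that rows are mutable lists and that rows longer than numColumns are left untouched; FixRowsSketch is a hypothetical class, not the loader's implementation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class FixRowsSketch {

    /** Appends null cells to every row that has fewer than numColumns cells. */
    static void fixRows(int numColumns, List<List<Object>> data) {
        for (List<Object> row : data) {
            while (row.size() < numColumns)
                row.add(null);
        }
    }

    public static void main(String[] args) {
        List<List<Object>> data = new ArrayList<>();
        data.add(new ArrayList<Object>(Arrays.<Object>asList("a", 1.0, "x")));
        data.add(new ArrayList<Object>(Arrays.<Object>asList("b")));
        fixRows(3, data);
        System.out.println(data); // prints [[a, 1.0, x], [b, null, null]]
    }
}
```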
      • determineAttributes

        protected ArrayList<weka.core.Attribute> determineAttributes​(List<String> header,
                                                                     List<List<Object>> data)
        Fixes the column types, if necessary.
        Parameters:
        header - the column names
        data - the data to infer the types from
        Returns:
        the attributes
      • convert

        protected weka.core.Instances convert​(ArrayList<weka.core.Attribute> atts,
                                              List<List<Object>> data)
        Converts the header/data to instances.
        Parameters:
        atts - the attributes
        data - the data
        Returns:
        the generated data
      • readWorksheet

        protected weka.core.Instances readWorksheet()
        Reads the worksheet.
        Returns:
        the worksheet data
      • getStructure

        public weka.core.Instances getStructure()
                                         throws IOException
        Determines and returns (if possible) the structure (internally the header) of the data set as an empty set of instances.
        Specified by:
        getStructure in interface weka.core.converters.Loader
        Specified by:
        getStructure in class weka.core.converters.AbstractLoader
        Returns:
        the structure of the data set as an empty set of Instances
        Throws:
        IOException - if an error occurs
      • getDataSet

        public weka.core.Instances getDataSet()
                                       throws IOException
        Returns the full data set. If the structure has not yet been determined by a call to getStructure, this method does so before processing the rest of the data set.
        Specified by:
        getDataSet in interface weka.core.converters.Loader
        Specified by:
        getDataSet in class weka.core.converters.AbstractLoader
        Returns:
        the full data set as a set of Instances
        Throws:
        IOException - if there is no source or parsing fails
      • getNextInstance

        public weka.core.Instance getNextInstance​(weka.core.Instances structure)
                                           throws IOException
        ExcelLoader is unable to process a data set incrementally.
        Specified by:
        getNextInstance in interface weka.core.converters.Loader
        Specified by:
        getNextInstance in class weka.core.converters.AbstractLoader
        Parameters:
        structure - ignored
        Returns:
        never returns without throwing an exception
        Throws:
        IOException - always. ExcelLoader is unable to process a data set incrementally.
      • getRevision

        public String getRevision()
        Returns the revision string.
        Specified by:
        getRevision in interface weka.core.RevisionHandler
        Returns:
        the revision
      • main

        public static void main​(String[] args)
        Main method.
        Parameters:
        args - should contain the name of an input file.