Class TesseractOCR

  • All Implemented Interfaces:
    AdditionalInformationHandler, CleanUpHandler, Destroyable, GlobalInfoSupporter, LoggingLevelHandler, LoggingSupporter, OptionHandler, QuickInfoSupporter, ShallowCopySupporter<Actor>, SizeOfHandler, Stoppable, StoppableWithFeedback, VariablesInspectionHandler, VariableChangeListener, Actor, ErrorHandler, InputConsumer, OutputProducer, Serializable, Comparable

    public class TesseractOCR
    extends AbstractTransformer
    Applies OCR to the incoming image file using Tesseract.
    If OCR succeeds, the actor forwards either the names of the generated files or, if outputText is enabled, their combined text.
    NB: The actor deletes all files that share the prefix of the specified output base. Keep this in mind when running OCR in parallel or when generating other files with the same prefix.
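
The note above implies a cleanup step that removes every file whose name starts with the file name of the output base. The following is a minimal sketch of such a prefix-based cleanup using plain java.nio. The class and method names are hypothetical, and this is not the actor's actual implementation; it only illustrates why an unrelated file with the same prefix would disappear as well.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch; not the actor's actual cleanup code. */
final class OutputBaseCleanupSketch {

  /**
   * Deletes all regular files in the directory of the output base whose
   * name starts with the file name of the output base.
   * Returns the number of deleted files.
   */
  static int deleteByPrefix(Path outputBase) {
    Path dir = outputBase.toAbsolutePath().getParent();
    String prefix = outputBase.getFileName().toString();
    int deleted = 0;
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
      for (Path file : stream) {
        if (Files.isRegularFile(file) && file.getFileName().toString().startsWith(prefix)) {
          Files.delete(file);
          deleted++;
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return deleted;
  }

  /** Builds a scratch directory, runs the cleanup and returns the sorted names of the surviving files. */
  static List<String> demo() {
    try {
      Path dir = Files.createTempDirectory("ocr-sketch");
      Files.createFile(dir.resolve("outputbase.txt"));
      Files.createFile(dir.resolve("outputbase-page2.txt")); // unrelated file, same prefix
      Files.createFile(dir.resolve("scan.png"));             // different prefix, survives
      deleteByPrefix(dir.resolve("outputbase"));
      List<String> left = new ArrayList<>();
      try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
        for (Path file : stream) {
          left.add(file.getFileName().toString());
        }
      }
      Collections.sort(left);
      return left;
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(demo());
  }
}
```

When several OCR steps run in parallel, one way to avoid collisions under this assumption is to give each of them its own output base, for example by appending a unique suffix per worker.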

    For more information on tesseract see:
    https://github.com/tesseract-ocr/tesseract


    For more information on hOCR see:
    https://en.wikipedia.org/wiki/HOCR

    Input/output:
    - accepts:
       java.lang.String
       java.io.File
       adams.data.image.AbstractImage
    - generates:
       java.lang.String[]


    -logging-level <OFF|SEVERE|WARNING|INFO|CONFIG|FINE|FINER|FINEST> (property: loggingLevel)
        The logging level for outputting errors and debugging output.
        default: WARNING
     
    -name <java.lang.String> (property: name)
        The name of the actor.
        default: TesseractOCR
     
    -annotation <adams.core.base.BaseAnnotation> (property: annotations)
        The annotations to attach to this actor.
        default: 
     
    -skip <boolean> (property: skip)
        If set to true, transformation is skipped and the input token is just forwarded 
        as it is.
        default: false
     
    -stop-flow-on-error <boolean> (property: stopFlowOnError)
        If set to true, the flow gets stopped in case this actor encounters an error;
         useful for critical actors.
        default: false
     
    -language <ALBANIAN|ARABIC|AZERBAUIJANI|BULGARIAN|CATALAN|CHEROKEE|CROATION|CZECH|DANISH|DANISH_FRAKTUR|DUTCH|ENGLISH|ESPERANTO|ESTONIAN|FINNISH|FRENCH|GALICIAN|GERMAN|GREEK|HEBREW|HINDI|HUNGARIAN|INDONESIAN|ITALIAN|JAPANESE|KOREAN|LATVIAN|LITHUANIAN|NORWEGIAN|OLD_ENGLISH|OLD_FRENCH|POLISH|PORTUGUESE|ROMANIAN|RUSSIAN|SERBIAN|SIMPLIFIED_CHINESE|SLOVAKIAN|SLOVENIAN|SPANISH|SWEDISH|TAGALOG|TAMIL|TELUGU|THAI|TRADITIONAL_CHINESE|TURKISH|UKRAINIAN|VIETNAMESE> (property: language)
        The language to use for OCR (must be installed).
        default: ENGLISH
     
    -page-segmentation <OSD_ONLY|AUTO_WITH_OSD|AUTO_NO_OSD|FULL_AUTO_NO_OSD|SINGLE_COLUMN|SINGLE_VERTICAL_BLOCK|SINGLE_BLOCK|SINGLE_LINE|SINGLE_WORD|SINGLE_WORD_CIRCLE|SINGLE_CHARACTER> (property: pageSegmentation)
        The page segmentation to use.
        default: FULL_AUTO_NO_OSD
     
    -output-base <adams.core.io.PlaceholderFile> (property: outputBase)
        The base name for the generated file(s).
        default: ${TMP}/outputbase
     
    -output-text <boolean> (property: outputText)
        If enabled, the combined text of all generated files is output rather than 
        the file names.
        default: false
     
    -separator <java.lang.String> (property: separator)
        The separator placed between the contents of two files if text rather than 
        the file names is forwarded; special characters like \n and \t can be 
        used as well.
        default: 
     
    -output-hocr <boolean> (property: outputHOCR)
        If enabled, HTML files using the hOCR format are generated rather than ASCII 
        files.
        default: false
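
The options above correspond roughly to arguments of the tesseract command-line tool. The sketch below shows how such a command line might be assembled. It assumes the tesseract 4+ syntax `tesseract <image> <outputbase> -l <lang> --psm <n> [hocr]` (3.x releases used `-psm`), assumes that the page segmentation values map to tesseract modes 0 to 10 in the order listed, and takes the language as a tesseract language code such as `eng`. The class, enum and method names are hypothetical stand-ins, not ADAMS types, and the actor's real command construction may differ.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch; the enum below stands in for the actor's option values. */
final class TesseractCommandSketch {

  /** Page segmentation values in option order; ordinal() is assumed to equal the tesseract mode number. */
  enum PageSegmentation {
    OSD_ONLY, AUTO_WITH_OSD, AUTO_NO_OSD, FULL_AUTO_NO_OSD, SINGLE_COLUMN,
    SINGLE_VERTICAL_BLOCK, SINGLE_BLOCK, SINGLE_LINE, SINGLE_WORD,
    SINGLE_WORD_CIRCLE, SINGLE_CHARACTER
  }

  /** Assembles the argument list for a single tesseract invocation. */
  static List<String> buildCommand(String image, String outputBase, String langCode,
                                   PageSegmentation psm, boolean hocr) {
    List<String> cmd = new ArrayList<>();
    cmd.add("tesseract");
    cmd.add(image);
    cmd.add(outputBase);              // tesseract appends the file extension itself
    cmd.add("-l");
    cmd.add(langCode);                // e.g. "eng"; the language data must be installed
    cmd.add("--psm");
    cmd.add(Integer.toString(psm.ordinal()));
    if (hocr) {
      cmd.add("hocr");                // config name that switches the output to hOCR
    }
    return cmd;
  }

  public static void main(String[] args) {
    System.out.println(String.join(" ",
        buildCommand("scan.png", "/tmp/outputbase", "eng", PageSegmentation.FULL_AUTO_NO_OSD, true)));
  }
}
```

The resulting list can be handed to `java.lang.ProcessBuilder`, which accepts a `List<String>` as the command.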
     
    Author:
    fracpete (fracpete at waikato dot ac dot nz)
    See Also:
    Serialized Form
    • Field Detail

      • m_OutputText

        protected boolean m_OutputText
        whether to output the OCRed text instead of the files.
      • m_Separator

        protected String m_Separator
        the separator between multiple text files.
      • m_OutputHOCR

        protected boolean m_OutputHOCR
        whether to output hOCR instead of ASCII.
      • m_ProcessOutput

        protected transient com.github.fracpete.processoutput4j.output.CollectingProcessOutput m_ProcessOutput
        for executing tesseract.
    • Constructor Detail

      • TesseractOCR

        public TesseractOCR()
    • Method Detail

      • setLanguage

        public void setLanguage(TesseractLanguage value)
        Sets the language to use (needs to be installed).
        Parameters:
        value - the language
      • getLanguage

        public TesseractLanguage getLanguage()
        Returns the language to use.
        Returns:
        the language
      • languageTipText

        public String languageTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setPageSegmentation

        public void setPageSegmentation(TesseractPageSegmentation value)
        Sets the page segmentation to use.
        Parameters:
        value - the page segmentation
      • getPageSegmentation

        public TesseractPageSegmentation getPageSegmentation()
        Returns the page segmentation to use.
        Returns:
        the page segmentation
      • pageSegmentationTipText

        public String pageSegmentationTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setOutputBase

        public void setOutputBase(PlaceholderFile value)
        Sets the base name for the generated file(s).
        Parameters:
        value - the base name
      • getOutputBase

        public PlaceholderFile getOutputBase()
        Returns the base name for the generated file(s).
        Returns:
        the base name
      • outputBaseTipText

        public String outputBaseTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setOutputText

        public void setOutputText(boolean value)
        Sets whether to output the combined content of all files rather than the file names.
        Parameters:
        value - true if the text is to be output
      • getOutputText

        public boolean getOutputText()
        Returns whether to output the combined content of all files rather than the file names.
        Returns:
        true if text is output
      • outputTextTipText

        public String outputTextTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • setSeparator

        public void setSeparator(String value)
        Sets the separator between text files, in case text is being output rather than file names.
        Parameters:
        value - the backquoted separator
      • getSeparator

        public String getSeparator()
        Returns the separator between text files, in case text is being output rather than file names.
        Returns:
        the backquoted separator
      • separatorTipText

        public String separatorTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
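
The separator property takes an escaped ("backquoted") string in which sequences such as \n and \t stand for the actual control characters. The following sketch shows one way such a string could be unescaped and then used to join the text of several files. The class and method names are hypothetical, and only \n, \t and \\ are handled here; the actor's own unescaping may support more sequences.

```java
import java.util.List;

/** Hypothetical sketch of unescaping a separator and joining file contents with it. */
final class SeparatorSketch {

  /** Converts the two-character sequences \n, \t and \\ into the characters they denote. */
  static String unescape(String s) {
    StringBuilder out = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '\\' && i + 1 < s.length()) {
        char next = s.charAt(++i);
        switch (next) {
          case 'n':  out.append('\n'); break;
          case 't':  out.append('\t'); break;
          case '\\': out.append('\\'); break;
          default:   out.append(c).append(next); // unknown sequence: keep as-is
        }
      } else {
        out.append(c);
      }
    }
    return out.toString();
  }

  /** Joins the texts using the unescaped separator. */
  static String combine(List<String> texts, String escapedSeparator) {
    return String.join(unescape(escapedSeparator), texts);
  }

  public static void main(String[] args) {
    System.out.println(combine(List.of("page one", "page two"), "\\n"));
  }
}
```

With the default (empty) separator, the contents are simply concatenated.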
      • setOutputHOCR

        public void setOutputHOCR(boolean value)
        Sets whether to use hOCR format as output instead of ASCII.
        Parameters:
        value - true if hOCR is to be output
      • getOutputHOCR

        public boolean getOutputHOCR()
        Returns whether to use hOCR format as output instead of ASCII.
        Returns:
        true if hOCR is to be output
      • outputHOCRTipText

        public String outputHOCRTipText()
        Returns the tip text for this property.
        Returns:
        tip text for this property suitable for displaying in the GUI or for listing the options.
      • accepts

        public Class[] accepts()
        Returns the classes that the consumer accepts.
        Returns:
        the classes of objects that can be processed
      • generates

        public Class[] generates()
        Returns the classes of objects that it generates.
        Returns:
        the classes of the generated tokens
      • doExecute

        protected String doExecute()
        Executes the flow item.
        Specified by:
        doExecute in class AbstractActor
        Returns:
        null if everything is fine, otherwise error message
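
The return convention of doExecute (null on success, an error message otherwise) can be illustrated with a small stand-alone sketch. The `Step` interface and `runAll` method are hypothetical and exist only for this example; they are not part of the ADAMS API.

```java
/** Hypothetical sketch of the "null means success" convention. */
final class ExecuteResultSketch {

  /** A unit of work that returns null on success or an error message on failure. */
  interface Step {
    String execute();
  }

  /** Runs the steps in order and stops at the first one that reports an error. */
  static String runAll(Step... steps) {
    for (Step step : steps) {
      String error = step.execute();
      if (error != null) {
        return error; // propagate the first error message
      }
    }
    return null; // everything went fine
  }

  public static void main(String[] args) {
    String result = runAll(() -> null, () -> "OCR failed", () -> null);
    System.out.println(result == null ? "OK" : "Error: " + result);
  }
}
```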