Package adams.flow.transformer
Class TesseractOCR
-
- All Implemented Interfaces:
AdditionalInformationHandler, CleanUpHandler, Destroyable, GlobalInfoSupporter, LoggingLevelHandler, LoggingSupporter, OptionHandler, QuickInfoSupporter, ShallowCopySupporter<Actor>, SizeOfHandler, Stoppable, StoppableWithFeedback, VariablesInspectionHandler, VariableChangeListener, Actor, ErrorHandler, InputConsumer, OutputProducer, Serializable, Comparable
public class TesseractOCR extends AbstractTransformer
Applies OCR to the incoming image file using Tesseract.
In case of successful OCR, either the file names of the generated files are broadcast or the combined text of the files.
NB: The actor deletes all files that have the same prefix as the specified output base. Keep this in mind when running OCR in parallel or when generating other files with the same prefix.
For more information on tesseract see:
https://github.com/tesseract-ocr/tesseract
For more information on hOCR see:
https://en.wikipedia.org/wiki/HOCR
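The note above about prefix-based deletion suggests giving every parallel OCR job its own output base. The following is a minimal sketch using only the Java standard library; the class `OutputBaseSketch` and its methods are hypothetical and not part of ADAMS. It assumes that "same prefix" means a file in the same directory whose name starts with the file name of the output base; the actor's actual matching rule is not spelled out on this page.

```java
import java.io.IOException;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;

/** Hypothetical helper (not part of ADAMS) for choosing collision-free output bases. */
final class OutputBaseSketch {

  /** Returns an output base inside dir that no other job is likely to share. */
  static Path uniqueOutputBase(Path dir, String jobName) {
    return dir.resolve(jobName + "-" + UUID.randomUUID());
  }

  /**
   * Assumed matching rule: the candidate lives in the same directory as the
   * output base and its file name starts with the base's file name.
   */
  static boolean sharesPrefix(Path outputBase, Path candidate) {
    Path baseDir = outputBase.toAbsolutePath().getParent();
    Path candDir = candidate.toAbsolutePath().getParent();
    return baseDir != null
      && baseDir.equals(candDir)
      && candidate.getFileName().toString().startsWith(outputBase.getFileName().toString());
  }

  /** Lists the regular files that would be affected under the assumed rule. */
  static List<Path> filesSharingPrefix(Path outputBase) throws IOException {
    List<Path> result = new ArrayList<>();
    Path dir = outputBase.toAbsolutePath().getParent();
    try (DirectoryStream<Path> stream = Files.newDirectoryStream(dir)) {
      for (Path p : stream) {
        if (Files.isRegularFile(p) && sharesPrefix(outputBase, p))
          result.add(p);
      }
    }
    return result;
  }

  public static void main(String[] args) throws IOException {
    Path dir = Files.createTempDirectory("ocr");
    Files.createFile(dir.resolve("outputbase.txt"));
    Files.createFile(dir.resolve("outputbase-notes.txt")); // unrelated file, same prefix
    Files.createFile(dir.resolve("other.txt"));
    // Both "outputbase*" files are listed; "other.txt" is not.
    System.out.println(filesSharingPrefix(dir.resolve("outputbase")));
    // A per-job base avoids clashes between parallel runs.
    System.out.println(uniqueOutputBase(dir, "page1"));
  }
}
```

Checking a planned output base with `filesSharingPrefix` before starting a flow is one way to spot unrelated files that happen to share the prefix.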
Input/output:
- accepts:
java.lang.String
java.io.File
adams.data.image.AbstractImage
- generates:
java.lang.String[]
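Since the actor accepts more than one input type, a caller or a custom wrapper typically has to normalize the token before handing a path to an external tool. The sketch below shows one way to do that for the `java.lang.String` and `java.io.File` cases only. The class `InputDispatchSketch` is hypothetical, and the `adams.data.image.AbstractImage` case is left out because this page does not describe how the actor handles it.

```java
import java.io.File;

/** Hypothetical helper (not part of ADAMS) that normalizes String/File input to a File. */
final class InputDispatchSketch {

  /** Converts a supported input object into a File, or throws for anything else. */
  static File toFile(Object input) {
    if (input instanceof File)
      return (File) input;
    if (input instanceof String)
      return new File((String) input);
    String type = (input == null) ? "null" : input.getClass().getName();
    throw new IllegalArgumentException("Unsupported input type: " + type);
  }

  public static void main(String[] args) {
    System.out.println(toFile("scan.png").getName());           // prints scan.png
    System.out.println(toFile(new File("page2.png")).getName()); // prints page2.png
  }
}
```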
-logging-level <OFF|SEVERE|WARNING|INFO|CONFIG|FINE|FINER|FINEST> (property: loggingLevel) The logging level for outputting errors and debugging output. default: WARNING
-name <java.lang.String> (property: name) The name of the actor. default: TesseractOCR
-annotation <adams.core.base.BaseAnnotation> (property: annotations) The annotations to attach to this actor. default:
-skip <boolean> (property: skip) If set to true, transformation is skipped and the input token is just forwarded as it is. default: false
-stop-flow-on-error <boolean> (property: stopFlowOnError) If set to true, the flow gets stopped in case this actor encounters an error; useful for critical actors. default: false
-language <ALBANIAN|ARABIC|AZERBAUIJANI|BULGARIAN|CATALAN|CHEROKEE|CROATION|CZECH|DANISH|DANISH_FRAKTUR|DUTCH|ENGLISH|ESPERANTO|ESTONIAN|FINNISH|FRENCH|GALICIAN|GERMAN|GREEK|HEBREW|HINDI|HUNGARIAN|INDONESIAN|ITALIAN|JAPANESE|KOREAN|LATVIAN|LITHUANIAN|NORWEGIAN|OLD_ENGLISH|OLD_FRENCH|POLISH|PORTUGUESE|ROMANIAN|RUSSIAN|SERBIAN|SIMPLIFIED_CHINESE|SLOVAKIAN|SLOVENIAN|SPANISH|SWEDISH|TAGALOG|TAMIL|TELUGU|THAI|TRADITIONAL_CHINESE|TURKISH|UKRAINIAN|VIETNAMESE> (property: language) The language to use for OCR (must be installed). default: ENGLISH
-page-segmentation <OSD_ONLY|AUTO_WITH_OSD|AUTO_NO_OSD|FULL_AUTO_NO_OSD|SINGLE_COLUMN|SINGLE_VERTICAL_BLOCK|SINGLE_BLOCK|SINGLE_LINE|SINGLE_WORD|SINGLE_WORD_CIRCLE|SINGLE_CHARACTER> (property: pageSegmentation) The page segmentation to use. default: FULL_AUTO_NO_OSD
-output-base <adams.core.io.PlaceholderFile> (property: outputBase) The base name for the generated file(s). default: ${TMP}/outputbase
-output-text <boolean> (property: outputText) If enabled, the combined text of all generated files is output rather than the file names. default: false
-separator <java.lang.String> (property: separator) The separator used between the content of two files if text rather than the file names is forwarded; you can use special characters like \n and \t as well. default:
-output-hocr <boolean> (property: outputHOCR) If enabled, HTML files using the hOCR format are generated rather than ASCII files. default: false
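The language, page segmentation and hOCR options correspond to arguments of the tesseract command-line tool: `-l` for the language, `--psm` for the page segmentation mode, and the `hocr` config name for hOCR output. The sketch below assembles such an argument list. How the actor builds its command internally is not shown on this page, so the class `TesseractCommandSketch`, the use of a three-letter code such as `eng` for ENGLISH, and the mapping of the enum order to the numeric modes 0 to 10 are assumptions made for illustration. Older tesseract 3.x releases spell the flag `-psm`.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch (not part of ADAMS) of a tesseract command line. */
final class TesseractCommandSketch {

  /** Page segmentation modes, listed in the order of tesseract's numeric values 0..10. */
  enum Psm {
    OSD_ONLY, AUTO_WITH_OSD, AUTO_NO_OSD, FULL_AUTO_NO_OSD, SINGLE_COLUMN,
    SINGLE_VERTICAL_BLOCK, SINGLE_BLOCK, SINGLE_LINE, SINGLE_WORD,
    SINGLE_WORD_CIRCLE, SINGLE_CHARACTER
  }

  /** Builds: executable image outputBase -l lang --psm N [hocr]. */
  static List<String> buildCommand(String executable, String image, String outputBase,
                                   String langCode, Psm psm, boolean hocr) {
    List<String> cmd = new ArrayList<>();
    cmd.add(executable);
    cmd.add(image);
    cmd.add(outputBase);
    cmd.add("-l");
    cmd.add(langCode);
    cmd.add("--psm");
    cmd.add(Integer.toString(psm.ordinal())); // assumed: enum order equals numeric mode
    if (hocr)
      cmd.add("hocr"); // config name that switches the output to hOCR
    return cmd;
  }

  public static void main(String[] args) {
    // prints [tesseract, scan.png, /tmp/outputbase, -l, eng, --psm, 3, hocr]
    System.out.println(buildCommand("tesseract", "scan.png", "/tmp/outputbase",
      "eng", Psm.FULL_AUTO_NO_OSD, true));
  }
}
```

The resulting list can be passed to `java.lang.ProcessBuilder` when tesseract is installed and on the path.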
- Author:
- fracpete (fracpete at waikato dot ac dot nz)
- See Also:
- Serialized Form
-
-
Field Summary
Fields
Modifier and Type, Field - Description
protected TesseractConfiguration m_Configuration - the tesseract connection to use.
protected TesseractLanguage m_Language - the language to use.
protected PlaceholderFile m_OutputBase - the output base.
protected boolean m_OutputHOCR - whether to output hOCR instead of ASCII.
protected boolean m_OutputText - whether to output the OCRed text instead of the files.
protected TesseractPageSegmentation m_PageSegmentation - the page segmentation to use.
protected com.github.fracpete.processoutput4j.output.CollectingProcessOutput m_ProcessOutput - for executing tesseract.
protected String m_Separator - the separator between multiple text files.
-
Fields inherited from class adams.flow.transformer.AbstractTransformer
BACKUP_INPUT, BACKUP_OUTPUT, m_InputToken, m_OutputToken
-
Fields inherited from class adams.flow.core.AbstractActor
m_Annotations, m_BackupState, m_DetectedObjectVariables, m_DetectedVariables, m_ErrorHandler, m_Executed, m_Executing, m_ExecutionListeningSupporter, m_FullName, m_LoggingPrefix, m_Name, m_Parent, m_ScopeHandler, m_Self, m_Silent, m_Skip, m_StopFlowOnError, m_StopMessage, m_Stopped, m_StorageHandler, m_VariablesUpdated
-
Fields inherited from class adams.core.option.AbstractOptionHandler
m_OptionManager
-
Fields inherited from class adams.core.logging.LoggingObject
m_Logger, m_LoggingIsEnabled, m_LoggingLevel
-
Fields inherited from interface adams.flow.core.Actor
FILE_EXTENSION, FILE_EXTENSION_GZ
-
-
Constructor Summary
Constructors
TesseractOCR()
-
Method Summary
Modifier and Type, Method - Description
Class[] accepts() - Returns the class that the consumer accepts.
void defineOptions() - Adds options to the internal list of options.
protected String doExecute() - Executes the flow item.
Class[] generates() - Returns the class of objects that it generates.
TesseractLanguage getLanguage() - Returns the language to use.
PlaceholderFile getOutputBase() - Returns the base name for the generated file(s).
boolean getOutputHOCR() - Returns whether to use hOCR format as output instead of ASCII.
boolean getOutputText() - Returns whether to output the content of all files rather than the files.
TesseractPageSegmentation getPageSegmentation() - Returns the page segmentation to use.
String getQuickInfo() - Returns a quick info about the actor, which will be displayed in the GUI.
String getSeparator() - Returns the separator between text files, in case text is being output rather than file names.
String globalInfo() - Returns a string describing the object.
String languageTipText() - Returns the tip text for this property.
String outputBaseTipText() - Returns the tip text for this property.
String outputHOCRTipText() - Returns the tip text for this property.
String outputTextTipText() - Returns the tip text for this property.
String pageSegmentationTipText() - Returns the tip text for this property.
String separatorTipText() - Returns the tip text for this property.
void setLanguage(TesseractLanguage value) - Sets the language to use (needs to be installed).
void setOutputBase(PlaceholderFile value) - Sets the base name for the generated file(s).
void setOutputHOCR(boolean value) - Sets whether to use hOCR format instead of ASCII.
void setOutputText(boolean value) - Sets whether to output the content of all files rather than the files.
void setPageSegmentation(TesseractPageSegmentation value) - Sets the page segmentation to use.
void setSeparator(String value) - Sets the separator between text files, in case text is being output rather than file names.
String setUp() - Initializes the item for flow execution.
void stopExecution() - Stops the execution.
-
Methods inherited from class adams.flow.transformer.AbstractTransformer
backupState, currentInput, execute, hasInput, hasPendingOutput, input, output, postExecute, restoreState, wrapUp
-
Methods inherited from class adams.flow.core.AbstractActor
annotationsTipText, canInspectOptions, canPerformSetUpCheck, cleanUp, compareTo, configureLogger, destroy, equals, finalUpdateVariables, findVariables, findVariables, forceVariables, forCommandLine, forName, forName, getAdditionalInformation, getAnnotations, getDefaultName, getDetectedVariables, getErrorHandler, getFlowActors, getFlowExecutionListeningSupporter, getFullName, getName, getNextSibling, getParent, getParentComponent, getPreviousSibling, getRoot, getScopeHandler, getSilent, getSkip, getStopFlowOnError, getStopMessage, getStorageHandler, getVariables, handleError, handleException, hasErrorHandler, hasStopMessage, index, initialize, isBackedUp, isExecuted, isExecuting, isFinished, isHeadless, isStopped, nameTipText, performSetUpChecks, performVariableChecks, preExecute, pruneBackup, pruneBackup, reset, setAnnotations, setErrorHandler, setName, setParent, setSilent, setSkip, setStopFlowOnError, setVariables, shallowCopy, shallowCopy, silentTipText, sizeOf, skipTipText, stopExecution, stopFlowOnErrorTipText, updateDetectedVariables, updatePrefix, updateVariables, variableChanged
-
Methods inherited from class adams.core.option.AbstractOptionHandler
cleanUpOptions, finishInit, getDefaultLoggingLevel, getOptionManager, loggingLevelTipText, newOptionManager, setLoggingLevel, toCommandLine, toString
-
Methods inherited from class adams.core.logging.LoggingObject
getLogger, getLoggingLevel, initializeLogging, isLoggingEnabled
-
Methods inherited from class java.lang.Object
clone, finalize, getClass, hashCode, notify, notifyAll, wait, wait, wait
-
Methods inherited from interface adams.flow.core.Actor
cleanUp, compareTo, destroy, equals, findVariables, getAnnotations, getDefaultName, getDetectedVariables, getErrorHandler, getFlowExecutionListeningSupporter, getFullName, getName, getNextSibling, getParent, getParentComponent, getPreviousSibling, getRoot, getScopeHandler, getSilent, getSkip, getStopFlowOnError, getStopMessage, getStorageHandler, getVariables, handleError, hasErrorHandler, hasStopMessage, index, isExecuted, isFinished, isHeadless, isStopped, setAnnotations, setErrorHandler, setName, setParent, setSilent, setSkip, setStopFlowOnError, setVariables, shallowCopy, shallowCopy, sizeOf, stopExecution, toCommandLine, variableChanged
-
Methods inherited from interface adams.core.AdditionalInformationHandler
getAdditionalInformation
-
Methods inherited from interface adams.core.logging.LoggingLevelHandler
getLoggingLevel, setLoggingLevel
-
Methods inherited from interface adams.core.logging.LoggingSupporter
getLogger, isLoggingEnabled
-
Methods inherited from interface adams.core.option.OptionHandler
cleanUpOptions, getOptionManager
-
Methods inherited from interface adams.core.VariablesInspectionHandler
canInspectOptions
-
-
-
-
Field Detail
-
m_Language
protected TesseractLanguage m_Language
the language to use.
-
m_PageSegmentation
protected TesseractPageSegmentation m_PageSegmentation
the page segmentation to use.
-
m_OutputBase
protected PlaceholderFile m_OutputBase
the output base.
-
m_OutputText
protected boolean m_OutputText
whether to output the OCRed text instead of the files.
-
m_Separator
protected String m_Separator
the separator between multiple text files.
-
m_OutputHOCR
protected boolean m_OutputHOCR
whether to output hOCR instead of ASCII.
-
m_Configuration
protected TesseractConfiguration m_Configuration
the tesseract connection to use.
-
m_ProcessOutput
protected transient com.github.fracpete.processoutput4j.output.CollectingProcessOutput m_ProcessOutput
for executing tesseract.
-
-
Method Detail
-
globalInfo
public String globalInfo()
Returns a string describing the object.
- Specified by:
globalInfo in interface GlobalInfoSupporter
- Specified by:
globalInfo in class AbstractOptionHandler
- Returns:
- a description suitable for displaying in the gui
-
defineOptions
public void defineOptions()
Adds options to the internal list of options.
- Specified by:
defineOptions in interface OptionHandler
- Overrides:
defineOptions in class AbstractActor
-
getQuickInfo
public String getQuickInfo()
Returns a quick info about the actor, which will be displayed in the GUI.
- Specified by:
getQuickInfo in interface Actor
- Specified by:
getQuickInfo in interface QuickInfoSupporter
- Overrides:
getQuickInfo in class AbstractActor
- Returns:
- null if no info available, otherwise short string
-
setLanguage
public void setLanguage(TesseractLanguage value)
Sets the language to use (needs to be installed).
- Parameters:
value - the language
-
getLanguage
public TesseractLanguage getLanguage()
Returns the language to use.
- Returns:
- the language
-
languageTipText
public String languageTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the GUI or for listing the options.
-
setPageSegmentation
public void setPageSegmentation(TesseractPageSegmentation value)
Sets the page segmentation to use.
- Parameters:
value - the page segmentation
-
getPageSegmentation
public TesseractPageSegmentation getPageSegmentation()
Returns the page segmentation to use.
- Returns:
- the page segmentation
-
pageSegmentationTipText
public String pageSegmentationTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the GUI or for listing the options.
-
setOutputBase
public void setOutputBase(PlaceholderFile value)
Sets the base name for the generated file(s).
- Parameters:
value - the base name
-
getOutputBase
public PlaceholderFile getOutputBase()
Returns the base name for the generated file(s).
- Returns:
- the base name
-
outputBaseTipText
public String outputBaseTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the GUI or for listing the options.
-
setOutputText
public void setOutputText(boolean value)
Sets whether to output the content of all files rather than the files.
- Parameters:
value - true if to output the text
-
getOutputText
public boolean getOutputText()
Returns whether to output the content of all files rather than the files.
- Returns:
- true if text is output
-
outputTextTipText
public String outputTextTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the GUI or for listing the options.
-
setSeparator
public void setSeparator(String value)
Sets the separator between text files, in case text is being output rather than file names.
- Parameters:
value - the backquoted separator
-
getSeparator
public String getSeparator()
Returns the separator between text files, in case text is being output rather than file names.
- Returns:
- the backquoted separator
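The separator is stored in escaped form, so a sequence such as `\n` has to be turned into a real newline before the contents of the generated files are joined. The following is a minimal sketch of that idea using only the standard library. The class `SeparatorSketch` is hypothetical, and the set of escapes it handles (`\n`, `\t` and `\\`) is an assumption based on the option description; the actor's actual unescaping routine may cover more cases.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch (not part of ADAMS) of joining file contents with an escaped separator. */
final class SeparatorSketch {

  /** Replaces the two-character sequences \n, \t and \\ with newline, tab and backslash. */
  static String unescape(String s) {
    StringBuilder sb = new StringBuilder();
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (c == '\\' && i + 1 < s.length()) {
        char next = s.charAt(++i);
        switch (next) {
          case 'n':  sb.append('\n'); break;
          case 't':  sb.append('\t'); break;
          case '\\': sb.append('\\'); break;
          default:   sb.append('\\').append(next); // unknown escape: keep as typed
        }
      } else {
        sb.append(c);
      }
    }
    return sb.toString();
  }

  /** Joins the contents of several files using the unescaped separator. */
  static String combine(List<String> contents, String escapedSeparator) {
    return String.join(unescape(escapedSeparator), contents);
  }

  public static void main(String[] args) {
    // The separator is typed as backslash + n; the output has a real line break
    // before and after the dashes.
    System.out.println(combine(Arrays.asList("page one", "page two"), "\\n---\\n"));
  }
}
```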
-
separatorTipText
public String separatorTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the GUI or for listing the options.
-
setOutputHOCR
public void setOutputHOCR(boolean value)
Sets whether to use hOCR format instead of ASCII.
- Parameters:
value - true if to output hOCR
-
getOutputHOCR
public boolean getOutputHOCR()
Returns whether to use hOCR format as output instead of ASCII.
- Returns:
- true if to output hOCR
-
outputHOCRTipText
public String outputHOCRTipText()
Returns the tip text for this property.
- Returns:
- tip text for this property suitable for displaying in the GUI or for listing the options.
-
setUp
public String setUp()
Initializes the item for flow execution.
- Specified by:
setUp in interface Actor
- Overrides:
setUp in class AbstractActor
- Returns:
- null if everything is fine, otherwise error message
- See Also:
AbstractActor.reset()
-
accepts
public Class[] accepts()
Returns the class that the consumer accepts.
- Returns:
- the Class of objects that can be processed
-
generates
public Class[] generates()
Returns the class of objects that it generates.
- Returns:
- the Class of the generated tokens
-
doExecute
protected String doExecute()
Executes the flow item.
- Specified by:
doExecute in class AbstractActor
- Returns:
- null if everything is fine, otherwise error message
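Both setUp() and doExecute() signal success by returning null and failure by returning an error message. The sketch below illustrates how calling code can work with that convention. It is a stand-alone illustration using `java.util.function.Supplier` as a stand-in for an actor step; the class `NullOnSuccessSketch` is hypothetical and does not reflect how ADAMS itself drives its actors.

```java
import java.util.Arrays;
import java.util.List;
import java.util.function.Supplier;

/** Hypothetical sketch (not part of ADAMS) of the "null means success" convention. */
final class NullOnSuccessSketch {

  /** Runs the steps in order; returns null if all succeed, otherwise the first error message. */
  static String runAll(List<Supplier<String>> steps) {
    for (Supplier<String> step : steps) {
      String error = step.get();
      if (error != null)
        return error; // stop at the first failure
    }
    return null;
  }

  public static void main(String[] args) {
    List<Supplier<String>> steps = Arrays.asList(
      () -> null,                         // e.g. a successful set-up step
      () -> "OCR failed: file not found", // e.g. a failing execution step
      () -> null                          // never reached
    );
    String result = runAll(steps);
    System.out.println(result == null ? "OK" : "Error: " + result);
    // prints Error: OCR failed: file not found
  }
}
```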
-
stopExecution
public void stopExecution()
Stops the execution. No message set.
- Specified by:
stopExecution in interface Actor
- Specified by:
stopExecution in interface Stoppable
- Overrides:
stopExecution in class AbstractActor
-
-