adams.flow.transformer.TesseractOCR

Name

adams.flow.transformer.TesseractOCR

Synopsis

Applies OCR to the incoming image file using Tesseract.
On success, the actor forwards either the names of the generated files or the combined text of those files.
NB: The actor deletes all files that have the same prefix as the specified output base. Keep this in mind when running OCR in parallel or when generating other files with the same prefix.
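
The prefix-based cleanup mentioned in the note is easy to trip over. The following is a minimal sketch, not the actor's own code: it assumes that "same prefix" means a file name starting with the file name of the output base, and the class and method names are hypothetical. It flags the names in a directory listing that would collide with a given output base.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper, not part of ADAMS: illustrates prefix collisions only.
class OutputBasePrefixCheck {

  // Returns the names that start with the output base name.
  // Assumption: "same prefix" is a plain startsWith comparison on the file name.
  static List<String> namesSharingPrefix(String outputBaseName, List<String> fileNames) {
    List<String> result = new ArrayList<>();
    for (String name : fileNames) {
      if (name.startsWith(outputBaseName))
        result.add(name);
    }
    return result;
  }

  public static void main(String[] args) {
    // Made-up directory listing: the first two names share the prefix "outputbase".
    List<String> listing = Arrays.asList("outputbase.txt", "outputbase-2.txt", "report.txt");
    System.out.println(namesSharingPrefix("outputbase", listing));

    // A trailing delimiter keeps base "outputbase-1_" from matching "outputbase-10_...".
    List<String> parallel = Arrays.asList("outputbase-1_.txt", "outputbase-10_.txt");
    System.out.println(namesSharingPrefix("outputbase-1_", parallel));
  }
}
```

Under this assumption, giving each parallel worker an output base that ends in a delimiter (for example `outputbase-1_` rather than `outputbase-1`) keeps one worker's base from being a prefix of another's, such as `outputbase-10`.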

For more information on tesseract see:
https://github.com/tesseract-ocr/tesseract

For more information on hOCR see:
https://en.wikipedia.org/wiki/HOCR

Additional information

Flow input/output:
- input: java.lang.String, java.io.File, adams.data.image.AbstractImageContainer
- output: java.lang.String[]

Options

loggingLevel

The logging level for outputting errors and debugging output.

command-line -logging-level <OFF|SEVERE|WARNING|INFO|CONFIG|FINE|FINER|FINEST>

default WARNING

min-user-mode Expert
name

The name of the actor.

command-line -name <java.lang.String>

default TesseractOCR
annotations

The annotations to attach to this actor.

command-line -annotation <adams.core.base.BaseAnnotation>

default
skip

If set to true, transformation is skipped and the input token is just forwarded as it is.

command-line -skip <boolean>

default false
stopFlowOnError

If set to true, the flow execution at this level gets stopped in case this actor encounters an error; the error gets propagated; useful for critical actors.

command-line -stop-flow-on-error <boolean>

default false

min-user-mode Expert
silent

If enabled, no errors are output to the console. Note: the enclosing actor handler must have this enabled as well.

command-line -silent <boolean>

default false

min-user-mode Expert

language

The language to use for OCR (must be installed).

default ENGLISH

pageSegmentation

The page segmentation to use.

default FULL_AUTO_NO_OSD

outputBase

The base name for the generated file(s).

command-line -output-base <adams.core.io.PlaceholderFile>

default ${TMP}/outputbase
outputText

If enabled, the combined text of all generated files is output rather than the file names.

command-line -output-text <boolean>

default false
separator

The separator used between the contents of two files if text rather than the file names is forwarded; special characters such as \n and \t can be used as well.

command-line -separator <java.lang.String>

default
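
How the separator comes into play when text is forwarded can be pictured roughly as follows. This is a hedged sketch rather than the actor's implementation: the class and method names are hypothetical, and the unescaping is deliberately simplified to the two sequences named in the description.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration, not ADAMS code.
class SeparatorSketch {

  // Turns the two-character sequences \n and \t into a newline and a tab.
  // Simplification: other escapes, including an escaped backslash, are not handled.
  static String unescape(String separator) {
    return separator.replace("\\n", "\n").replace("\\t", "\t");
  }

  // Joins the per-file texts with the unescaped separator.
  static String combine(List<String> texts, String separator) {
    return String.join(unescape(separator), texts);
  }

  public static void main(String[] args) {
    // Two made-up file contents, separated by a line containing three dashes.
    System.out.println(combine(Arrays.asList("page one", "page two"), "\\n---\\n"));
  }
}
```

With an empty separator, as the default shown for this option suggests, the texts in this sketch are simply concatenated.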
outputHOCR

If enabled, HTML files using the hOCR format are generated rather than ASCII files.

command-line -output-hocr <boolean>

default false
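
When hOCR output is enabled, the generated HTML carries layout information such as bounding boxes in `title` attributes, using the form `bbox x0 y0 x1 y1`. The sketch below shows one way to pull those boxes out of a generated file. It is an illustration under assumptions: the sample markup is hand-written, the class and method names are hypothetical, and a regular expression stands in for a real HTML parser, which would be the more robust choice.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration, not ADAMS code.
class HocrBoxSketch {

  // Matches the hOCR bbox property: "bbox x0 y0 x1 y1".
  private static final Pattern BBOX = Pattern.compile("bbox (\\d+) (\\d+) (\\d+) (\\d+)");

  // Returns every bounding box found in the text as {x0, y0, x1, y1}.
  static List<int[]> boundingBoxes(String hocr) {
    List<int[]> boxes = new ArrayList<>();
    Matcher m = BBOX.matcher(hocr);
    while (m.find()) {
      int[] box = new int[4];
      for (int i = 0; i < 4; i++)
        box[i] = Integer.parseInt(m.group(i + 1));
      boxes.add(box);
    }
    return boxes;
  }

  public static void main(String[] args) {
    // Hand-written sample; real output may carry more attributes and properties.
    String sample = "<span class='ocrx_word' title='bbox 36 92 96 116; x_wconf 96'>Hello</span>";
    for (int[] box : boundingBoxes(sample))
      System.out.println(box[0] + "," + box[1] + " to " + box[2] + "," + box[3]);
  }
}
```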
