adams.flow.transformer.TesseractOCR
Applies OCR to the incoming image file using Tesseract.
If OCR succeeds, the actor forwards either the names of the generated files or the combined text of those files.
NB: The actor deletes all files that have the same prefix as the specified output base. Keep this in mind when running OCR in parallel or when generating other files with the same prefix.
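To make the warning above concrete, the following sketch lists the files that share a prefix with a given output base and derives a per-job output base that nothing else shares. It only illustrates prefix matching and is not the actor's own code; the helper names `files_sharing_prefix` and `unique_output_base` and the `ocrjob-` directory prefix are hypothetical.

```python
import tempfile
from pathlib import Path


def files_sharing_prefix(output_base):
    """Return files next to output_base whose names start with its file name."""
    base = Path(output_base)
    return sorted(p for p in base.parent.iterdir()
                  if p.is_file() and p.name.startswith(base.name))


def unique_output_base(directory, stem="outputbase"):
    """Create a fresh sub-directory per job so no two jobs share an output base.

    The "ocrjob-" directory prefix is an arbitrary choice for this sketch.
    """
    job_dir = Path(tempfile.mkdtemp(prefix="ocrjob-", dir=directory))
    return job_dir / stem
```

Calling `files_sharing_prefix` on the intended output base before starting a flow shows which existing files would be at risk; giving every parallel job its own directory via `unique_output_base` is one way to avoid the clash.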
For more information on tesseract see:
https://github.com/tesseract-ocr/tesseract
For more information on hOCR see:
https://en.wikipedia.org/wiki/HOCR
Flow input/output:
- input: java.lang.String, java.io.File, adams.data.image.AbstractImageContainer
- output: java.lang.String[]
The logging level for outputting errors and debugging output.
command-line | -logging-level <OFF|SEVERE|WARNING|INFO|CONFIG|FINE|FINER|FINEST> |
default | WARNING |
min-user-mode | Expert |
The name of the actor.
command-line | -name <java.lang.String> |
default | TesseractOCR |
The annotations to attach to this actor.
command-line | -annotation <adams.core.base.BaseAnnotation> |
default | |
If set to true, transformation is skipped and the input token is just forwarded as it is.
command-line | -skip <boolean> |
default | false |
If set to true, the flow execution at this level gets stopped in case this actor encounters an error; the error gets propagated; useful for critical actors.
command-line | -stop-flow-on-error <boolean> |
default | false |
min-user-mode | Expert |
If enabled, no errors are output in the console. Note: the enclosing actor handler must have this enabled as well.
command-line | -silent <boolean> |
default | false |
min-user-mode | Expert |
The language to use for OCR (must be installed).
command-line | -language <Albanian|Arabic|Azerbauijani|Bulgarian|Catalan|Cherokee|Croation|Czech|Danish|Danish Fraktur|Dutch|English|Esperanto|Estonian|Finnish|French|Galician|German|Greek|Hebrew|Hindi|Hungarian|Indonesian|Italian|Japanese|Korean|Latvian|Lithuanian|Norwegian|Old English|Old French|Polish|Portuguese|Romanian|Russian|Serbian|Simplified Chinese|Slovakian|Slovenian|Spanish|Swedish|Tagalog|Tamil|Telugu|Thai|Traditional Chinese|Turkish|Ukrainian|Vietnamese> |
default | ENGLISH |
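Since the chosen language must be installed, it can help to check what the local Tesseract offers before running a flow. The sketch below relies on the `tesseract --list-langs` command; it assumes that the first non-empty line of the output is a header and that each following line holds one language code (for example `eng`). The function names are hypothetical, and how the actor maps its language names to Tesseract codes is not covered here.

```python
import subprocess


def parse_list_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    # Assumption: the first non-empty line is a header such as
    # "List of available languages (2):"; the rest is one code per line.
    return set(lines[1:])


def installed_languages(executable="tesseract"):
    """Run Tesseract and return the set of installed language codes."""
    result = subprocess.run([executable, "--list-langs"],
                            capture_output=True, text=True, check=True)
    # Some Tesseract versions print the list on stderr rather than stdout.
    return parse_list_langs(result.stdout or result.stderr)
```

A check such as `"deu" in installed_languages()` then tells whether German OCR is likely to work on this machine.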
The page segmentation to use.
command-line | -page-segmentation <Orientation and script detection (OSD) only|Automatic page segmentation with OSD|Automatic page segmentation, but no OSD, or OCR|Fully automatic page segmentation, but no OSD|Assume a single column of text of variable sizes|Assume a single uniform block of vertically aligned text|Assume a single uniform block of text|Treat the image as a single text line|Treat the image as a single word|Treat the image as a single word in a circle|Treat the image as a single character> |
default | FULL_AUTO_NO_OSD |
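The eleven options above appear in the same order as page segmentation modes 0 to 10 of the Tesseract command-line tool, where mode 3 corresponds to the default. The lookup below is a reference sketch of that numbering; whether the actor passes exactly these numbers to Tesseract is an assumption, and `psm_number` is a hypothetical helper.

```python
# Tesseract page segmentation mode numbers and their descriptions.
PAGE_SEGMENTATION_MODES = {
    0: "Orientation and script detection (OSD) only",
    1: "Automatic page segmentation with OSD",
    2: "Automatic page segmentation, but no OSD, or OCR",
    3: "Fully automatic page segmentation, but no OSD",
    4: "Assume a single column of text of variable sizes",
    5: "Assume a single uniform block of vertically aligned text",
    6: "Assume a single uniform block of text",
    7: "Treat the image as a single text line",
    8: "Treat the image as a single word",
    9: "Treat the image as a single word in a circle",
    10: "Treat the image as a single character",
}


def psm_number(description):
    """Return the mode number for a description, or raise KeyError if unknown."""
    for number, text in PAGE_SEGMENTATION_MODES.items():
        if text == description:
            return number
    raise KeyError(description)
```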
The base name for the generated file(s).
command-line | -output-base <adams.core.io.PlaceholderFile> |
default | ${TMP}/outputbase |
If enabled, the combined text of all generated files is output rather than the file names.
command-line | -output-text <boolean> |
default | false |
The separator used between the contents of two files if text rather than the file names is forwarded; special characters such as \n and \t can be used as well.
command-line | -separator <java.lang.String> |
default | |
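The effect of the text output and separator options can be pictured with the following sketch, which joins the contents of several text files and turns the two-character sequences \n and \t in the separator into a real newline and tab. It is an illustration under assumptions (UTF-8 files, only these two escapes handled), not the actor's implementation; both function names are hypothetical.

```python
from pathlib import Path


def unescape_separator(separator):
    """Turn the literal sequences \\n and \\t into newline and tab characters."""
    return separator.replace("\\n", "\n").replace("\\t", "\t")


def combine_text(files, separator=""):
    """Join the contents of the given files, placing the separator between them."""
    sep = unescape_separator(separator)
    # Assumption: the generated files are UTF-8 encoded text.
    return sep.join(Path(f).read_text(encoding="utf-8") for f in files)
```

With the empty default separator, the file contents are simply concatenated, so a separator such as \n is useful when the individual files do not end in a line break.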
If enabled, HTML files using the hOCR format are generated rather than ASCII files.
command-line | -output-hocr <boolean> |
default | false |
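For orientation, the options above roughly correspond to arguments of the Tesseract command-line tool, whose general form is `tesseract imagename outputbase [options] [configfile]`. The sketch below assembles such a command as an argument list. It is not the command the actor actually runs: `build_tesseract_command` is a hypothetical helper, the language is given as a Tesseract code such as `eng`, and the `--psm` spelling applies to Tesseract 4 and later (3.x releases use `-psm`).

```python
def build_tesseract_command(image, output_base, language="eng", psm=3,
                            hocr=False, executable="tesseract"):
    """Assemble a Tesseract command line as a list suitable for subprocess.run."""
    cmd = [executable, str(image), str(output_base),
           "-l", language,        # language code, must be installed
           "--psm", str(psm)]     # page segmentation mode number
    if hocr:
        # The trailing "hocr" config name switches the output to hOCR.
        cmd.append("hocr")
    return cmd
```

For example, `build_tesseract_command("scan.png", "/tmp/outputbase", hocr=True)` yields a list that can be handed to `subprocess.run` unchanged; the file names here are placeholders.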