OCR a PDF Document

Overview

The 'OCR a PDF Document' flow action checks whether the provided PDF document contains a text layer. If a text layer does not exist, OCR is performed against the document.

Default Parameters

The default 'OCR a PDF Document' flow action parameters are detailed below:

  • Filename: The PDF filename (including file extension).
  • File Content: (Optional) A Base64 encoded representation of the PDF file to be processed.
  • Operation ID: (Optional) The ID of a parent operation. Please refer to: Flow Action Return Options: File Content vs. Operation ID
  • Clean Operations: Sets whether page-level clean-up operations should be performed. The default option will auto rotate, auto deskew and auto despeckle each page within the PDF document.
  • Remove Blank Pages: Sets whether blank pages should be removed from the resultant PDF document.
  • PDF/A Compliant: Sets whether the resulting document should conform to PDF/A format.


Please refer to the Obtaining the 'File Contents' Parameter article for guidance on how to obtain the 'File Content' parameter ready to provide to an Encodian flow action. 
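For readers preparing the 'File Content' value outside of a flow, the following is a minimal Python sketch of Base64 encoding a PDF's bytes. The function name and the dictionary shape are hypothetical and used for illustration only; the keys simply reuse the parameter names listed above and do not represent Encodian's actual request schema.

```python
import base64


def build_ocr_parameters(filename: str, pdf_bytes: bytes) -> dict:
    """Pair a filename with the Base64 text form of the PDF bytes.

    Hypothetical helper: the keys mirror the parameter names in this
    article, not a confirmed Encodian request format.
    """
    # b64encode returns bytes; decode to ASCII so the value is plain text.
    encoded = base64.b64encode(pdf_bytes).decode("ascii")
    return {
        "Filename": filename,     # must include the file extension
        "File Content": encoded,  # Base64 representation of the PDF
    }


if __name__ == "__main__":
    # Placeholder bytes standing in for a real PDF read from disk.
    params = build_ocr_parameters("scan.pdf", b"%PDF-1.7 example")
    print(params["Filename"], len(params["File Content"]))
```

Decoding the 'File Content' value with base64.b64decode should return the original bytes unchanged, which is a quick way to confirm the encoding step before passing the value on.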

Additional Guidance

If the PDF document contains a text layer and you wish to forcibly perform OCR, please ensure the 'Clean Operations' parameter is not set to 'None'.
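One way to read the guidance above is as a simple decision rule, sketched here in Python. This is an interpretation of the article's wording, not confirmed internal behaviour of the action; the function name is hypothetical, and any value other than 'None' (such as the "Default" label used below) is a placeholder.

```python
def ocr_will_run(has_text_layer: bool, clean_operations: str) -> bool:
    """Illustrative rule: OCR runs when the PDF has no text layer,
    or when 'Clean Operations' is set to anything other than 'None'.

    Assumption: this mirrors the article's guidance only.
    """
    return (not has_text_layer) or clean_operations != "None"


if __name__ == "__main__":
    print(ocr_will_run(has_text_layer=False, clean_operations="None"))
    print(ocr_will_run(has_text_layer=True, clean_operations="None"))
    # "Default" is a placeholder option label, not a documented value.
    print(ocr_will_run(has_text_layer=True, clean_operations="Default"))
```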

The 'OCR a PDF Document' flow action will increase the PDF document file size. 

Operation Count

OCR is a resource-intensive operation; therefore, a single OCR operation is recorded for every two pages OCR'd. For example, a 10-page document equates to 5 operations.
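The arithmetic can be sketched in Python as follows. The article does not state how an odd page count is handled, so this sketch assumes the count is rounded up (for example, 7 pages would count as 4 operations); treat that rounding as an assumption rather than confirmed behaviour. The function name is hypothetical.

```python
import math

# Stated in the article: one operation is recorded per two pages OCR'd.
PAGES_PER_OPERATION = 2


def estimate_ocr_operations(page_count: int) -> int:
    """Estimate the operations recorded for a document of page_count pages.

    Assumption: a trailing odd page is rounded up to a full operation.
    """
    if page_count < 0:
        raise ValueError("page_count must not be negative")
    return math.ceil(page_count / PAGES_PER_OPERATION)


if __name__ == "__main__":
    print(estimate_ocr_operations(10))  # 5, matching the example above
```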

Advanced Parameters

The advanced 'OCR a PDF Document' flow action parameters are detailed below:

  • PDF/A Compliance Level: Sets the required level of PDF/A compliance.
  • Rotate: Automatically detects the page orientation and rotates the page so that its text is upright.
  • Rotate Confidence Level: Sets the minimum confidence percentage (0 to 100) used to control whether the rotation is applied.
  • Deskew: Detects the skew angle and rotates to remove that skew.
  • Despeckle: Automatically detects speckles and removes them.
  • Adjust Brightness and Contrast: This action analyses a document and automatically adjusts brightness and contrast based on the analysis.
  • Remove Border: Locates border pixels and removes the pixels from the document.
  • Smooth Background: This works only on colour and grayscale documents. This operation smooths background colours to eliminate or de-emphasise noise.
  • Smooth Objects: This works only on bi-tonal documents. It looks at groups of pixels, finds isolated bumps and pits in the edges of those objects, and fills them in.
  • Remove Dot Shading: This action will remove shaded regions from bi-tonal documents.
  • Image Detergent: Image Detergent works by changing pixels of similar colour values to a central colour value, which has the result of smoothing the image wherever regions of those colours appear.
  • Average Filter: Performs a 3x3 Average filter smoothing operation on the document, placing the output in the centre of the window.
  • Remove Hole Punch: Detects and removes hole punch marks from a bi-tonal document.
  • Binarize: Computes all necessary parameters by analysing the input data before actually performing the binarization. The algorithm is tuned to typical document images, consisting of dark text on brighter background. It is robust to shadows, noise and background images.
  • Final Operation: Sets whether this is the last Encodian flow action.
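To illustrate what the 'Average Filter' option describes, the following is a minimal pure-Python sketch of a generic 3x3 average (mean) filter over a grayscale image held as a list of rows. It is not Encodian's implementation. Leaving border pixels unchanged and using integer division are assumptions made to keep the example short.

```python
def average_filter_3x3(image: list[list[int]]) -> list[list[int]]:
    """Replace each interior pixel with the mean of its 3x3 window.

    The averaged value is written to the centre position of the window.
    Assumption: border pixels, which lack a full window, are left as-is.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    # Copy the input so every window is computed from original values.
    result = [row[:] for row in image]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            total = sum(
                image[y + dy][x + dx]
                for dy in (-1, 0, 1)
                for dx in (-1, 0, 1)
            )
            result[y][x] = total // 9  # integer mean of the nine pixels
    return result


if __name__ == "__main__":
    # A single bright speck on a dark background is spread out and dimmed.
    speck = [
        [0, 0, 0],
        [0, 90, 0],
        [0, 0, 0],
    ]
    print(average_filter_3x3(speck)[1][1])  # 10
```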


Please refer to the Flow Action Return Options: File Content vs. Operation ID article for further details on the 'Final Operation' parameter.
