Extract Text from Regions

Overview

The 'Extract Text from Regions' flow action extracts text from specified regions of a PDF document and returns an array of the extracted text.

Although this action only extracts text regions from PDF documents, other files can first be converted to PDF using the 'Convert to PDF' flow action. This enables text regions to be extracted from 70+ different file types.

Please refer to the Supported Document Types article for a complete list of the file formats / document types which are supported for PDF conversion.

Example Flow

Please refer to the following article, which shows how to 'Zonally extract data from documents with Microsoft Flow'.

Parameters

The default 'Extract Text from Regions' flow action parameters are detailed below:

  • Filename: The PDF filename (including the file extension).
  • File Content: A Base64 encoded representation of the PDF file to be processed.
  • Text Regions: An array of Text Regions (see 'Text Region Detail' below for further details).


Please refer to the Obtaining the 'File Contents' Parameter article for guidance on how to obtain the 'File Content' parameter ready to provide to an Encodian flow action. 
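Within a flow the 'File Content' value normally comes from a preceding action. When preparing test input outside of a flow, the Base64 representation can be produced along the lines of the sketch below. The function name and the sample bytes are illustrative assumptions and are not part of the Encodian action.

```python
import base64


def encode_file_content(pdf_bytes: bytes) -> str:
    """Return the Base64 text representation of raw PDF bytes."""
    return base64.b64encode(pdf_bytes).decode("ascii")


# Illustrative input only. A real call would pass the bytes of a PDF file,
# e.g. pathlib.Path("textRegionsDemo.pdf").read_bytes()
sample_bytes = b"%PDF-1.7 sample"
file_content = encode_file_content(sample_bytes)
```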

Text Region Generator Tool

Please use the 'Text Region Generator' tool to automatically determine the required coordinates.


Text Region Detail

A text region is specified as a rectangle defined by four coordinates: the X and Y coordinates of its lower left corner and the X and Y coordinates of its upper right corner.

The origin (0,0) of the coordinate system is in the bottom left-hand corner of the page. Coordinates are specified in points; a typical A4 page is 595 x 842 points.

  • Text Region - Multiple text regions can be selected in one operation. To create more than one region, click the "Add new item" button:
    • Text Regions Name: Provide a name with which to reference the extracted region
    • Text Regions Lower Left X Coordinate: Number of points across from the left-hand edge of the page to the lower left corner of the rectangle
    • Text Regions Lower Left Y Coordinate: Number of points up from the bottom edge of the page to the lower left corner of the rectangle
    • Text Regions Upper Right X Coordinate: Number of points across from the left-hand edge of the page to the upper right corner of the rectangle
    • Text Regions Upper Right Y Coordinate: Number of points up from the bottom edge of the page to the upper right corner of the rectangle
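The coordinate rules above can be sketched as a small data structure. This is a minimal illustration only: the class, field and function names are hypothetical, the validation reflects the geometry described in this article rather than any documented service behaviour, and the top-left conversion assumes the measurements were taken in points from a tool whose origin is the top left of the page.

```python
from dataclasses import dataclass

# Typical A4 page size in points, as stated in the article.
A4_WIDTH_PT = 595
A4_HEIGHT_PT = 842


@dataclass(frozen=True)
class TextRegion:
    """Hypothetical container mirroring the five Text Region fields."""
    name: str
    lower_left_x: float
    lower_left_y: float
    upper_right_x: float
    upper_right_y: float

    def validate(self, page_width=A4_WIDTH_PT, page_height=A4_HEIGHT_PT):
        """Raise ValueError if the rectangle is inverted or off the page."""
        if not (0 <= self.lower_left_x < self.upper_right_x <= page_width):
            raise ValueError(f"{self.name}: invalid X coordinates")
        if not (0 <= self.lower_left_y < self.upper_right_y <= page_height):
            raise ValueError(f"{self.name}: invalid Y coordinates")


def region_from_top_left(name, left, top, width, height,
                         page_height=A4_HEIGHT_PT):
    """Convert a box measured from the top left corner of the page into
    the bottom-left-origin coordinates described above."""
    return TextRegion(
        name=name,
        lower_left_x=left,
        lower_left_y=page_height - top - height,
        upper_right_x=left + width,
        upper_right_y=page_height - top,
    )


# A 150 x 30 point box placed 400 points from the left and 50 from the top.
region = region_from_top_left("InvoiceNumber", 400, 50, 150, 30)
region.validate()
```

The 'Text Region Generator' tool remains the simplest way to obtain coordinates; a conversion like this is only relevant when the measurements come from another source.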

Operation Count

The final operation count is determined using the following calculation:

Extract Regions Operation (1) + (No. of Extractions / 10, rounded down) = Total Operation Count

For example:

An action extracting 9 regions:

1 + (9/10) = 1 operation

The result of the division is rounded down, not up.

An action extracting 11 regions:

1 + (11/10) = 2 operations

An action extracting 102 regions:

1 + (102/10) = 11 operations
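The calculation above can be expressed as a short function, which may help when estimating usage ahead of time. The function name is an illustrative assumption; the arithmetic follows the rounding-down rule stated in this section.

```python
def operation_count(region_count: int) -> int:
    """Return 1 for the extraction itself plus one operation for every
    complete block of 10 regions (integer division rounds down)."""
    if region_count < 0:
        raise ValueError("region_count must not be negative")
    return 1 + region_count // 10


# The three worked examples from this section.
examples = {n: operation_count(n) for n in (9, 11, 102)}
```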

Advanced Parameters

The advanced 'Extract Text from Regions' flow action parameters are detailed below:

[Screenshot: advanced parameters of the 'Extract Text from Regions' flow action]

Return Parameters

The 'Extract Text from Regions' flow action returns the following data.

Action Specific Values

  • Text Region Results Simple - A collection of results for each text region in a simplified format (key / value pairs).

A partial example response payload (JSON) is detailed below:

"TextRegionResultsSimple": {
    "Region1": "Region1 Value",
    "Region2": "Region2 Value",
    "Region3": "Region3 Value"
}

  • Text Region Results - An array of results for each text region specified.

A partial example response payload (JSON) is detailed below:

"TextRegionResults": [
    {
        "Name": "Extracted Region Name",
        "Text": "This is text extracted from the demo region",
        "PageNumber": 1
    }
]

To obtain a value from the 'Text Region Results' array, a standard 'Filter array' flow action can be used, for example by filtering on the 'Name' property of each result.
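For readers who process the response outside of a flow, the same lookup can be sketched in code. This is a minimal illustration: the function name is hypothetical and the sample data simply mirrors the partial 'TextRegionResults' payload shown above.

```python
def filter_results_by_name(results, name):
    """Return every result whose "Name" equals the requested region name."""
    return [item for item in results if item.get("Name") == name]


# Sample data shaped like the partial 'TextRegionResults' payload.
text_region_results = [
    {
        "Name": "Extracted Region Name",
        "Text": "This is text extracted from the demo region",
        "PageNumber": 1,
    },
]

matches = filter_results_by_name(text_region_results, "Extracted Region Name")
texts = [item["Text"] for item in matches]
```

Like a filter step, the function returns a list, so an unknown region name yields an empty list rather than an error.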

Standard Return Values

  • Filename - The filename of the document.
  • FileContent - The processed document content.
  • OperationId - The unique ID assigned to this operation.
  • HttpStatusCode - The HTTP Status code for the response.
  • HttpStatusMessage - The HTTP Status message for the response.
  • Errors - An array of error messages should an error occur.
  • Operation Status - Indicates whether the operation has been completed, has been queued or has failed.

A complete example return payload (JSON) is detailed below:

{
    "TextRegionResults": [
        {
            "Name": "Extracted Region Name",
            "Text": "This is text extracted from the demo region",
            "PageNumber": 4
        }
    ],
    "HttpStatusCode": 200,
    "HttpStatusMessage": "",
    "OperationId": "**********-****-****-****-************",
    "Errors": [],
    "Operation Status": "Complete",
    "Filename": "textRegionsDemo.pdf",
    "FileContent": null
}
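A consumer of this payload would typically check the status fields before reading the results. The sketch below shows one way to do that, assuming the response has exactly the shape of the example above. The function name and the decision to treat any non-200 status code or non-empty 'Errors' array as a failure are assumptions, not documented behaviour.

```python
import json


def read_region_texts(raw_json: str) -> dict:
    """Parse a response and return a mapping of region name to text.

    Raises RuntimeError when the status code is not 200 or when the
    'Errors' array is not empty (an assumed failure rule).
    """
    response = json.loads(raw_json)
    errors = response.get("Errors") or []
    if response.get("HttpStatusCode") != 200 or errors:
        raise RuntimeError(f"Extraction failed: {errors}")
    return {
        item["Name"]: item["Text"]
        for item in response.get("TextRegionResults", [])
    }


# The complete example payload from this article.
sample = """
{
    "TextRegionResults": [
        {
            "Name": "Extracted Region Name",
            "Text": "This is text extracted from the demo region",
            "PageNumber": 4
        }
    ],
    "HttpStatusCode": 200,
    "HttpStatusMessage": "",
    "OperationId": "**********-****-****-****-************",
    "Errors": [],
    "Operation Status": "Complete",
    "Filename": "textRegionsDemo.pdf",
    "FileContent": null
}
"""

region_texts = read_region_texts(sample)
```

If two regions share a name, the later one overwrites the earlier one in the returned mapping, so unique region names are advisable with this approach.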

 

 
