In the 'Perform OCR' configuration mode, the user can add one or more SharePoint sites to the configuration and perform OCR against PDFs held in all or selected document libraries.
The managed path drop-down is used to select one of the predefined SharePoint Online managed paths. 'None' is provided as an option so that the root site of the tenant can be targeted.
When a site is added, a settings window is presented in which the user can choose 'All Libraries' or select specific libraries to perform OCR against.
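As an illustration of how the managed path selection maps to a site address, the following minimal sketch composes a SharePoint Online URL from its parts. The function name build_site_url, the tenant name "contoso" and the site name "Finance" are hypothetical and are not part of Indxr; the sketch only assumes the standard SharePoint Online URL layout.

```python
from typing import Optional


def build_site_url(tenant: str, managed_path: Optional[str], site_name: str = "") -> str:
    """Compose a SharePoint Online site URL (hypothetical helper, not an Indxr API)."""
    root = f"https://{tenant}.sharepoint.com"
    if managed_path is None:
        # Equivalent to choosing 'None' in the drop-down: target the tenant root site.
        return root
    # Otherwise the site lives under the chosen managed path, e.g. "sites" or "teams".
    return f"{root}/{managed_path.strip('/')}/{site_name}"


print(build_site_url("contoso", "sites", "Finance"))
print(build_site_url("contoso", None))
```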
The following settings are provided to configure the OCR job:
- Copy Metadata - will transfer the metadata from the source file to the destination file
- Copy Permissions - will transfer any unique permissions from the source file to the destination file
- Overwrite Source File - will replace the source file in the same library with the new file
- New File Prefix - will prepend the configured string to the destination filename
- Force OCR - will OCR the document regardless of whether an existing text layer is detected
- Clean up images before OCR - will perform basic image clean up operations such as deskew and despeckle prior to OCR
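The settings above can be pictured as a small set of flags per site. The sketch below is illustrative only: the OcrSettings class, its field names and its defaults are hypothetical and do not reflect the .indxr file format. It also makes two assumptions that the settings list does not state explicitly: that the prefix is not applied when Overwrite Source File is enabled, and that a document with an existing text layer is skipped unless Force OCR is enabled.

```python
from dataclasses import dataclass


@dataclass
class OcrSettings:
    """Hypothetical model of the per-site OCR settings (not the .indxr schema)."""
    copy_metadata: bool = True
    copy_permissions: bool = True
    overwrite_source: bool = False
    new_file_prefix: str = ""
    force_ocr: bool = False
    clean_images: bool = False


def destination_name(source_name: str, settings: OcrSettings) -> str:
    # Assumption: when overwriting, the new file keeps the source filename.
    if settings.overwrite_source:
        return source_name
    return settings.new_file_prefix + source_name


def should_ocr(has_text_layer: bool, settings: OcrSettings) -> bool:
    # Assumption: documents that already have a text layer are skipped
    # unless Force OCR is enabled.
    return settings.force_ocr or not has_text_layer


settings = OcrSettings(new_file_prefix="OCR_")
print(destination_name("invoice.pdf", settings))
print(should_ocr(has_text_layer=True, settings=settings))
```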
Once added, the site will appear in the list. The pencil icon can be used to edit the settings for the site, and the bin icon will remove the site from the OCR operation.
Save Configuration can be used to store the configuration in a separate .indxr file that can be reloaded at a later date.
Once all the required sites have been added, click "Execute" to run the OCR job.
Progress of the job is displayed while it runs, and the results are shown when the job has completed.
When complete, the results can be exported to a .csv file for further analysis or distribution, and the .txt log file can be opened in a suitable application.
Handling Large Libraries
To improve performance and allow for concurrent processing, we recommend that, where possible, large document libraries are split into smaller libraries with item counts below the 5,000-item SharePoint list view threshold. Indxr will work across larger repositories, but processing times may be affected.
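As a rough planning aid, the number of smaller libraries needed to keep each one below the threshold is a simple ceiling division. The sketch below is not part of Indxr; the function name libraries_needed is hypothetical, and it assumes "below the threshold" means at most 4,999 items per library.

```python
import math

LIST_VIEW_THRESHOLD = 5000  # SharePoint list view threshold


def libraries_needed(item_count: int, max_items: int = LIST_VIEW_THRESHOLD - 1) -> int:
    """Return how many libraries are needed so that each holds at most max_items."""
    if item_count < 0:
        raise ValueError("item_count must not be negative")
    return math.ceil(item_count / max_items)


print(libraries_needed(12000))  # 3
```

For example, a library of 12,000 PDFs would need to be split into at least three libraries under this assumption.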
OCR is a complex and resource-intensive operation. Processing large repositories can easily take days; however, Indxr is designed so that multiple instances can be executed simultaneously with no additional licensing overhead.
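One way to plan a multi-instance run is to divide the list of sites evenly so that each instance is given its own batch. The sketch below shows a simple round-robin split. It is a planning aid only, not an Indxr feature: the function name partition_sites and the example URLs are hypothetical, and how each batch is loaded into an instance is left to the user.

```python
from typing import List


def partition_sites(site_urls: List[str], instance_count: int) -> List[List[str]]:
    """Distribute site URLs across instances in round-robin order."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    batches: List[List[str]] = [[] for _ in range(instance_count)]
    for index, url in enumerate(site_urls):
        # Site 0 goes to instance 0, site 1 to instance 1, and so on, wrapping around.
        batches[index % instance_count].append(url)
    return batches


sites = [
    "https://contoso.sharepoint.com/sites/Finance",
    "https://contoso.sharepoint.com/sites/Legal",
    "https://contoso.sharepoint.com/sites/HR",
]
for number, batch in enumerate(partition_sites(sites, 2), start=1):
    print(f"Instance {number}: {batch}")
```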