The Indxr application is designed to perform OCR on PDF documents stored within a SharePoint Online document library.
Upon launching the application the user will be presented with the Application Settings page.
The following information is required:
- API Key - this will be provided to you when you purchase an Indxr licence
- Log File Location - the file system location to store the Indxr log file
- Execution Staging Directory - the file system location where documents can be stored whilst being processed
- Language - the default OCR language, which should match the language used in the files
- Authentication Refresh Time (Minutes) - the time in minutes before the user authentication context is refreshed
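As a pre-flight aid, the settings above can be sanity-checked before they are typed into the Application Settings page. The following is only a minimal sketch: the class name, field names and checks are hypothetical and are not part of Indxr, which collects these values through its own user interface.

```python
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class AppSettings:
    # Hypothetical field names that mirror the settings listed above.
    api_key: str
    log_file_location: str
    staging_directory: str
    language: str
    auth_refresh_minutes: int


def find_problems(settings: AppSettings) -> List[str]:
    """Return a list of problems; an empty list means the values look usable."""
    problems = []
    if not settings.api_key.strip():
        problems.append("API Key is empty")
    if not settings.log_file_location.strip():
        problems.append("Log File Location is empty")
    # Assumption: the staging directory should already exist on the host machine.
    if not Path(settings.staging_directory).is_dir():
        problems.append("Execution Staging Directory does not exist")
    if not settings.language.strip():
        problems.append("Language is empty")
    if settings.auth_refresh_minutes <= 0:
        problems.append("Authentication Refresh Time must be a positive number of minutes")
    return problems
```

Calling find_problems with a filled-in AppSettings returns a list of plain-language messages, which can be reviewed before the application is launched.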
You will next be asked to enter the URL of the SharePoint Online site where the documents requiring OCR are stored and select an authentication method.
The available authentication methods are:
- Email and Password - this is the recommended approach. If MFA is enforced on your account, this approach will not work with your standard password; instead, you will need to create an application password. Instructions on how to do this for your organisation's configuration can be found in this Microsoft support article
- Browser - this method will use the authentication session currently in use by your browser. If there is no active session, clicking Next will prompt the user to enter their SharePoint Online credentials and, if necessary, complete multi-factor authentication.
Subject to successful authentication the user can then configure SharePoint and Performance settings.
SharePoint settings include:
- Source Library - the library containing the documents that you would like to OCR
- Source Folder - the specific folder containing the documents for OCR. If the source library is below the SharePoint List View Threshold then this list will be populated with available top-level folders. If the number of items in the library exceeds the list view threshold then you will be able to manually input the folder name which will be validated prior to execution. Only folders containing documents will be displayed.
- Target Library - the library where the OCRed documents will be created. This can be the same as the source library.
- Copy Metadata - transfers metadata from the source document to the target document. This will only succeed if the necessary fields are configured on the target library.
- Copy Permission - transfers the permissions from the source document to the target document. Please note that the user executing Indxr will require full control permissions on the source files in order to be able to transfer permissions to the destination.
- Force OCR - ensures the application carries out OCR on all PDF files, regardless of whether they already have a text layer.
- Overwrite Source File - available if the source and target libraries are the same.
- New File Prefix - available if the source and target libraries are the same but Overwrite Source File is not selected. The prefix ensures that a unique filename is created.
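The interaction between the Overwrite Source File and New File Prefix options can be sketched as a small decision function. This is an illustration only, not Indxr's implementation: the function and parameter names are hypothetical, and it assumes that the output keeps the source file name when the libraries differ and that the prefix is placed directly in front of the file name.

```python
def resolve_target_name(
    source_library: str,
    target_library: str,
    file_name: str,
    overwrite_source: bool = False,
    new_file_prefix: str = "",
) -> str:
    """Return the file name the OCRed document would be saved under."""
    if source_library != target_library:
        # Assumption: with different libraries there is no name clash,
        # so the original file name is kept.
        return file_name
    if overwrite_source:
        # Same library and overwrite selected: the source file is replaced.
        return file_name
    if not new_file_prefix:
        # Same library without overwrite: a prefix is needed for a unique name.
        raise ValueError("New File Prefix is required when not overwriting the source file")
    return new_file_prefix + file_name
```

For example, with the same source and target library, overwrite not selected and a prefix of "OCR_", a file named "invoice.pdf" would be written as "OCR_invoice.pdf".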
Performance settings allow the user to configure the following items.
- Document Threads control how many documents can be processed concurrently*
- Page Threads control how many pages can be processed concurrently*
- Performance can be set to either fast or quality. Quality will provide the best results but will increase processing times
- Image Clean up will perform deskew, despeckle, rotation and brightness and contrast adjustments before OCR is performed
* When adjusting these values, consideration needs to be given to the hardware of the machine hosting the tool. If processing power and memory on the host machine are not sufficient then the tool will not be able to execute at the configured performance levels. Setting both values to the maximum available will not necessarily result in better performance, as it may cause more locking when multiple threads attempt to access shared resources.
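One rough way to judge whether a pair of thread settings suits the host machine is to compare the worst-case number of concurrent workers with the number of CPU cores. The sketch below rests on an assumption that is not confirmed by this guide: that each document thread can run up to the configured number of page threads at once, so the two values multiply. The function names are hypothetical.

```python
import os
from typing import Optional


def worst_case_workers(document_threads: int, page_threads: int) -> int:
    # Assumption: every document thread may run page_threads page workers at once.
    return document_threads * page_threads


def oversubscribed(
    document_threads: int,
    page_threads: int,
    cpu_count: Optional[int] = None,
) -> bool:
    """Return True if the worst-case worker count exceeds the available cores."""
    cores = cpu_count or os.cpu_count() or 1
    return worst_case_workers(document_threads, page_threads) > cores
```

A True result does not mean the job will fail; it only suggests that lower values may be worth trying first while monitoring the host machine.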
Clicking "Execute" will start the OCR job.
A summary is provided of the settings associated with the job. The console will be dynamically updated as each document is processed and a summary of the processing will be provided at the conclusion of the execution.
The user has the option to export the console to a CSV file and also open the log file to examine any processing errors.
The entire operation can be cancelled at any point by the user clicking the "Cancel" button. This will complete the processing of the current document before stopping the execution job.
Multiple instances of Indxr can be executed simultaneously targeting different document libraries. However, it is recommended that the performance of the host machine is monitored closely.
Handling Large Libraries
To improve performance and allow for concurrent processing, we recommend that, where possible, large document libraries are split into smaller libraries with item counts below the SharePoint list view threshold of 5,000 items. Indxr will work across larger repositories but processing times may be affected.
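When planning such a split, the minimum number of smaller libraries follows from simple arithmetic. This is a planning sketch only; the function name is hypothetical, and it interprets "below the threshold" as at most 4,999 items per library.

```python
LIST_VIEW_THRESHOLD = 5000  # SharePoint list view threshold


def libraries_needed(item_count: int, threshold: int = LIST_VIEW_THRESHOLD) -> int:
    """Return the minimum number of libraries that keeps each one below the threshold."""
    if item_count <= 0:
        return 0
    per_library = threshold - 1  # "below the threshold" means at most threshold - 1 items
    # Ceiling division using integers only.
    return -(-item_count // per_library)
```

For example, a library of 12,000 documents would need at least three smaller libraries under this interpretation.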
Handling Expectations
OCR is a complex and resource-intensive operation. Processing of large repositories can easily take days or weeks. Indxr is designed so that multiple instances can be executed simultaneously with no additional licensing overhead.