PDF Import Scrape

PDF import with a predefined scrape profile provides batch importing and automatic indexing, based on criteria set within the task and scrape profiles. The import runs as a Windows server Task Scheduler job.

During the capture process, data can be extracted for indexing from the first page of the PDF, the PDF filename, or both. To extract indexing data from the first page, the PDF must be searchable and must match the layout of the original PDF used to create the scrape profile. To extract indexing data from the PDF filename, the PDF filename rule(s) must be set within the task profile.

3 Ways to Extract Index Data:

Extract data from the PDF filename.
Extract data from the PDF document (the PDF must contain searchable text).
Extract data from the PDF filename and document (the PDF must contain searchable text).
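
As a rough illustration of the filename option, the Python sketch below pulls index values out of fixed character positions in a filename. The filename layout, the folder, and the function name index_from_filename are assumptions made for this example only. In the product, filename rules are set within the task profile rather than written as code.

```python
from pathlib import PureWindowsPath


def index_from_filename(pdf_path, start, length):
    """Return `length` characters of the filename (without extension),
    beginning at the 0-based position `start`."""
    stem = PureWindowsPath(pdf_path).stem
    return stem[start:start + length]


# Hypothetical filename layout: INV_<6-digit number>_<YYYYMMDD>.pdf
path = r"C:\imports\INV_000123_20240115.pdf"

invoice_number = index_from_filename(path, 4, 6)   # characters 4..9
invoice_date = index_from_filename(path, 11, 8)    # characters 11..18

print(invoice_number, invoice_date)
```

A fixed-position approach like this only works when every filename follows the same layout, so the value always sits at the same position.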

1. Create Scrape Profile

From Maintenance, click Maintain Scrape Profiles.

Click Create New Profile.

Enter a Profile Name, Description, and select a System.

Click Create New Profile to finalize.

Click Load PDF Data.

Navigate to the PDF then click Select PDF.

Click Import PDF.

Click the Search Data tab to view the first page of the PDF. Specify a Unique Scrape Identifier at the top of the screen.

The Unique Scrape Identifier is the specific, case-sensitive identifier for the PDF being captured. To use one or more values from the PDF document as the identifier, make sure the value does not exist in any other document that will be defined in scrape maintenance. It must be completely unique, whether you use a single value or concatenate multiple values using double quotes.

Example: Two PDFs are titled PDF Report AAA Purchasing and PDF Report AAA. Because PDF Report AAA appears in both PDFs and the only difference is the word Purchasing, you will NOT be able to use PDF Report AAA as the unique ID.
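
The example above can be checked with a small Python sketch. It assumes the identifier is matched as a case-sensitive substring of the first-page text, which this manual does not spell out. The file names and the function documents_matching are hypothetical.

```python
def documents_matching(identifier, first_pages):
    """Return the names of all documents whose first-page text contains
    the identifier (case-sensitive substring match, an assumption)."""
    return [name for name, text in first_pages.items() if identifier in text]


# Hypothetical first-page text for the two reports from the example.
first_pages = {
    "purchasing.pdf": "PDF Report AAA Purchasing",
    "general.pdf": "PDF Report AAA",
}

ambiguous = documents_matching("PDF Report AAA", first_pages)
wrong_case = documents_matching("pdf report aaa", first_pages)

print(ambiguous)    # both documents match, so this value is not unique
print(wrong_case)   # no matches, because the comparison is case sensitive
```

An identifier is only usable when it matches exactly one of the documents defined in scrape maintenance.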

In order to use the scrape profile name as the unique identifier, the Scrape Profile name and Unique Scrape Identifier must match exactly. The scrape profile name will also need to match the task profile name when it is created in Step 2.

If you are able to define a unique identifier for all your PDF documents, you will be able to process all the imports from one folder. If you are not able to define a unique scrape identifier for all your PDF documents, you must put each document in a separate folder.

Once the Unique Scrape Identifier is defined, click Update Profile.

Click the Index Definitions tab.

Predefined rules are used to identify the location of index data. Variable data on the form can change where the information is located.

Example: An address may contain three address lines on one form and two address lines on another form. You may have to use different PDF documents when setting up and testing scrape definitions.

Click Rule Selection to access the available index definitions. Pick an algorithm to use by double-clicking the entry.

Maintenance - Scrape Profile Rule Selection

Even with this many rules, it is possible that some data cannot be located. If an index value cannot be defined, it may be possible to supply that value in the PDF filename, which can be used as long as the value is in a fixed location.

Select Value Maintenance to define the Start Phrase.

If this index is mandatory, check the box next to ME (mandatory entry). During the import process, if any ME field is blank, the PDF is moved to batch indexing.

Repeat these steps for every index you want to define.
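
The interaction between a Start Phrase and the ME flag can be sketched in Python as follows. This is a simplified illustration, not the product's rule engine: it assumes a single hypothetical rule ("take the rest of the line after the start phrase"), and the sample page text, index names, and function names are invented for the example.

```python
def value_after(start_phrase, page_text):
    """Return the text that follows start_phrase on its line, or "" if
    the phrase is not found on any line."""
    for line in page_text.splitlines():
        pos = line.find(start_phrase)
        if pos != -1:
            return line[pos + len(start_phrase):].strip()
    return ""


def scrape_indexes(page_text, definitions):
    """definitions maps index name -> (start phrase, mandatory flag).
    Returns the extracted values and the mandatory indexes left blank."""
    values = {name: value_after(phrase, page_text)
              for name, (phrase, _) in definitions.items()}
    missing = [name for name, (_, mandatory) in definitions.items()
               if mandatory and not values[name]]
    return values, missing


# Hypothetical first-page text and index definitions.
page = "Invoice No: 000123\nCustomer: ACME Supply\nPO Number:\n"
definitions = {
    "invoice": ("Invoice No:", True),    # ME checked
    "customer": ("Customer:", False),    # ME not checked
    "po": ("PO Number:", True),          # ME checked, but blank on this page
}

values, missing = scrape_indexes(page, definitions)
needs_batch_indexing = bool(missing)
print(values, missing, needs_batch_indexing)
```

Here the blank mandatory index is what sends the document to batch indexing. A blank non-mandatory index would not.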

Verify the capture results by selecting the Search Data tab.

Click Show Detail to run the process and display the results.

2. Create Task Profile

From Maintenance, click Maintain Task Profiles.

Click Add Import Profile.

Enter Profile Name, Profile Description, Image System ID.

NOTE: The task profile name must match the scrape profile name if your unique scrape identifier is not unique.

Click Submit.

Set Indexing Flag: Y = adds the report to batch indexing, N = no batch indexing
Processing Program Name: = RVDPFTASK
Separator Type: (choose 0 or 1 for PDF import) 0 = No Separation (creates one document for each PDF); 1 = Fixed Page Separation (separates each PDF by the defined separator page count)
Hexadecimal Page Separator Value: = (not used for PDF import)
Disable: = Check to disable this profile.
Import Images are Located Here: = The location of the PDFs to be imported. The server must have access to this folder location.
Import Scrape Profile to be Used: = Select an existing scrape profile from the drop down list. If you do not see your profile, verify the scrape profile is created and assigned to the correct system code. See Step 1 Create Scrape Profile for more information.
Route to This Profile: To route the PDF select an existing routing profile from the drop down list.
Index Constant Values: Set constant index value or specify a filename rule to be used for indexing.
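
To make the two Separator Type values concrete, the Python sketch below computes which pages would end up in each document. It is an illustration only: the function name split_pages is hypothetical, and the handling of a short final group (kept as its own document) is an assumption that this manual does not confirm.

```python
def split_pages(total_pages, separator_type, pages_per_document=1):
    """Return a list of (first_page, last_page) ranges, one per document.

    separator_type 0: no separation, the whole PDF is one document.
    separator_type 1: fixed page separation every pages_per_document pages.
    """
    if separator_type == 0:
        return [(1, total_pages)]
    if separator_type == 1:
        if pages_per_document < 1:
            raise ValueError("pages_per_document must be at least 1")
        return [
            (first, min(first + pages_per_document - 1, total_pages))
            for first in range(1, total_pages + 1, pages_per_document)
        ]
    raise ValueError("PDF import uses separator type 0 or 1")


whole = split_pages(7, 0)        # one document covering pages 1-7
chunks = split_pages(7, 1, 3)    # pages 1-3, 4-6, and the remainder 7
print(whole, chunks)
```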

3. Create Task Scheduler Job

Open the Task Scheduler on the Windows server.

Select Create Basic Task from the Actions pane on the right side of the screen.

Input a name and description for the task.

Select the trigger interval for when you want the job to run.

Set a date and time for the job to begin.

Set the action to Start a Program.

Set the proper parameters for the program location and arguments.

Program/script:
"C:\Program Files\PHP\v7.2\php.exe"
This path is considered custom because it is hard-coded by the customer. It should point to the location of php.exe on the customer’s server.

Add arguments (optional):
C:\inetpub\wwwroot\RVI\PHP\commands\import\task_import.php 1 1

Start in (optional):
C:\inetpub\wwwroot\RVI\PHP\commands\import
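
The three fields above combine into a single command. The Python sketch below shows how they fit together, which can be handy for running the import once by hand before scheduling it. It is not part of the product: the function run_import_once is hypothetical, the paths are the sample values from this page and must match your server, and the two arguments are copied as documented without any assumption about what they mean.

```python
import subprocess
from pathlib import PureWindowsPath

# Sample values from this page; adjust to the customer's server.
PHP_EXE = PureWindowsPath(r"C:\Program Files\PHP\v7.2\php.exe")
SCRIPT = PureWindowsPath(
    r"C:\inetpub\wwwroot\RVI\PHP\commands\import\task_import.php"
)
ARGS = ["1", "1"]  # copied from "Add arguments"; meaning not described here

# "Program/script" followed by "Add arguments (optional)".
command = [str(PHP_EXE), str(SCRIPT), *ARGS]

# "Start in (optional)" is the folder that contains the script.
working_dir = str(SCRIPT.parent)


def run_import_once():
    """Run the import a single time, the way the scheduled task would.
    Only meaningful on the Windows server itself."""
    return subprocess.run(command, cwd=working_dir, check=True)


print(command, working_dir)
```

Passing the command as a list avoids quoting problems with the space in "Program Files", which is why the Program/script field in Task Scheduler is wrapped in double quotes.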

Finalize the new task by clicking Finish.


Copyright 1992, 1999, Real Vision Software, Inc. (RVI), Alexandria, Louisiana. No part of this document may be reproduced, transmitted, transcribed, stored in a retrieval system, or translated into any language or computer language in any form or by any means without the express written consent of RVI Software, P.O. Box 12958, Alexandria, LA 71315-2958. However, copies of the One Look Manual can be reproduced, in whole or part, for use with the Real Vision Imaging System by Real Vision Software Licensee’s, for use with the imaging system.