Sunday, April 25, 2010

PDF Documents and OWB11GR2 ???




Note: For ODI PDF Post click here

Recently, I was asked a question how flexible is OWB11GR2 ?
Can it extract data from a PDF ?

Source shouldn't be a problem for OWB11GR2. But PDF's are different.

PDF documents have different flavours.
Some of the PDF's are just plain text (read/print only).
Some allow you to save data via formfields (key value pair) within them. Some (interactive forms) can even send data back to the server with click of a button.

In quest to extract data from PDF's , came across few Open Source Java PDF Libraies (API) which allow extraction\manipulation of PDF documents.

List of Open Source PDF Libraries in Java can be found here.

PDFBox and iText are the popular ones. I was able to use them with OWB11GR2 and extract data from PDF.


A quick Demo --------->

Task -> Extracting values from PDF's formfield
Note: Will be refering formfield's as FF from here.

1) Utilizing the power of OWB11GR2's architecture to integrate the Java API's:

Code templates (CT) are the crux of the new architecture.

Designed a Load Code Template LCT_PDF_TO_SQL to extract values from FF. LCT can extract from different PDF datatypes like text fields, checkboxes ,radio buttons,etc.



LCT consists of few JDBC calls to drop and create the work tables. The important task of extracting the values from FF and inserting data into the work tables is done via Jython Task (Jython Code).

Jython task uses Java api's to extract values. All the other metadata regarding which columns to extract are feed via ODI substitution method. This task also generates a bad file(error information) for every unprocessed pdf.

Note:
For using PDFBox, pdfbox-1.1.0.jar and commons-logging-1.1.1.jar are needed which can be downloaded from here.
For using iText, iText-5.0.2.jar is needed which can be downloaded from here.
To make it accessible , add the jar files to (OWB_HOME)/lib/ext and restart the control centre agent.

2) Representing PDF files within the OWB Client:

Text File is a nice option.
Text file provides location (Schema/directory). This can be used to point to a directory where PDF's are stored. The FF on the PDF will act as a column.


3) Sampling the PDF File:

For demo --> I will be using W-4 PDF form (most of us fill this at the start year).
The W-4 form can be downloaded from here.

Filled a W-4 form with random data.



Sampled the above PDF file via a jython script -->

-------------------------------------------------
Position --> Form Field(FF) --> Value
-------------------------------------------------
1 --> f1_01(0) --> 1
2 --> f1_02(0) --> 1
3 --> f1_03(0) --> 0
4 --> f1_04(0) --> 0
5 --> f1_05(0) --> 0
6 --> f1_06(0) --> 0
7 --> f1_07(0) --> 0
8 --> f1_08(0) --> 2
9 --> f1_09(0) --> BOB B
10 --> f1_10(0) --> BOB
11 --> f1_11(0) --> 000
12 --> f1_12(0) --> 00
13 --> f1_13(0) --> 0000
14 --> f1_14(0) --> 000 Cleveland ln
15 --> f1_15(0) --> Columbus OH - 43081
16 --> c1_01(0) -->
17 --> c1_01a(0) --> Yes
18 --> c1_01b(0) -->
19 --> c1_04(0) -->
20 --> f1_16(0) --> 7
21 --> f1_17(0) --> 500
22 --> f1_18(0) --> Exempt
23 --> f1_19(0) --> XYZ TEST CORPORATION
24 --> f1_20(0) --> XYZ
25 --> f1_21(0) --> AA
26 --> f1_22(0) --> AAAAAAA
27 --> f2_01(0) --> 12
28 --> f2_02(0) --> 12
29 --> f2_03(0) --> 13
30 --> f2_04(0) --> 124124
31 --> f2_05(0) --> 123
32 --> f2_06(0) --> 123
33 --> f2_07(0) --> 132
34 --> f2_08(0) --> 123
35 --> f2_09(0) --> 13
36 --> f2_10(0) --> 123
37 --> f2_11(0) --> 123
38 --> f2_12(0) --> 123
39 --> f2_13(0) --> 123123
40 --> f2_14(0) --> 123123
41 --> f2_15(0) --> 123123
42 --> f2_16(0) --> 1231
43 --> f2_17(0) --> 123
44 --> f2_18(0) --> 123
45 --> f2_19(0) --> 123

Tried with both PDFBox and iText and got the same result.

4) Handling the FF names:

The FF names are not meaningful (might be useful to the guys in payroll dept) and have special characters.
Checked few other pdf's and found that FF can have special characters like brackets, spaces,hypens,etc.

Columns with meaningful names makes more sense. Changing FF name in the PDF is not a good option.

To overcome this problem of special characters decided, to use position of the FF within the PDF rather than the name.
This will allow to choose FF's without worrying about its original name.

5) CT mapping design to extract data:

Created a Text file with meaningful column names suffixed with position number.



6) Execution view of the mapping with code template assigned:



7) Execution Result:




 Got the data.

8) Performance Testing:

Tested it against 101 pdf files. Jython task took around 3 seconds to parse it. Entire CT map took 10 seconds. Coooooooool.




Extracting text and Parsing it from PDF's is also possible using TextStripper option avaialbe with Java API's.

The above example is one of the many alternatives available to extract data from PDF. If you know any other way, comments are welcome.

It is easy to design , integrate and implement different things using such a powerful tool. Any team will love to have such a tool.

What next -> Generating PDF's files(W-2 ,1099-INT) on the fly.

No problem.. Bring it on :-)

Code Template can be downloaded from here. It uses iText api.

Friday, April 23, 2010

OMB commands for Code Template Mappings

--

Some useful OMB commands while working with CT mappings --

# Get execution units of a CT map
OMBRETRIEVE MAPPING 'MAP_NAME' GET EXECUTION_UNITS

# Create and delete execution units for a CT map
OMBALTER MAPPING 'MAP_NAME' ADD EXECUTION_UNIT 'EXECUTIONUNIT_NAME' OPERATORS ('OPERATORNAME_1','OPERATORNAME_2','..OPERATOR_NAME_N')

OMBALTER MAPPING 'MAP_NAME' DELETE EXECUTION_UNIT 'EXECUTIONUNIT_NAME'

# Create default execution units for a CT map
OMBALTER MAPPING 'MAP_NAME' SET DEFAULT_EXECUTION_UNIT


# Get and Set Code Template for an Execution unit on a CT map
OMBRETRIEVE MAPPING 'MAP_NAME' GET EXECUTION_UNIT 'EXECUTIONUNIT_NAME' \
GET REF CODE_TEMPLATE

OMBALTER MAPPING 'MAP_NAME' MODIFY EXECUTION_UNIT 'EXECUTIONUNIT_NAME' \
SET REF CODE_TEMPLATE '/PUBLIC_PROJECT/BUILT_IN_CT/DEFAULT_ORACLE_TARGET_CT'


# Get all Code Template options of an applied code template to an execution unit of
# a CT map
OMBRETRIEVE MAPPING 'MAP_NAME' GET EXECUTION_UNIT 'EXECUTIONUNIT_NAME' \
GET REF CODE_TEMPLATE OPTIONS

# Get & set code template option value of an applied code template to an execution
# unit of a CT map
OMBRETRIEVE MAPPING 'MAP_NAME' GET EXECUTION_UNIT 'EXECUTIONUNIT_NAME' \
GET REF CODE_TEMPLATE OPTIONS('OPTION_NAME')

OMBALTER MAPPING 'MAP_NAME' MODIFY EXECUTION_UNIT 'EXECUTIONUNIT_NAME' \
SET REF CODE_TEMPLATE OPTIONS('OPTION_NAME') VALUES ('NEW_VALUE')

--