Sunday, April 25, 2010

PDF Documents and OWB11GR2 ???




Note: For ODI PDF Post click here

Recently, I was asked a question how flexible is OWB11GR2 ?
Can it extract data from a PDF ?

Source shouldn't be a problem for OWB11GR2. But PDF's are different.

PDF documents have different flavours.
Some of the PDF's are just plain text (read/print only).
Some allow you to save data via formfields (key value pair) within them. Some (interactive forms) can even send data back to the server with click of a button.

In quest to extract data from PDF's , came across few Open Source Java PDF Libraies (API) which allow extraction\manipulation of PDF documents.

List of Open Source PDF Libraries in Java can be found here.

PDFBox and iText are the popular ones. I was able to use them with OWB11GR2 and extract data from PDF.


A quick Demo --------->

Task -> Extracting values from PDF's formfield
Note: Will be refering formfield's as FF from here.

1) Utilizing the power of OWB11GR2's architecture to integrate the Java API's:

Code templates (CT) are the crux of the new architecture.

Designed a Load Code Template LCT_PDF_TO_SQL to extract values from FF. LCT can extract from different PDF datatypes like text fields, checkboxes ,radio buttons,etc.



LCT consists of few JDBC calls to drop and create the work tables. The important task of extracting the values from FF and inserting data into the work tables is done via Jython Task (Jython Code).

Jython task uses Java api's to extract values. All the other metadata regarding which columns to extract are feed via ODI substitution method. This task also generates a bad file(error information) for every unprocessed pdf.

Note:
For using PDFBox, pdfbox-1.1.0.jar and commons-logging-1.1.1.jar are needed which can be downloaded from here.
For using iText, iText-5.0.2.jar is needed which can be downloaded from here.
To make it accessible , add the jar files to (OWB_HOME)/lib/ext and restart the control centre agent.

2) Representing PDF files within the OWB Client:

Text File is a nice option.
Text file provides location (Schema/directory). This can be used to point to a directory where PDF's are stored. The FF on the PDF will act as a column.


3) Sampling the PDF File:

For demo --> I will be using W-4 PDF form (most of us fill this at the start year).
The W-4 form can be downloaded from here.

Filled a W-4 form with random data.



Sampled the above PDF file via a jython script -->

-------------------------------------------------
Position --> Form Field(FF) --> Value
-------------------------------------------------
1 --> f1_01(0) --> 1
2 --> f1_02(0) --> 1
3 --> f1_03(0) --> 0
4 --> f1_04(0) --> 0
5 --> f1_05(0) --> 0
6 --> f1_06(0) --> 0
7 --> f1_07(0) --> 0
8 --> f1_08(0) --> 2
9 --> f1_09(0) --> BOB B
10 --> f1_10(0) --> BOB
11 --> f1_11(0) --> 000
12 --> f1_12(0) --> 00
13 --> f1_13(0) --> 0000
14 --> f1_14(0) --> 000 Cleveland ln
15 --> f1_15(0) --> Columbus OH - 43081
16 --> c1_01(0) -->
17 --> c1_01a(0) --> Yes
18 --> c1_01b(0) -->
19 --> c1_04(0) -->
20 --> f1_16(0) --> 7
21 --> f1_17(0) --> 500
22 --> f1_18(0) --> Exempt
23 --> f1_19(0) --> XYZ TEST CORPORATION
24 --> f1_20(0) --> XYZ
25 --> f1_21(0) --> AA
26 --> f1_22(0) --> AAAAAAA
27 --> f2_01(0) --> 12
28 --> f2_02(0) --> 12
29 --> f2_03(0) --> 13
30 --> f2_04(0) --> 124124
31 --> f2_05(0) --> 123
32 --> f2_06(0) --> 123
33 --> f2_07(0) --> 132
34 --> f2_08(0) --> 123
35 --> f2_09(0) --> 13
36 --> f2_10(0) --> 123
37 --> f2_11(0) --> 123
38 --> f2_12(0) --> 123
39 --> f2_13(0) --> 123123
40 --> f2_14(0) --> 123123
41 --> f2_15(0) --> 123123
42 --> f2_16(0) --> 1231
43 --> f2_17(0) --> 123
44 --> f2_18(0) --> 123
45 --> f2_19(0) --> 123

Tried with both PDFBox and iText and got the same result.

4) Handling the FF names:

The FF names are not meaningful (might be useful to the guys in payroll dept) and have special characters.
Checked few other pdf's and found that FF can have special characters like brackets, spaces,hypens,etc.

Columns with meaningful names makes more sense. Changing FF name in the PDF is not a good option.

To overcome this problem of special characters decided, to use position of the FF within the PDF rather than the name.
This will allow to choose FF's without worrying about its original name.

5) CT mapping design to extract data:

Created a Text file with meaningful column names suffixed with position number.



6) Execution view of the mapping with code template assigned:



7) Execution Result:




 Got the data.

8) Performance Testing:

Tested it against 101 pdf files. Jython task took around 3 seconds to parse it. Entire CT map took 10 seconds. Coooooooool.




Extracting text and Parsing it from PDF's is also possible using TextStripper option avaialbe with Java API's.

The above example is one of the many alternatives available to extract data from PDF. If you know any other way, comments are welcome.

It is easy to design , integrate and implement different things using such a powerful tool. Any team will love to have such a tool.

What next -> Generating PDF's files(W-2 ,1099-INT) on the fly.

No problem.. Bring it on :-)

Code Template can be downloaded from here. It uses iText api.

13 comments:

  1. Good Article. Never thought this was possible.

    - Thomas

    ReplyDelete
  2. Nice post, can you post the code template?

    ReplyDelete
  3. I will upload the Code template and update the blog.

    -Suraj

    ReplyDelete
  4. interesting to see this, thanks to Carnot ( oracle product ) to lead me here
    ..herrick

    ReplyDelete
  5. Surprised after seeing this.
    Please update with Code Template, that will be great.

    ReplyDelete
  6. Thanks.

    Is this possible using ODI ?

    -Mike

    ReplyDelete
  7. Mike,

    Architecture is the same, so it should work with ODI as well. I will try to create a LKM when i get time.

    -Suraj

    ReplyDelete
  8. I can't believe it. Time for upgrade.

    Thanks, Jin

    ReplyDelete
  9. Suraj,

    Friend of mine referred to your blog. Nice blog. Do you have a Code Template for extracting text from PDF ?

    Vishal.

    ReplyDelete
  10. Vishal,

    I don't have CT for extracting text from PDF. But you can modify the existing one to use the PDFBox Text stripper option

    http://www.pdfbox.org/userguide/text_extraction.html

    You have to write code around it to parse text and use it the way you want.

    Another option -
    If you are on linux box , the new ones come with python installed and you can add pyPDF library. Its worth trying with pyPDF library. In my opinion Python should be faster than Java atleast for text parsing.

    You can post on OWB/ODI forum for more alternatives. There is more experience.

    -Suraj.

    ReplyDelete
  11. Thanks Suraj.

    -Vishal.

    ReplyDelete
  12. Interesting ..
    Can it handle password encrypted pdf files ?
    What if there are 200K files , will it be fast enough ?

    Daniel

    ReplyDelete
  13. Daniel,

    Password encrypted files can be easily handled as the iText or PDFBox already support it. Regarding performance , 200K files is not an issue template can handle it. Performance can certainly be increased by customzing the template to extract data from pdf files in parallel using threading.

    -Suraj

    ReplyDelete