My Little Blog: PDF Data Extraction

Sunday, November 28, 2010

PDF Data Extraction - ODI

--

In this post we will look at how to extract data from PDF documents using ODI.

This post will concentrate on extracting data from PDF form fields (FF).

For demo , I will be using W-4 PDF form.

There are few challenges -

1) To extract metadata or data from PDF documents we need a set of Java api's which can handle PDF's internal format to extract metadata and data.

Solution - There are tons of open source PDF Libraries in Java which can help us easily overcome this issue. List of Open Source PDF Libraries in Java can be found here.

This post uses iText. For using iText, download the latest jar file from here.
To make it accessible , add the jar file to your path ..\odi\oracledi\userlib either under your user or the agent you are using.

2) Represent PDF as a flat file and then form fields can be treated as columns. This will provide the flexibility to choose form fields of our choice.

Solution - Created an LKM (Load Knowledge Module) which will process all the PDF files in an directory and help us extra data from FF of our choice.
It can extract from different PDF datatypes like text fields, checkboxes ,radio buttons,etc.

So here are few simple steps to extract data form PDF files and load it into a database table.

Extracting Metadata to represent PDF document as a flat file:

The following ODI procedure extracts metadata and data from a single PDF file and writes it to an output file.

import os, sys
from com.itextpdf.text.pdf import PdfReader

# change the input variables as per your requirement
OutputFile="c:\\temp\\pdfs\\PDFMetaData.txt"
pdfFile="c:\\temp\\pdfs\\w4_1.pdf"
delimiter="|"
generateFieldNames="Y"
generateInternalFieldNames="Y"
generateFieldValues="Y"
columnNameFormat="COLUMN_"

##############################
###### start of main #########
##############################
pos=0
reader = PdfReader(pdfFile)

# calling iText api's to get formfields and formvalues
for ffname in reader.getAcroForm().getFields():
    ffvalue = str(reader.getAcroFields().getField(ffname.getName()))
    pos = pos+1
    if pos == 1:
       if generateFieldNames == "Y":
          colNames = columnNameFormat + str(pos) 
       if generateFieldValues == "Y":
          colValues =  ffvalue 
       if generateInternalFieldNames == "Y":
          internalNames =  ffname.getName()
    else:
       if generateFieldNames == "Y":
          colNames = colNames +  delimiter + columnNameFormat + str(pos)
       if generateFieldValues == "Y":
          colValues = colValues +  delimiter +  ffvalue  
       if generateInternalFieldNames == "Y":
          internalNames = internalNames + delimiter + ffname.getName()


filehandle= open(OutputFile,'w')
if generateFieldNames == "Y":
    filehandle.write(colNames + '\n')
if generateInternalFieldNames == "Y":
    filehandle.write( internalNames + '\n')
if generateFieldValues == "Y":
    filehandle.write(colValues + '\n')
filehandle.close()

reader.close()

The output file -

The output file generates three lines.

First Line - Dummy Column names with position prefix attached like COLUMN_1,COLUMN_2,COLUMN_3,etc

Second Line - Data f1_01(0) , f1_02(0),etc
These are internal representation of the form fields within the PDF file.
They are not meaningful and can have special characters. These can very difficult to represent it in ODI.

Third Line - It's the data within the form fields.

We will be using the first line as our column names to represent the PDF file.

Importing the file -

Changing column names to some meaningful names -

Note: It is important to have the position prefix , since we are not using internal representation of the PDF form field name as it can have special characters.

LKM which does its own magic -

LKM consists of few JDBC calls to drop and create the work tables. The important task of extracting the values from FF and inserting data into the work tables is done via Jython Task (Jython Code).

Jython task uses Java api's to extract values. All the other metadata regarding which columns to extract are feed via ODI substitution method. This task also generates a bad file(error information) for every unprocessed pdf.

The LKM can be downloaded from here.

Note: LKM above works with old iText API library. David Allan has uploaded the modified code which works latest iText API library and can be downloaded from here.

Interface to load data from PDF files to a table -

Applying the LKM to the Interface -

The Input PDF files -

Output after execution of Interface -

Extracting text and Parsing it from PDF's is also possible using TextStripper option availabe with Java API's.

The above example is one of the many alternatives available to extract data from PDF.

--