Saturday 25 March 2017

How to process PDF files in Pentaho PDI (Kettle)

Process PDF files in Pentaho Kettle.
--------------------------------------------
Prerequisite:
A PDF reader Java JAR file is required.
Please download it from the location below:

https://drive.google.com/file/d/0B_j1hJxesvxdSVhVUXloVkJybUE/view?usp=sharing

How to read PDF files through Pentaho PDI (Kettle):

1. Use the Get File Names step to collect the names of the PDF files.
2. Use a wildcard (a regular expression such as .*\.pdf) if you want to process multiple files.
3. Use the Copy rows to result step to pass each PDF file name and its location to the stream,
   so that they can be used as parameters in the next transformation.
4. Create variables and assign the PDF file name and its location to them (refer to the image below).

5. Use the Shell script step to execute the command below:
java -jar ${path}\..\pdfjar\pdf_parsing.jar ExtractText ${path}\${short_filename} ${path}\..\Output\text\${short_filename}.txt
You can use hard-coded paths as well.
Example:
java -jar D:\1_pdfparsing\pdfjar\pdf_parsing.jar ExtractText D:\1_pdfparsing\pdfFiles\abhi.pdf D:\1_pdfparsing\pdfFiles.txt

6. In PDI, use the Shell script step to execute the command above.
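
To see what the Shell script step ends up running, the variable substitution in the command can be sketched outside PDI. The Python below is only an illustration and not part of the transformation: the function name build_command is made up for this post, and it assumes ${path} holds the folder of the PDF and ${short_filename} the file name including its extension. It uses ntpath so the Windows-style paths behave the same on any machine, and it resolves the ".." parts only to make the result easier to read.

```python
import ntpath  # Windows path rules, usable on any operating system


def build_command(path, short_filename):
    """Return the java command as an argument list (hypothetical helper)."""
    # ${path}\..\pdfjar\pdf_parsing.jar
    jar = ntpath.normpath(ntpath.join(path, "..", "pdfjar", "pdf_parsing.jar"))
    # ${path}\${short_filename}
    source = ntpath.join(path, short_filename)
    # ${path}\..\Output\text\${short_filename}.txt
    target = ntpath.normpath(
        ntpath.join(path, "..", "Output", "text", short_filename + ".txt")
    )
    return ["java", "-jar", jar, "ExtractText", source, target]


if __name__ == "__main__":
    # Sample values taken from the hard-coded example in this post.
    cmd = build_command(r"D:\1_pdfparsing\pdfFiles", "abhi.pdf")
    print(" ".join(cmd))
```

Because short_filename keeps its extension, the text file produced by the variable form of the command is named abhi.pdf.txt.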

Download a working copy of the above example (PDI PDF process) here:

https://drive.google.com/file/d/0B_j1hJxesvxdSVhVUXloVkJybUE/view?usp=sharing








Include all the steps in one job.
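
As a rough picture of what the job does from end to end, the sketch below mimics the file selection and the row-per-file hand-off in plain Python. It is not PDI code: select_pdf_rows is a made-up name, the folder and file names are sample values, and it assumes the wildcard is a regular expression matched against the whole file name.

```python
import re


def select_pdf_rows(path, filenames, wildcard=r".*\.pdf"):
    """Return one (path, short_filename) row per matching file (hypothetical helper)."""
    pattern = re.compile(wildcard)
    # fullmatch: the whole file name has to fit the pattern, not just a part of it.
    return [(path, name) for name in filenames if pattern.fullmatch(name)]


if __name__ == "__main__":
    names = ["abhi.pdf", "notes.txt", "report_pdf"]
    for row in select_pdf_rows(r"D:\1_pdfparsing\pdfFiles", names):
        # Each row stands in for one run of the shell command.
        print(row)
```

With the unescaped pattern .*.pdf the name report_pdf would be selected as well, because an unescaped dot matches any character.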




