Pentaho Geek Zone : 2017

Process PDF files in Pentaho Kettle

--------------------------------------------

Prerequisite:

A Java JAR file that reads PDF files (pdf_parsing.jar, used in the command below) is required.

Download it from the location below:

https://drive.google.com/file/d/0B_j1hJxesvxdSVhVUXloVkJybUE/view?usp=sharing

How to read PDF files through Pentaho PDI (Kettle):

1. Use the Get PDF File Names step to collect the names of the PDF files.

2. Use a regular-expression wildcard if you want to process multiple files (.*\.pdf).

3. Use Copy rows to result to put each PDF filename and its location on the stream, so that they can be used as parameters in the next transformation.

4. Create variables and assign the PDF filename and its location to them; the command in step 5 refers to them as short_filename and path (refer to the image below).

5. Use a shell script step to execute the command below:

java -jar ${path}\..\pdfjar\pdf_parsing.jar ExtractText ${path}\${short_filename} ${path}\..\Output\text\${short_filename}.txt

You can use hard-coded paths as well, for example:

java -jar D:\1_pdfparsing\pdfjar\pdf_parsing.jar ExtractText D:\1_pdfparsing\pdfFiles\abhi.pdf D:\1_pdfparsing\pdfFiles.txt

6. In PDI, use the shell script step to execute the above command.
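
The variable substitution in step 5 can be sketched in Python, purely as an illustration. The sketch assumes that path holds the folder containing the PDF files and short_filename holds the file name, as the hard-coded example suggests. Python's string.Template happens to use the same ${name} syntax, and ntpath.normpath shows how the ..\ segments resolve under Windows path rules. The helper name expand_command is hypothetical and not part of PDI.

```python
import ntpath
from string import Template

# Command line from step 5, with the same ${variable} placeholders.
COMMAND = Template(
    r"java -jar ${path}\..\pdfjar\pdf_parsing.jar ExtractText "
    r"${path}\${short_filename} ${path}\..\Output\text\${short_filename}.txt"
)


def expand_command(path, short_filename):
    """Return the command line with both variables filled in (hypothetical helper)."""
    return COMMAND.substitute(path=path, short_filename=short_filename)


# Assumed values, taken from the hard-coded example.
pdf_folder = r"D:\1_pdfparsing\pdfFiles"
print(expand_command(pdf_folder, "abhi.pdf"))

# Windows path rules collapse "..", so the jar is expected one level above the PDF folder.
print(ntpath.normpath(pdf_folder + r"\..\pdfjar\pdf_parsing.jar"))
```

Under these assumptions the output file is named abhi.pdf.txt, because short_filename still carries the .pdf extension when .txt is appended.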

Download a working copy of the above example (PDI PDF process):

https://drive.google.com/file/d/0B_j1hJxesvxdSVhVUXloVkJybUE/view?usp=sharing

Include all the steps in one job.
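
Put together, the job amounts to a loop over the matching files. A rough Python sketch of that flow follows; it only builds the command lines and does not run them. It assumes the wildcard from step 2 is a regular expression matched against the whole file name. The helper plan_commands and the sample file names are hypothetical.

```python
import re


def plan_commands(path, file_names, wildcard=r".*\.pdf"):
    """Build one ExtractText command per matching file (hypothetical helper)."""
    pattern = re.compile(wildcard)
    commands = []
    for short_filename in file_names:
        # Mirrors the file-name step: keep only names the wildcard matches.
        if not pattern.fullmatch(short_filename):
            continue
        # Mirrors the shell script step: same argument order as the command in step 5.
        commands.append([
            "java", "-jar", path + r"\..\pdfjar\pdf_parsing.jar",
            "ExtractText",
            path + "\\" + short_filename,
            path + r"\..\Output\text" + "\\" + short_filename + ".txt",
        ])
    return commands


# Sample listing, made up for illustration.
listing = ["abhi.pdf", "notes.txt", "scan_pdf"]
for command in plan_commands(r"D:\1_pdfparsing\pdfFiles", listing):
    print(" ".join(command))
```

With the unescaped pattern .*.pdf the dot matches any character, so a name such as scan_pdf would also be selected. Escaping it as \. limits the match to names ending in .pdf. Each command list could then be passed to subprocess.run to execute it.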
