Process PDF files in Pentaho Kettle.
--------------------------------------------
Prerequisite:
A PDF reader Java JAR file (the pdf_parsing.jar used in the commands below) is required.
Download it from the location below:
https://drive.google.com/file/d/0B_j1hJxesvxdSVhVUXloVkJybUE/view?usp=sharing
How to read PDF files through Pentaho PDI (Kettle):
1. Use the Get PDF File Names step to collect the names of the PDF files.
2. Use a wildcard (a regular expression such as .*\.pdf) if you want to process multiple files.
3. Use Copy rows to result to put each PDF file name and its location on the stream
so that they can be used as parameters in the next transformation.
4. Create variables and assign the PDF file name and its location to them (refer to the image below).
5. Use the Shell script step to execute the command below:
java -jar ${path}\..\pdfjar\pdf_parsing.jar ExtractText ${path}\${short_filename} ${path}\..\Output\text\${short_filename}.txt
You can use hard-coded paths as well, for example:
java -jar D:\1_pdfparsing\pdfjar\pdf_parsing.jar ExtractText D:\1_pdfparsing\pdfFiles\abhi.pdf D:\1_pdfparsing\pdfFiles.txt
6. In PDI, use the Shell script step to execute the above command.
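To see what the variable substitution in step 5 produces, here is a minimal dry-run sketch. It only prints the command for each matching file and does not run java. It assumes a POSIX shell and forward-slash paths (the commands above use Windows paths). The function names build_pdf_command and list_pdf_commands and the variables pdf_dir and pdf_name are hypothetical; pdf_dir stands in for ${path} and pdf_name for ${short_filename}.

```shell
#!/bin/sh
# Dry-run sketch: prints the extraction command per PDF, nothing is executed.
# Assumption: POSIX shell with forward-slash paths, not the Windows paths above.

# Build the command for one file.
#   $1 = folder that holds the PDF (plays the role of ${path})
#   $2 = file name without folder (plays the role of ${short_filename})
build_pdf_command() {
    pdf_dir=$1
    pdf_name=$2
    printf 'java -jar "%s/../pdfjar/pdf_parsing.jar" ExtractText "%s/%s" "%s/../Output/text/%s.txt"\n' \
        "$pdf_dir" "$pdf_dir" "$pdf_name" "$pdf_dir" "$pdf_name"
}

# Print one command for every candidate name that matches .*\.pdf
#   $1 = folder, remaining arguments = candidate file names
list_pdf_commands() {
    pdf_dir=$1
    shift
    for candidate in "$@"; do
        # The pattern is anchored here to approximate a whole-name match.
        if printf '%s\n' "$candidate" | grep -Eq '^.*\.pdf$'; then
            build_pdf_command "$pdf_dir" "$candidate"
        fi
    done
}
```

For example, list_pdf_commands /data/pdfFiles abhi.pdf notes.txt prints a single command for abhi.pdf and skips notes.txt. The folder /data/pdfFiles is only an illustration. As in the command above, the output file keeps the .pdf part of the name and gets .txt appended.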
Download a working copy of the above example (PDI PDF process):
https://drive.google.com/file/d/0B_j1hJxesvxdSVhVUXloVkJybUE/view?usp=sharing
Include all of the above steps in one PDI job.