Process PDF files in Pentaho Kettle
--------------------------------------------
Prerequisite:
A PDF reader jar file (pdf_parsing.jar) is required, plus an installed Java runtime to run it.
Download the jar from the location below:
https://drive.google.com/file/d/0B_j1hJxesvxdSVhVUXloVkJybUE/view?usp=sharing
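The jar is started with java -jar, so the java executable has to be reachable on the PATH of the process that launches it. On Windows, setting JAVA_HOME alone does not make the java command resolvable. Below is a minimal Python sketch for checking this outside PDI. The helper names find_java and java_bin_dir are hypothetical, and the sketch only reports what it finds without changing any settings.

```python
import ntpath
import os
import shutil


def find_java(env=None):
    """Return the full path of the java executable found on PATH, or None."""
    env = os.environ if env is None else env
    return shutil.which("java", path=env.get("PATH", ""))


def java_bin_dir(java_home):
    """Return the bin folder under a Windows-style JAVA_HOME (hypothetical helper)."""
    return ntpath.join(java_home, "bin")


if __name__ == "__main__":
    java = find_java()
    if java:
        print("java found at", java)
    else:
        print("java is not on PATH")
        home = os.environ.get("JAVA_HOME", "")
        if home:
            # Assumption: java.exe lives in the bin folder under JAVA_HOME.
            print("try adding this folder to PATH:", java_bin_dir(home))
```

If the script reports that java is not on PATH, adding the printed bin folder to PATH and restarting PDI is the usual next thing to try.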
How to read PDF files through Pentaho PDI (Kettle):
1. Use the Get PDF File Names step to get the names of the PDF files.
2. Use a wildcard if you want to process multiple files (.*.pdf). The wildcard is a regular expression, so .*\.pdf, with the dot escaped, is the stricter form.
3. Use Copy rows to result to put each PDF filename and its location on the stream
so that they can be used as parameters in the next transformation.
4. Create variables and assign the PDF filename and its location to them (path and short_filename in the command below; refer to the image below).
5. Use the Shell script step to execute the command below:
java -jar ${path}\..\pdfjar\pdf_parsing.jar ExtractText ${path}\${short_filename} ${path}\..\Output\text\${short_filename}.txt
You can use hard-coded paths as well.
Example:
java -jar D:\1_pdfparsing\pdfjar\pdf_parsing.jar ExtractText D:\1_pdfparsing\pdfFiles\abhi.pdf D:\1_pdfparsing\pdfFiles.txt
6. In PDI, use the Shell script step to execute the above command.
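To see what the command in step 5 looks like once the variables are filled in, here is a small Python sketch. It is only an illustration of the substitution, not how PDI does it internally. Python's string.Template happens to use the same ${name} syntax, and the function name expand_command is hypothetical. The sample values come from the hard-coded example above.

```python
import ntpath
from string import Template

# Same text as the Shell script command; ${path} and ${short_filename}
# are the two variables created in step 4.
COMMAND = Template(
    r"java -jar ${path}\..\pdfjar\pdf_parsing.jar ExtractText "
    r"${path}\${short_filename} ${path}\..\Output\text\${short_filename}.txt"
)


def expand_command(path, short_filename):
    """Substitute the two variables into the command template."""
    return COMMAND.substitute(path=path, short_filename=short_filename)


if __name__ == "__main__":
    print(expand_command(r"D:\1_pdfparsing\pdfFiles", "abhi.pdf"))
    # The "\.." segment climbs one folder, so the jar is looked up
    # in a pdfjar folder next to the folder holding the PDF files.
    print(ntpath.normpath(r"D:\1_pdfparsing\pdfFiles\..\pdfjar\pdf_parsing.jar"))
```

Note that the output file keeps the .pdf part of the name (abhi.pdf.txt), and that a folder name containing spaces would need the paths quoted in the real command.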
Download a working copy of the above example (PDI PDF process) from:
https://drive.google.com/file/d/0B_j1hJxesvxdSVhVUXloVkJybUE/view?usp=sharing
Include all the steps in one job.
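As a rough picture of what the job passes from the file-name step to the later steps, the Python sketch below filters a list of file names with the wildcard and produces one row of variables per PDF. It assumes the wildcard regular expression is matched against the whole file name; the function name rows_for_job and the sample file names are hypothetical.

```python
import re

# Stricter form of the wildcard: the dot before "pdf" is escaped.
WILDCARD = re.compile(r".*\.pdf")


def rows_for_job(folder, names):
    """Return one {path, short_filename} row per name matching the wildcard."""
    return [
        {"path": folder, "short_filename": name}
        for name in names
        if WILDCARD.fullmatch(name)
    ]


if __name__ == "__main__":
    names = ["abhi.pdf", "readme.txt", "reportxpdf"]
    for row in rows_for_job(r"D:\1_pdfparsing\pdfFiles", names):
        print(row)
    # The unescaped form .*.pdf would also accept "reportxpdf".
    print(bool(re.fullmatch(r".*.pdf", "reportxpdf")))
```

Each row corresponds to one run of the shell command, with path and short_filename set from that row.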
Thanks for sharing this information, very good!!
Couldn't run the shell script step.
It gives the error: 'java' is not recognized as an internal or external command
Please help me with this.
Thanks
Please install Java and set JAVA_HOME.
Hi. I have JAVA_HOME = C:\Program Files\Java\jdk1.8.0_191\jre, but still got 'java' is not...
I'm on PDI version 6.1 with Java 8.
Do you have any solution?
Thank you in advance
Best regards
I have the same problem. I have JAVA_HOME configured, and when I run java -version I see the correct result, but the problem appears when I run the job. What else do I need to configure?