Saturday 25 March 2017

How to process PDF files in Pentaho PDI (Kettle)

Process PDF files in Pentaho Kettle.
--------------------------------------------
Prerequisite:
A Java jar file that reads PDF files is required.
Download it from the location below:

https://drive.google.com/file/d/0B_j1hJxesvxdSVhVUXloVkJybUE/view?usp=sharing

How to read PDF files through Pentaho PDI (Kettle):

1. Use the Get PDF File Names step to get the names of the PDF files.
2. Use a wildcard (a regular expression such as .*.pdf) if you want to process multiple files.
3. Use Copy rows to result to put the PDF filenames and their locations into the stream
   so that they can be used as parameters in the next transformation.
4. Create variables and assign the PDF filename and its location to them (see the image below).

5. Use the Shell script step to execute the command below:
java -jar ${path}\..\pdfjar\pdf_parsing.jar ExtractText ${path}\${short_filename} ${path}\..\Output\text\${short_filename}.txt
You can use hard-coded paths as well.
Example:
java -jar D:\1_pdfparsing\pdfjar\pdf_parsing.jar ExtractText D:\1_pdfparsing\pdfFiles\abhi.pdf D:\1_pdfparsing\pdfFiles.txt

6. In PDI, use the Shell script step to execute the above command.
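To make the wildcard and the variable substitution easier to follow, here is a small Python sketch of what steps 2 and 5 amount to. It is only an illustration and not part of the PDI job: the function names select_pdfs and build_command are made up for this example, the wildcard is written with an escaped dot (.*\.pdf), and it assumes the folder layout implied by the command above (pdfjar and Output\text sitting next to the PDF folder). It only builds the command line; it does not run java.

```python
import ntpath  # Windows-style path handling, usable on any operating system
import re

# Wildcard from step 2. The dot is escaped here (an assumption of this sketch)
# so that it only matches a literal "." before the extension.
PDF_PATTERN = re.compile(r".*\.pdf")


def select_pdfs(filenames):
    """Return only the names that match the wildcard in full."""
    return [name for name in filenames if PDF_PATTERN.fullmatch(name)]


def build_command(path, short_filename):
    """Expand the step 5 command for one file.

    path and short_filename stand in for the ${path} and
    ${short_filename} variables created in step 4.
    """
    jar = ntpath.normpath(ntpath.join(path, "..", "pdfjar", "pdf_parsing.jar"))
    source = ntpath.join(path, short_filename)
    target = ntpath.normpath(
        ntpath.join(path, "..", "Output", "text", short_filename + ".txt")
    )
    return ["java", "-jar", jar, "ExtractText", source, target]


if __name__ == "__main__":
    names = ["abhi.pdf", "notes.txt", "scan.pdf.bak"]
    for name in select_pdfs(names):
        # Print the command that the Shell script step would execute.
        print(" ".join(build_command(r"D:\1_pdfparsing\pdfFiles", name)))
```

With path set to D:\1_pdfparsing\pdfFiles, only abhi.pdf is selected, and the ..\ parts resolve to D:\1_pdfparsing\pdfjar\pdf_parsing.jar for the jar and D:\1_pdfparsing\Output\text\abhi.pdf.txt for the output file.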

Download a working copy of the above example (PDI PDF process) from the link below:

https://drive.google.com/file/d/0B_j1hJxesvxdSVhVUXloVkJybUE/view?usp=sharing












Include all the steps in one Job.










5 comments:

  1. Thanks for sharing this information, very good!!

  2. I couldn't run the Shell script step.
    It gives the error ('java' is not recognized as an internal or external command)

    Please help me regarding this

    Thanks

  3. Please install Java and set JAVA_HOME.

  4. Hi. I have JAVA_HOME = C:\Program Files\Java\jdk1.8.0_191\jre, but still got 'java' is not...
    I'm on PDI version 6.1 with Java8.
    Do you have any solution?

    Thank you in advance

    Best regards

    Replies
    1. I have the same problem. I have JAVA_HOME configured, and when I run java -version I see the correct result, but the problem occurs when I run the job. What else do I need to configure?
