Monday 6 May 2013

How to Comparing two streams and generating differences in pentaho PDI

Suppose that you have two streams with the same structure and want to find out the differences in the data. Kettle has a step meant specifically for that purpose: the Merge Rows (diff) step.
In this recipe, you will see how it works.
Suppose that you have a file with information about the fastest roller coasters around the world. Now, you get an updated file and want to find out the differences between the files:
There can be new roller coasters in the list; maybe some roller coasters are no longer among the fastest. Besides, you were told that in the old file, there were some errors about the
location, country, and year information, so you are also interested in knowing if some of these
have changed.
Getting ready
For this recipe, you will need two files with information about roller coasters.

Both files have the same structure and look like the following:
Roller_Coaster|Speed|park|location|country|Year
Kingda Ka|128 mph|Six Flags Great Adventure|Jackson, New Jersey||2005
Top Thrill Dragster|120 mph|Cedar Point|Sandusky, Ohio||2003
Dodonpa|106.8 mph|Fuji-Q Highland|FujiYoshida-shi|Japan|2001
Steel Dragon 2000|95 mph|Nagashima Spa Land|Mie|Japan|2000
Millennium Force|93 mph|Cedar Point|Sandusky, Ohio||2000
...
For the There's more section, you will also need a database with the first file already loaded in
a table. You will find a script for creating and loading it also available for downloading.
How to do it...
Carry out the following steps:
1. Create a transformation.
2. Drag a Text file input step into the canvas and use it to read the file top_roller_
coasters.txt. As a separator, type |.
3. Do a preview to make sure that you are reading the file as expected.
4. Add a Sort rows step to sort the rows by roller_coaster and park.
5. Repeat the steps 2 to 4 to read the file named top_roller_coasters_updates.
txt and sort the rows also by roller_coaster and park.
6. From the Join category, add a Merge Rows (diff) step and use it to join both streams
as depicted in the following diagram:






7.
Double-click on the step you just added. In the Reference rows origin: select the
name of the step coming from the stream that reads the top_roller_coasters.
txt file.
8. In the Compare rows origin: select the name of the step coming from the stream that
reads the top_roller_coasters_updates.txt file.
9. As Flag fieldname, type flag.


10. Fill the Keys to match: and Values to compare: grids as shown in the following
screenshot:







No comments:

Post a Comment